ThomasCederborg comments on A problem shared by many different alignment targets

ThomasCederborg 19 Jan 2025 4:41 UTC
3 points
0
Consider Bob, who takes morality very seriously. Bob thinks that any scenario where the fate of the world is determined by an unethical AI, is worse than the scenario with no AI. Bob sticks with this moral position, regardless of how much stuff Bob would get in a scenario with an unethical AI. For a mind as powerful as an AI, Bob considers it to be a moral imperative to ensure that heretics do not escape punishment. If a group contains at least one person like Bob (and at least one person that would strongly object to being punished), then the set of Pareto-improvements is empty. In a population of billions, there will always exist at least some people with Bob’s type of morality (and plenty of people that would strongly object to being punished). Which in turn means that for humanity, there exist no powerful AI, such that creating this AI would be a Pareto-improvement.
- Martin Randall 19 Jan 2025 16:38 UTC
  3 points
  0
  Parent
  The AI could deconstruct itself after creating twenty cakes, so then there is no unethical AI, but presumably Bob’s preferences refer to world-histories, not final-states.
  
  However, CEV is based on Bob’s extrapolated volition, and it seems like Bob would not maintain these preferences under extrapolation:
  - In the status quo, heretics are already unpunished—they each have one cake and no torture—so objecting to a non-torturing AI doesn’t make sense on that basis.
  - If there were no heretics, then Bob would not object to a non-torturing AI, so Bob’s preference against a non-torturing AI is an instrumental preference, not a fundamental preference.
  - Bob would be willing for a no-op AI to exist, in exchange for some amount of heretic-torture. So Bob can’t have an infinite preference against all non-torturing AIs.
  - Heresy may not have meaning in the extrapolated setting where everyone knows the true cosmology (whatever that is)
  - Bob tolerates the existence of other trade that improves the lives of both fanatics and heretics, so it’s unclear why the trade of creating an AI would be intolerable.
  The extrapolation of preferences could significantly reduce the moral variation in a population of billions. My different moral choices to others appear to be based largely on my experiences, including knowledge, analysis, and reflection. Those differences are extrapolated away. What is left is influences from my genetic priors and from the order I obtained knowledge. I’m not even proposing that extrapolation must cause Bob to stop valuing heretic-torture.
  
  If the extrapolation of preferences doesn’t cause Bob to stop valuing the existence of a non-torturing AI at negative infinity, I think that is fatal to all forms of CEV. The important thing then is to fail gracefully without creating a torture-AI.
  - ThomasCederborg 20 Jan 2025 3:12 UTC
    3 points
    0
    Parent
    Bob really does not want the fate of the world to be determined by an unethical AI. There is no reason for such a position to be instrumental. For Bob, this would be worse than the scenario with no AI (in the Davidad proposal, this is the baseline that is used to determine whether or not something is a Pareto-improvement). Both scenarios contain non-punished heretics. But only one scenario contains an unethical AI. Bob prefers the scenario without an unethical AI (for non-instrumental reasons).
    Regarding extrapolation:
    The question is whether or not at least one person will continue to view a non-punishing AI as unethical after extrapolation. (When determining whether or not something is a Pareto-improvement, the average fanatic is not necessarily relevant).
    Many people would indeed presumably change their minds regarding the morality of at least some things (for example when learning new facts). For the set of Pareto-improvements to be empty however, you only need two people: a single fanatic and a single heretic.
    In other words: for the set to be empty it is enough that a single person continues to view a single other person (that we can call Dave), as being deserving of punishment (in the sense that an AI has a moral obligation to punish Dave). The only missing component is then that Dave has to object strongly to being punished for being a heretic (this objection can actually also be entirely based on moral principles). Just two people out of billions need to take these moral positions for the set to be empty. And the building blocks that make up Bob’s morality are not actually particularly rare.
    The first building block of Bob’s morality is that of a moral imperative (the AI is seen as unethical for failing to fulfill its moral obligation to punish heretics). In other words: if someone finds themselves in a particular situation, then they are viewed as having a moral obligation to act in a certain way. Moral instincts along the lines of moral imperatives are fairly common. A trained firefighter might be seen as having important moral obligations if encountering a burning building with people inside. An armed police officer might be seen as having important moral obligations if encountering an active shooter. Similarly for soldiers, doctors, etc. Failing to fulfill an important moral obligation is fairly commonly seen as very bad.
    Let’s take Allan, who witnesses a crime being committed by Gregg. If the crime is very serious, and calling the police is risk free for Allan, then failing to call the police can be seen as a very serious moral outrage. If Allan does not fulfill this moral obligation, it would not be particularly unusual for someone to view Allan as deeply unethical. This general form of moral outrage is not rare. Not every form of morality includes contingent moral imperatives. But moralities that do include such imperatives are fairly common. There is obviously a lot of disagreements regarding who has what moral obligations. Just as there are disagreements regarding what should count as a crime. But the general moral instinct (that someone like Allan can be deeply unethical) is not exotic or strange.
    The obligation to punish bad people is also not particularly rare. Considering someone to be unethical because they get along with a bad person is not an exotic or rare type of moral instinct. It is not universal. But it is very common.
    And the specific moral position that heretics deserve to burn in hell is actually quite commonly expressed. We can argue about what percentage of people saying this actually means it. But surely we can agree that there exist at least some people that actually mean what they say.
    The final building block in Bob’s morality is objecting to having the fate of the world be determined by someone unethical. I don’t think that this is a particularly unusual thing to object to (on entirely non-instrumental grounds). Many people care deeply about how a given outcome is achieved.
    Some people that express positions along the lines of Bob might indeed back down if things get real. I think that for some people, survival instinct would in fact override any moral outrage. Especially if the non-AI scenario is really bad. Some fanatics would surely blink when coming face to face with any real danger. (And some people will probably abandon their entire moral framework in a heartbeat, the second someone offers them a really nice cake). But for at least some people, morality is genuinely important. And you only need one person like Bob, out of billions, for the set to be empty.
    So, if Bob is deeply attached to his moral framework. And the moral obligation to punish heretics is a core aspect of his morality. And this aspect of his morality is entirely built from ordinary and common types of moral instincts. Then an extrapolated version of Bob would only accept a non-punishing AI, if this extrapolation method has completely rewritten Bob’s entire moral framework (in ways that Bob would find horrific).
    - Martin Randall 25 Jan 2025 2:33 UTC
      3 points
      0
      Parent
      Summarizing Bob’s beliefs:
      
      Dave, who does not desire punishment, deserves punishment.
      Everyone is morally required to punish anyone who deserves punishment, if possible.
      Anyone who does not fulfill all moral requirements is unethical.
      It is morally forbidden to create an unethical agent that determines the fate of the world.
      There is no amount of goodness that can compensate for a single morally forbidden act.
      
      I think it’s possible (20%) that such blockers mean that there are no Pareto improvements. That’s enough by itself to motivate further research on alignment targets, aside from other reasons one might not like Pareto PCEV.
      
      However, three things make me think this is unlikely. Note that my (%) credences aren’t very stable or precise.
      
      Firstly, I think there is a chance (20%) that these beliefs don’t survive extrapolation, for example due to moral realism or coherence arguments. I agree that this means that Bob might find his extrapolated beliefs horrific. This is a risk with all CEV proposals.
      
      Secondly, I expect (50%) there are possible Pareto improvements that don’t go against these beliefs. For example, the PCEV could vote to create an AI that is unable to punish Dave and thus not morally required to punish Dave. Alternatively, instead of creating a Sovereign AI that determines the fate of the world, the PCEV could vote to create many human-level AIs that each improve the world without determining its fate.
      
      Thirdly, I expect (80%) some galaxy-brained solution to be implemented by the parliament of extrapolated minds who know everything and have reflected on it for eternity.
      - ThomasCederborg 30 Jan 2025 9:16 UTC
        3 points
        0
        Parent
        I’m sorry if the list below looks like nitpicking. But I really do think that these distinctions are important.
        Bob holds 1 as a value. Not as a belief.
        Bob does not hold 2 as a belief or as a value. Bob thinks that someone as powerful as the AI has an obligation to punish someone like Dave. But that is not the same as 2.
        Bob does not hold 3 as a belief or as a value. Bob thinks that for someone as powerful as the AI, the specific moral outrage in question renders the AI unethical. But that is not the same as 3.
        Bob does hold 4 as a value. But it is worth noting that 4 does not describe anything load-bearing. The thought experiment would still work even if Bob did not think that the act of creating an unethical agent that determines the fate of the world is morally forbidden. The load-bearing part is that Bob really does not want the fate of the world to be determined by an unethical AI (and thus prefers the scenario where this does not happen).
        Bob does not hold 5 as a belief or as a value. Bob prefers a scenario without an AI, to a scenario where the fate of the world was determined by an unethical AI. But that is not the same as 5. The description I gave of Bob does not in any way conflict with Bob thinking that most morally forbidden acts can be compensated for by expressing sincere regret at some later point in time. The description of Bob would even be consistent with Bob thinking that almost all morally forbidden acts can be compensated for by writing a big enough check. He just thinks that the specific moral outrage in question, directly means that the AI committing it is unethical. In other words: other actions are simply not taken into consideration, when going from this specific moral outrage, to the classification of the AI as unethical. (He also thinks that a scenario where the fate of the world is determined by an unethical AI is really bad. This opinion is also not taking any other aspects of the scenario into account. Perhaps this is what you were getting at with point 5).
        I insist on these distinctions because the moral framework that I was trying to describe, is importantly different from what is described by these points. The general type of moral sentiment that I was trying to describe is actually a very common, and also a very simple, type of moral sentiment. In other words: Bob’s morality is (i): far more common, (ii): far simpler, and (iii): far more stable, compared to the morality described by these points. Bob’s general type of moral sentiment can be described as: a specific moral outrage renders the person committing it unethical in a direct way. Not in a secondary way (meaning that there is for example no summing of any kind going on. There is no sense in which the moral outrage in question is in any way compared to any other set of actions. There is no sense in which any other action plays any part whatsoever when Bob classifies the AI as unethical).
        In yet other words: the link from this specific moral outrage to classification as unethical is direct. The AI doing nice things later is thus simply not related in any way to this classification. Plenty of classifications work like this. Allan will remain a murderer, no matter what he does after committing a murder. John will remain a military veteran, no matter what he does after his military service. Jeff will remain an Olympic gold winner, no matter what he does after winning that medal. Just as for Allan, John, and Jeff, the classification used to determine that the AI is unethical is simply not taking other actions into account.
        The classification is also not the result of any real chain of reasoning. There is no sense in which Bob first concludes that the moral outrage in question should be classified as morally forbidden, followed by Bob then deciding to adhere to a rule which states that all morally forbidden things should lead to the unethical classification (and Bob has no such a rule).
        This general type of moral sentiment is not universal. But it is quite common. Lots of people can think of at least one specific moral outrage that leads directly to them viewing a person committing it as unethical (at least when committed deliberately by a grownup that is informed, sober, mentally stable, etc). In other words: lots of people would be able to identify at least one specific moral outrage (perhaps out of a very large set of other moral outrages). And say that this specific moral outrage directly implies that the person is unethical. Different people obviously do not agree on which subset of all moral outrages should be treated like this (even people that agree on what should count as a moral outrage can feel differently about this). But the general sentiment where some specific moral outrage simply means that the person committing it is unethical is common.
        The main reason that I insist on the distinction is that this type of sentiment would be far more stable under reflection. There are no moving parts. There are no conditionals or calculations. Just a single, viscerally felt, implication. Attached directly to a specific moral outrage. For Bob, the specific moral outrage in question is a failure to adhere to the moral imperative to punish people like Dave.
        Strong objections to the fate of the world being determined by someone unethical are not universal. But this is neither complex nor particularly rare. Let’s add some details to make Bob’s values a bit easier to visualise. Bob has a concept that we can call a Dark Future. It is basically referring to scenarios where Bad People win The Power Struggle and manage to get enough power to choose the path of humanity (powerful anxieties along these lines seem quite common. And for a given individual it would not be at all surprising if something along these lines eventually turn into a deeply rooted, simple, and stable, intrinsic value).
        A scenario where the fate of the world is determined by an unethical AI is classified as a Dark Future (again in a direct way). For Bob, the case with no AI does not classify as a Dark Future. And Bob would really like to avoid a Dark Future. People who thinks that it is more important to prevent bad people from winning than to prevent the world from burning might not be very common. But there is nothing complex or incoherent about this position. And the general type of sentiment (that it matters a lot who gets to determine the fate of the word) seems to be very common. Not wanting Them to win can obviously be entirely instrumental. An intrinsic value might also be overpowered by survival instinct when things get real. But there is nothing surprising about something like this eventually solidifying into a deeply held intrinsic value. Bob does sound unusually bitter and inflexible. But there only needs to be one person like Bob in a population of billions.
        To summarise: a non punishing AI is directly classified as unethical. Additional details are simply not related in any way to this classification. A trajectory where an unethical AI determines the fate of humanity is classified as a Dark Future (again in a direct way). Bob finds a Dark Future to be worse than the no AI scenario. If someone were to specifically ask him, Bob might say that he would rather see the world to burn than see Them win. But if left alone to think about this, the world burning in the non-AI scenario is simply not the type of thing that is relevant to the choice (when the alternative is a Dark Future).
        Regarding the probability that extrapolation will change Bob:
        First I just want to again emphasise that the question is not if extrapolation will change one specific individual named Bob. The question is whether or not extrapolation will change everyone with these types of values. Some people might indeed change due to extrapolation.
        My main issue with the point about moral realism is that I don’t see why it would change anything (even if we only consider one specific individual, and also assume moral realism). I don’t see why discovering that The Objectively Correct Morality disagrees with Bob’s values would change anything (I strongly doubt that this sentence means anything. But for the rest of this paragraph I will reason from the assumption that it both does mean something, and that it is true). Unless Bob has very strong meta preferences related to this, the only difference would presumably be to rephrase everything in the terminology of Bob’s values. For example: extrapolated Bob would then really not want the fate of the world to be determined by an AI that is in strong conflict with Bob’s values (not punishing Dave directly implies a strong value conflict. The fate of the world being determined by someone with a strong value conflict directly implies a Dark Future. And nothing has changed regarding Bob’s attitude towards a Dark Future). As long as this is stronger than any meta preferences Bob might have regarding The Objectively Correct Morality, nothing important changes (Bob might end up needing a new word for someone that is in strong conflict with Bob’s values. But I don’t see why this would change Bob’s opinion regarding the relative desirability of a scenario that contains a non-punishing AI, compared to the scenario where there is no AI).
        I’m not sure what role coherence arguments would play here.
        Regarding successor AIs:
        It is the AI creating these successor AIs that is the problem for Bob (not the successor AIs themselves). The act of creating a successor AI that is unable to punish is morally equivalent to not punishing. It does not change anything. Similarly: the act of creating a lot of human level AIs is in itself determining the fate of the world (even if these successor AIs do not have the ability to determine the fate of the world).
        Regarding the last paragraph that talks about finding a clever solution:
        I’m not sure I understand this paragraph. I agree that if the set is not empty, then a clever AI will presumably find an action that is a Pareto Improvement. I am not saying that there exists an action that is a Pareto Improvement, but that this action is difficult to find. I am saying that at least one person will demand X and that at least one person will refuse X. Which means that a clever AI will just use its cleverness to confirm that the set is indeed empty.
        I’m not sure that the following is actually responding to something that you are saying (since I don’t know if I understand what you mean). But it seems relevant to point out that the Pareto constraint is part of the AIs goal definition. Which in turn means that before determining the members of the set of Pareto Improvements, there is no sense in which there exists a clever AI that is trying to make things work out well. In other words: there does not exist any clever AI, that has the goal of making the set non-empty. No one has, for example, an incentive to tweak the extrapolation definitions to make the set non-empty.
        Also: in the proposal in question, extrapolated delegates are presented with a set. Their role is then supposed to be to negotiate about actions in this set. I am saying that they will be presented with an empty set (produced by an AI that has no motivation to bend rules to make this set non-empty). If various coalitions of delegates are able to expand this set with clever tricks, then this would be a very different proposal (or a failure to implement the proposal in question). This alternative proposal would for example lack the protections for individuals, that the Pareto constraint is supposed to provide. Because the delegates of various types of fanatics could then also use clever tricks to expand the set of actions under consideration. The delegates of various factions of fanatics could then find clever ways of adding various ways of punishing heretics into the set of actions that are on the table during negotiations (which brings us back to the horrors implied by PCEV). Successful implementation of Pareto PCEV implies that the delegates are forced to abide by the various rules governing their negotiations (similar to how successful implementation of classical PCEV implies that the delegates have successfully been kept in the dark regarding how votes are actually settled).
        A few tangents:
        This last section is not a direct response to anything that you wrote. In particular, the points below are not meant as arguments against things that you have been advocating for. I just thought that this would be a good place to make a few points, that are related to the general topics that we are discussing in this thread (there is no post dedicated to Pareto PCEV, so this is a reasonable place to elaborate on some points related specifically to PPCEV).
        I think that if one only takes into account the opinions of a group that is small enough for a Pareto Improvement to exist, then the outcome would be completely dominated by people that are sort of like Bob, but that are just barely possible to bribe (for the same reason that PCEV is dominated by such people). The bribe would not primarily be about resources, but about what conditions various people should live under. I think that such an outcome would be worse than extinction from the perspective of many people that are not part of the group being taken into consideration (including from the perspective of people like Bob. But also from the perspective of people like Dave). And it would just barely be better than extinction for many in that group.
        I similarly think that if one takes the full population, but bend the rules until one gets a non-empty set of things that sort of looks close to Pareto Improvements, then the outcome will also be dominated by people like Bob (for the same reason that PCEV is dominated by people like Bob). Which in turn implies a worse-than-extinction outcome (in expectation, from the perspective of most individuals).
        In other words: I think that if one goes looking for coherent proposals that are sort of adjacent to this idea, then one would tend to find proposals that implies very bad outcomes. For the same reasons that proposals along the lines of PCEV implies very bad outcomes. A brief explanation of why I think this: if one tweaks this proposal until it refers to something coherent, then Steve has no meaningful influence regarding the adoption of those preferences that refer to Steve. Because when one is transforming this into something coherent, then Steve cannot retain influence over everything that he cares about strongly enough (as this would result in overlap). And there is nothing in this proposal that gives Steve any special influence regarding the adoption of those preferences that refer to Steve. Thus, in adjacent-but-coherent proposals, Steve will have no reason to expect that the resulting AI will want to help Steve, as opposed to want to hurt Steve.
        It might also be useful to zoom out a bit from the specific conflict between what Bob wants and what Dave wants. I think that it would be useful to view the Pareto constraint as many individual constraints. This set of constraints would include many hard constraints. In particular, it would include many trillions of hard individual-to-individual constraints (including constraints coming from a significant percentage of the global population, that have non-negotiable opinions regarding the fates of billions of other individuals). It is an equivalent but more useful way of representing the same thing. (In addition to being quite large, this set would also be very diverse. It would include hard constraints from many different kinds of non-standard minds. With many different kinds of non-standard ways of looking at things. And many different kinds of non-standard ontologies. Including many types of non-standard ontologies that the designers never considered). We can now describe alternative proposals where Steve gets a say regarding those constraints that only refer to Steve. If one is determined to start from Pareto PCEV, then I think that this is a much more promising path to explore (as opposed to exploring different ways of bending the rules until every single hard constraint can be simultaneously satisfied).
        I also think that it would be a very bad idea to go looking for an extrapolation dynamic that re-writes Bob’s values in a way that makes Bob stop wanting Dave to be punished (or that makes Bob bribable). I think that extrapolating Bob in an honest way, followed by giving Dave a say regarding those constraints that refer to Dave, is a more promising place to start looking for ways of keeping Dave safe from people like Bob. I for example think that this is less likely to result in unforeseen side effects (extrapolation is problematic enough without this type of added complexity. The option of designing different extrapolation dynamics for different groups of people is a bad option. The option of tweaking an extrapolation dynamic that will be used on everyone, with the intent of finding some mapping that will turn Bob into a safe person, is also a bad option).
        Martin Randall 1 Feb 2025 19:44 UTC
        3 points
        0
        Parent
        A lot to chew on in that comment.
        
        A baseline of “no superintelligence”
        
        I think I finally understand, sorry for the delay. The key thing I was not grasping is that Davidad proposed this baseline:
        
        The “random dictator” baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for “Pareto improvement” being “no superintelligence”). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.
        
        This makes Bob’s argument very simple:
        
        Creating a PPCEV AI causes a Dark Future. This is true even if the PPCEV AI no-ops, or creates a single cake. Bob can get here in many ways, as can Extrapolated-Bob.
        The baseline is no superintelligence, so no PPCEV AI, so not a Dark Future (in the same way).
        
        Option 2 is therefore better than option 1. Therefore there are no Pareto-improving proposals. Therefore the PPCEV AI no-ops. Even Bob is not happy about this, as it’s a Dark Future.
        
        I think this is 100% correct.
        
        An alternative baseline
        
        Let’s update Davidad’s proposal by setting the baseline to be whatever happens if the PPCEV AI emits a no-op. This means:
        
        Bob cannot object to a proposal because it implies the existence of PPCEV AI. The PPCEV AI already exists in the baseline.
        Bob needs to consider that if the PPCEV AI emits a no-op then whoever created it will likely try something else, or perhaps some other group will try something.
        Bob cannot object to a proposal because it implies that the PPCEV emits something. The PPCEV already emits something in the baseline.
        
        My logic is that if creating a PPCEV AI is a moral error (and perhaps it is) then at the point where the PPCEV AI is considering proposals then we already made that moral error. Since we can’t reverse the past error, we should consider proposals as they affect the future.
        
        This also avoids treating a no-op outcome as a special case. A no-op output is a proposal to be considered. It is always in the set of possible proposals, since it is never worse than the baseline, because it is the baseline.
        
        Do you think this modified proposal would still result in a no-op output?
        What links here?
        ThomasCederborg's comment on A problem shared by many different alignment targets by ThomasCederborg (3 Mar 2025 17:45 UTC; 1 point)
        ThomasCederborg 9 Feb 2025 20:23 UTC
        3 points
        0
        Parent
        There are no Pareto improvements relative to the new Pareto Baseline that you propose. Bob would indeed classify a scenario with an AI that takes no action as a Dark Future. However, consider Bob2, who takes another perfectly coherent position on how to classify an AI that never acts. If something literally never takes any action, then Bob2 simply does not classify it as a person. Bob2 therefore does not consider a scenario with an AI that literally never does anything to be a Dark Future (other than this difference, Bob2 agrees with Bob about morality). This is also a perfectly reasonable ontology. A single person like Bob2 is enough to make the set of Pareto Improvements relative to your proposed Pareto Baseline empty.
        (As a tangent, I just want to explicitly note here that this discussion is about Pareto Baselines. Not Negotiation Baselines. The negotiation baseline in all scenarios discussed in this exchange is still Yudkowsky’s proposed Random Dictator negotiation baseline. The Pareto Baseline is relevant to the set of actions under consideration in the Random Dictator negotiation baseline. But it is a distinct concept. I just wanted to make this explicit for the sake of any reader that is only skimming this exchange)
        The real thing that you are dealing with is a set of many trillions of hard constraints, defined in billions of ontologies (including a large number of non-standard ontologies. Some presumably a lot more strange than the ontologies of Bob and Bob2). The concept of a Pareto Improvement was really not designed to operate in a context like this. It seems to me that it has never been properly explored in a context like this. I doubt that anyone has ever really thought deeply about how this concept would actually behave in the AI context. Few concepts have actually been properly explored in the AI context (this is related to the fact that the Random Dictator negotiation baseline actually works perfectly fine in the context that it was originally designed for: a single individual trying to deal with Moral Uncertainty. Something similar is also true for the Condorcet Criterion. The intuition failures that seem to happen when people move concepts from CEVI style mappings to CEVH style mappings is also related. Etc, etc, etc. It simply does not seem to exist a workable alternative, to actually exploring a concept, in whatever AI context that one wants to use it in. Simply importing concepts from other contexts, just does not seem to be a reliable way of doing things. This state of affairs is extremely inconvenient).
        Let’s consider the economist Erik, who claims that Erik’s Policy Modification (EPM) is a Pareto Improvement over current policy. Consider someone pointing out to Erik that some people want heretics to burn in hell, and that EPM would be bad for such people, since it would make life better for heretics in expectation. If Erik does decide to respond, he would presumably say something along the lines of: it is not the job of economic policy to satisfy people like this. He probably never explicitly decided to ignore such people. But his entire field is based on the assumption that such people do not need to be taken into consideration when outlining economic policy. When having a political argument about economic policy, such people are in fact not really an obstacle (if they do participate, they will presumably oppose EPM with arguments that do not mention hellfire). The implicit assumption that such positions can be ignored thus holds in the context of debating economic policy. But this assumption breaks when we move the concept to the AI context (where every single type of fanatic is informed, extrapolated, and actually given a very real, and absolute, veto over every single thing that is seen as important enough).
        Let’s look a bit at another Pareto Baseline that might make it easier to see the problem from a different angle (this thought experiment is also relevant to some straightforward ways in which one might further modify your proposed Pareto Baseline in response to Bob2). Consider the Unpleasant Pareto Baseline (UPB). In UPB the AI implements some approximation of everyone burning in hell (specifically: the AI makes everyone experience the sensation of being on fire for as long as it can). It turns out that it only takes two people to render the set of Pareto Improvements relative to UPB empty: Gregg and Jeff from my response to Davidad’s comment. Both want to hurt heretics, but they disagree about who is a heretic. Due to incompatibilities in their respective religions, every conceivable mind is seen as a heretic by at least one of them. Improving the situation of a heretic is Not Allowed. Improving the situation of any conceivable person, in any conceivable way, is thus making things worse from the perspective of at least one of them.
        Gregg and Jeff do have to be a lot more extreme than Bob or Bob2. They might for example be non-neurotypical (for example sharing a condition that has not yet been discovered). And raised in deeply religious environments, whose respective rules they have adopted in an extremely rigid way. So they are certainly rare. But there only needs to be two people like this for the set of Pareto Improvements relative to UPB to be empty. (presumably no one would ever consider building an AI with UPB as a Pareto Baseline. This thought experiment is not meant to illustrate any form of AI risk. It’s just a way of illustrating a point about attempting to simultaneously satisfy trillions of hard constraints, defined in billions of ontologies)
        (I really appreciate you engaging on this in such a thorough and well thought out manner. I don’t see this line of reasoning leading to anything along the lines of a workable patch or a usable Pareto Baseline. But I’m very happy to keep pulling on these threads, to see if one of them leads to some interesting insight. So by all means: please keep pulling on whatever loose ends you can see)
        Martin Randall 15 Feb 2025 20:28 UTC
        3 points
        0
        Parent
        I’m much less convinced by Bob2′s objections than by Bob1′s objections, so the modified baseline is better. I’m not saying it’s solved, but it no longer seems like the biggest problem.
        
        I completely agree that it’s important that “you are dealing with is a set of many trillions of hard constraints, defined in billions of ontologies”. On the other hand, the set of actions is potentially even larger, with septillions of reachable stars. My instinct is that this allows a large number of Pareto improvements, provided that the constraints are not pathological. The possibility of “utility inverters” (like Gregg and Jeff) is an example of pathological constraints.
        
        Utility Inverters
        
        I recently re-read What is malevolence? On the nature, measurement, and distribution of dark traits. Some findings:
        
        Over 16% of people agree or strongly agree that they “would like to make some people suffer even if it meant that I would go to hell with them”. Over 20% of people agree or strongly agree that they would take a punch to ensure someone they don’t like receives two punches.
        
        Such constraints don’t guarantee that there are no Pareto improvements, but they make it very likely, I agree. So what to do? In the article you propose Self Preference Adoption Decision Influence (SPADI), defined as “meaningful influence regarding the adoption of those preferences that refer to her”. We’ve come to a similar place by another route.
        
        There’s some benefit in coming from this angle, we’ve gained some focus on utility inversion as a problem. Some possible options:
        
        Remove utility inverting preferences in the coherently extrapolated delegates. We could call this Coherent Extrapolation of Equanimous Volition, for example. People can prefer that Dave stop cracking his knuckles, but can’t prefer that Dave suffer.
        Remove utility inverting preferences when evaluating whether options are pareto improvements. Actions cannot be rejected because they make Dave happier, but can be rejected because Dave cracking his knuckles makes others unhappier.
        
        I predict you won’t like this because of concerns like: what if Gregg just likes to see heretics burn, not because it makes the heretics suffer, but because it’s aesthetically pleasing to Gregg? No problem, the AI can have Gregg see many burning heretics, that’s just an augmented-reality mod, and if it’s truly an aesthetic preference then Gregg will be happy with that outcome.
        
        Pareto at Scale
        
        It seems to me that it has never been properly explored in a context like this. I doubt that anyone has ever really thought deeply about how this concept would actually behave in the AI context.
        
        I don’t think we have to frame this as “the AI context”, I think the difference is more about scale. Would this count as Computational Social Choice? Might be interesting to do a literature search. I happened across Safe Pareto Improvements for Delegated Game Playing, which isn’t the right paper, but makes me hopeful of finding something more to the point. The paper also helped me realize that finding the result of a parliament is probably NP-hard.
        ThomasCederborg 3 Mar 2025 17:45 UTC
        1 point
        0
        Parent
        The proposals described in your points 1 and 2 are very different from any of the proposals that I am analysing in the post. I consider this to be a good thing. But I wanted to note explicitly that this discussion has now moved very far away from what was discussed in the post, and is best seen as a new discussion (a discussion that starts with the proposals described in your points 1 and 2). Making this clear is important, because it means that many points made in the post (and also earlier in this thread) do not apply to the class of proposals that we are now discussing.
        In particular: all alignment targets analysed in the post are Group AIs. But the alignment target described in your point 1: Coherent Extrapolation of Equanimous Volition (CEEV), is not a Group AI. Given that the primary focus of the post is to analyse the Group AI idea, the analysis of CEEV below is best seen as starting a completely new discussion. Among other things, this means that many arguments from the post about Group AIs will probably not apply to CEEV. (CEEV is still very bad for individuals. Because it is still the case that no individual has any meaningful influence regarding the way in which CEEV adopts those preferences that refer to her. One specific issue is that some CEEV delegates will still prefer outcomes where heretics are punished, because some delegates will still have an aversion to unethical AIs. The issue is described in detail in the last section of this comment).
        The rule for deciding which actions are available to Delegates during negotiations, described in your point 2, is also a large departure from anything discussed in the post. The described rule would accept actions, even though those actions would make things dramatically worse for some people. I think that this makes it a very different kind of rule, compared to Davidad’s proposed Pareto Improvement rule. The points that I made about Pareto Improvements in the post, and earlier in this thread, do not apply to this new class of rules. (The set of actions is still rendered empty by the rule described in your point 2, due to a large and varied set of hard constraints demanding that the AI must not be unethical. A single pair of such demands can render the set empty, by having incompatible views regarding what it means for an AI to be unethical. Some pairs of demands like this have nothing to do with utility inversion. The issue is described in detail in the next section of this comment).
        It also makes sense to explicitly note here that with the rule described in your point 2, you have now started to go down the path of removing entire classes of constraints from consideration (as opposed to going down the path of looking for new Pareto Baselines). Therefore, my statement that the path that you are exploring is unlikely to result in a non-empty set no longer applies. That statement was expressing doubt about finding a usable Pareto Baseline that would result in a non-empty set. But in my view you are now doing something very different (and far more interesting) than looking for a usable Pareto Baseline that would result in a non-empty set.
        I will spend most of this comment talking about the proposals described in your points 1 and 2. But let’s first try to wrap up the previous topics, starting with Bob2. Bob2 is only different from Bob in the sense that Bob2 does not see an AI that literally never acts as a person. I don’t see why Bob2′s way of looking at things would be strange or unusual. A thing that literally never acts can certainly be seen as a person. But it doesn’t have to be seen as a person. Both perspectives seem reasonable. These two different classifications are baked into a core value, related to the Dark Future concept. (In other words: Bob and Bob2 have different values. So there is no reason to think that learning new facts would make them agree on this point. Because there is no reason to think that learning new facts would change core values). In a population of billions, there will thus be plenty of people that share Bob2′s way of looking at such an AI. So if the AI is pointed at billions of humans, the set of Pareto Improvements will be rendered empty by people like Bob2 (relative to the alternative no-AI-action Pareto Baseline that you discussed here).
        Now let’s turn to your point about the size of the action space. Most of my previous points probably do not apply to rules that will ignore entire classes of constraints (such as the “pathological constraints” that you mention). In that case everything depends on how one defines this class of constraints. Rules that do ignore classes of constraints are discussed in the next section of this comment. However: for rules that do not ignore any constraints, the number of actions is not necessarily relevant (in other words: while we are still talking about Pareto Improvements, the number of actions is not necessarily relevant). One can roughly describe the issue as: If one constraint demands X. And another constraint refuses X. Then the set is empty. Regardless of the number of actions.
        I’m not sure whether or not there is any significant disagreement left on this issue. But I will still elaborate a bit more on how I see the original situation (the situation where pathological constraints are not ignored).
        One can say that everything is short circuited by the fact that humans often have very strong opinions about who should be in charge. (And there are many different types of ontologies that are compatible with such sentiments. Which means that we can expect a great variety in terms of what this implies regarding demands about the AI). Wanting the right type of person to be in charge can be instrumental. But it does not have to be instrumental. And there is nothing unusual about demanding things that are entirely symbolic. (In other words: there is nothing unusual about Dennis, who demands that The Person in Charge must do or value things that have no connection with the material situation of Dennis).
        This is not part of every ontology. But caring about who is in charge is a common human value (at least common enough for a population of billions to include a great variety of hard constraints related to this general type of sentiment). The number of actions does not help if one person rejects all trajectories where the person in charge is X. And another person rejects any trajectory unless the person in charge is X. (Combined with the classification of a trajectory that contains a non-misaligned and clever AI, that takes any action, as a trajectory where the first such AI is in charge). (I don’t know if we actually disagree on anything here. Perhaps you would classify all constraints along these lines as Pathological Constraints). (In the next section I will point out that while such incompatible pairs can be related to utility inversion. They do not have to be.)
        I will first discuss the proposal described in your point 2 in the next section, and then discuss the proposal described in your point 1 in the last section (because finding the set of actions that are available to delegates happens before delegates start negotiating).
        The rule for determining which set of actions will be included in negotiations between delegates
        The rule described in your point 2 still results in an empty set, for the same reason that Davidad’s original Pareto Improvement rule results in an empty set. The rule described in your point 2 still does not remove the problem of Bob from the original thought experiment of the post. Because the thing that Bob objects to is an unethical AI. The issue is not about Bob wanting to hurt Dave, or about Bob wanting to believe that the AI is ethical (or that Bob might want to believe that Dave is punished. Or that Bob might want to see Dave being punished). The issue is that Bob does not want the fate of humanity to be determined by an unethical AI.
        Demanded punishments also do not have to refer to Dave’s preferences. It can be the case that Gregg demands that Dave’s preferences are inverted. But it can also be the case that Gregg demands that Dave be subjected to some specific treatment (and this can be a treatment that Dave will categorically reject). There is nothing unexpected about a fanatic demanding that heretics be subjected to a specific type of treatment. It is not feasible to eliminate all “Problematic Constraints” along these lines by eliminating some specific list of constraint types (for example along the lines of: utility inverting constraints, or hostile constraints, or demands that people suffer). Which in combination with the fact that Dave still has no meaningful influence over those constraints that are about Dave, means that there is still nothing preventing someone from demanding that things happen to Dave, that Dave finds completely unacceptable. A single such constraint is sufficient for rendering the action space empty (regardless of the size of the action space).
        When analysing this type of rule it might actually be best to switch to a new type of person, that has not been part of my past thought experiments. Specifically: the issue with the rule described in your point 2 can also be illustrated using a thought experiment that does not involve any preferences that in any way refer to any human. The basic situation is that two people have incompatible demands regarding how an AI must interact with a specific sacred place or object, in order for the AI to be considered acceptable.
        Let’s take ancient Egyptian religion as an example in order to avoid contemporary politics. Consider Intef who was named after the Pharaoh who founded Middle Kingdom Egypt, and Ahmose who was named after the Pharaoh who founded New Kingdom Egypt. They both consider it to be a moral imperative to restore temples to their rightful state (if one has the power to do so). But they disagree on when Egyptian religion was right, and therefore disagree on what the AI must do to avoid being classified as unethical (in the sense of the Dark Future concept).
        Specifically: a Middle Kingdom temple was destroyed and the stones were used to build a New Kingdom temple. Later that temple was also destroyed. Intef considers it to be a moral imperative to use the stones to rebuild the older temple (if one has the ability to do so). And Ahmose considers it to be a moral imperative to use the same stones to rebuild the newer temple (if one has the ability to do so). Neither of them thinks that an unethical AI is acceptable (after the AI is classified as unethical the rest of the story follows the same path as the examples with Bob or Bob2). So the set would still be empty, even if a rule simply ignores every constraint that in any way refers to any human.
        Neither of these demands are in any way hostile (or vicious, or based in hate, or associated with malevolent people, or belligerent, or anything else along such lines). Neither of these demands is on its own problematic or unreasonable. On its own, either of these demands is in fact trivial to satisfy (the vast majority of people would presumably be perfectly ok with either option). And neither of these demands looks dangerous (nor would they result in an advantage in Parliamentarian Negotiations). Very few people would watch the world burn rather than let Intef use the original stones to rebuild his preferred temple. But it only takes one person like Ahmose to make the set of actions empty.
        Let’s go through another iteration and consider AI47 who uses a rule that ignores some additional constraints. When calculating whether or not an action can be used in delegate negotiations, AI47 ignores all preferences that (i): refer to AI47 (thus completely ignoring all demands that AI47 not be unethical), or (ii): refer to any human, or (iii): are dangerous, or (iv): are based on hate / bitterness / spite / ego / etc / etc, or (v): make demands that are unreasonable or difficult to satisfy. Let’s say that in the baseline trajectory that alternative trajectories are compared to, AI47 never acts. If AI47 never acts, then this would lead to someone eventually launching a misaligned AI that would destroy the temple stones (and also kill everyone).
        Intef and Ahmose both think that if a misaligned AI destroys the stones, then this counts as the stones being destroyed in an accident (comparable from a moral standpoint to the case where the stones are destroyed by an unpreventable natural disaster). Conditioned on a trajectory where the stones are not used to restore the right temple, both prefer a trajectory where the stones are destroyed by accident. (In addition to caring about the ethics of the AI that is in charge, they also care about the stones themselves.). And there is no way for a non-misaligned, clever AI (like AI47), to destroy the stones by accident (in a sense that they would consider to be equivalent to an unpreventable natural disaster). So the set is still empty.
        In other words: even though this is no longer an attempt to find a usable Pareto Baseline that simultaneously satisfies many trillions of hard constraints, a single pair of constraints can still make the set empty. And it is still an attempt to deal with a large set of hard constraints, defined in a great variety of ontologies. It is also still true that (in addition to constraints coming from people like Intef and Bob2) this set will also include constraints defined in many ontologies that we will not be able to foresee (including the ontologies of a great variety of non-neurotypical individuals, that have been exposed to a great variety of value systems during childhood). This is an unusual feature of the AI context (compared to other contexts that deal with human preferences). A preference defined in an ontology that no one ever imagined might exist, has no impact on debates about economic policy. But unless one simply states that a rule should ignore any preference that was not considered by the designers, then the quest to find a rule that actually implies a non-empty set, must deal with this highly unusual feature of the AI context.
        (Intef and Ahmose pose a lot more problems in this step, than they pose in the step where delegates are negotiating. In that later step, their delegates have no problematic advantage. Their delegates are also not trying to implement anything worse than extinction. This is probably why this type of person has not been part of any of my past thought experiments. I have not thought deeply about people like Intef and Ahmose)
        (There exists several contemporary examples of this general type of disagreements over sacred locations or objects. Even the specific example of reusing temple stones was a common behaviour in many different times and places. But the ancient Egyptians are the undisputed champions of temple stone reuse. And people nowadays don’t really have strong opinions regarding which version of ancient Egyptian religion is the right version. Which is why I think it makes sense to use this example)
        (I’m happy to keep exploring this issue. I would not be surprised if this line of inquiry leads to some interesting insight)
        (if you are looking for related literature, you might want to take a look at the Sen ``Paradox″ (depending on how one defines “pathological preferences”, they may or may not be related to “nosy preferences”))
        (Technical note: this discussion makes a series of very optimistic assumptions in order to focus on problems that remain despite these assumptions. For example assuming away a large number of very severe definitional issues. Reasoning from such assumptions does not make sense if one is arguing that a given proposal would work. But it does make sense when one is showing that a given proposal fails, even if one makes such optimistic assumptions. This point also applies to the next section)
        Coherent Extrapolation of Equanimous Volition (CEEV)
        Summary: In the CEEV proposal described in your point 1, many different types of fanatics would still be represented by delegates that want outcomes where heretics are punished. For example fanatics that would see a non-punishing AI as unethical. Which means that CEEV still suffers from the problem that was illustrated by the original PCEV thought experiment. In other words: having utility inverting preferences is one possible reason to want an outcome where heretics are punished. Such preferences would not be present in CEEV delegates. But another reason to want an outcome where heretics are punished is a general aversion to unethical AIs. Removing utility inverting preferences from CEEV delegates would not remove their aversion to unethical AIs. Yet another type of sentiment that would be passed on to CEEV delegates, is the case where someone would want heretics to be subjected to some specific type of treatment (simply because, all else equal, it would be sort of nice if the universe ended up like this). There are many other types of sentiments along these lines that would also be passed on to CEEV delegates (including a great variety of sentiments that we have no hope of comprehensively cataloguing). Which means that many different types of CEEV delegates would still want an outcome where heretics are hurt. All of those delegates would still have a very dramatic advantage in CEEV negotiations.
        Let’s start by noting that fanatics can gain a very dramatic negotiation advantage in delegate negotiations, without being nearly as determined as Gregg or Bob. Unlike the situation discussed in the previous section, in delegate negotiations people just need to weakly prefer an outcome where heretics are subjected to some very unpleasant treatment. In other words: people can gain a very dramatic negotiation advantage simply because they feel that (all else equal) it would be sort of nice to have some type of outcome, that for some reason involves bad things happening to heretics.
        There exists a great variety of reasons for why someone might have such sentiments. In other words: some types of fanatics might lose their negotiation advantage in CEEV. But many types of fanatics would retain their advantage (due to a great variety of preferences defined in a great variety of ontologies). Which in turn means that CEEV suffers from the same basic problem that PCEV suffers from.
        You mention the possibility that an AI might lie to a fanatic regarding what is happening. But a proposed outcome along such lines would change nothing. CEEV delegates representing fanatics that have an aversion to unethical AIs would for example have no reason to accept such an outcome. Because the preferences of the fanatics in question is not about their beliefs regarding unethical AIs. Their preferences are about unethical AIs.
        In addition to fanatics with an aversion to unethical AIs, we can also look at George, who wants heretics to be punished as a direct preference (without any involvement of preferences related to unethical AIs). George might for example want all heretics to be subjected to some specific treatment (demands that heretics be subjected to some specific treatment are not unusual). No need for anything complicated or deeply felt. George might simply feel that it would be sort of nice if the universe would be organised like this (all else equal).
        George could also want the details of the treatment to be worked out by a clever AI (without referring to any form of utility inversion or suffering. Or even referring in any way to any heretic, when specifying the details of the treatment). George might for example want all heretics to be put in whatever situation, that would make George feel the greatest amount of regret. In other words: this type of demand does not have to be related to any form of utility inversion. The details of the treatment that George would like heretics to be subjected to, does not even need to be determined by any form of reference to any heretic. In yet other words: there are many ways for fanatics along the lines of George to gain a very large negotiation advantage in CEEV. (The proposal that CEEV might lie to George about what is happening to heretics would change nothing. Because George’s preference is not about George’s beliefs.)
        The type of scenario that you describe, where George might want to see Dave being hurt, is not actually an issue here. Let’s look more generally at George’s preferences regarding George’s experiences, George’s beliefs, George’s world model, etc. None of those pose a problem in original PCEV (because they do not result in a negotiation advantage for George’s delegate). (We might not have any actual disagreement regarding these types of preferences. I just wanted to be clear about this point).
        From the perspective of Steve, the underlying issue with CEEV is that Steve still has no meaningful control over the way in which CEEV adopts those preferences that refer to Steve. Which in turn means that Steve still has no reason to think that CEEV will want to help Steve, as opposed to want to hurt Steve. This point would remain true even if one were to remove additional types of preferences from delegates.
        Eliminating some specific list of preference types (for example along the lines of: utility inverting preferences, or hostile preferences, or preferences that people suffer, etc) does not qualitatively change this situation. Because eliminating such a list of preference types does not result in Steve gaining meaningful influence regarding the adoption of those preferences that refer to Steve. Which in the case of Parliamentarian Negotiations means that delegates will still want to hurt Steve, for a great variety of reasons (for example due to sentiments along the lines of an aversion to unethical AIs. And also due to a long and varied list of other types of sentiments, that we have no hope of exhaustively cataloguing).
        In other words: all those delegates that (for reasons related to a great variety of sentiments) still want outcomes where people are subjected to horrific forms of treatment, will still have a very large negotiation advantage in CEEV. And such delegates will also have a very large negotiation advantage in any other proposal without the SPADI feature, that is based on the idea of eliminating some other specific list of preference types from delegates.
        Since this discussion is exploring hypotheticals (as a way of reaching new insights), I’m happy to keep looking at proposals without the SPADI feature. But given the stakes, I do want to make a tangential point regarding plans that are supposed to end with a successfully implemented AI without the SPADI feature (presumably as the end point of some larger plan that includes things along the lines of: an AI pause, augmented humans, an initial Limited AI, etc, etc).
        In other words: I am happy to keep analysing proposals without the SPADI feature. Because it is hard to predict what one will find when one is pulling on threads like this. And because analysing a dangerous proposal reduces the probability of it being implemented. But I also want to go on a tangent and explain why successfully implementing any AI without the SPADI feature would be extremely bad. And explicitly note that this is true regardless of which specific path one takes to such an AI. And also explicitly note that this is true, regardless of whether or not anyone manages to construct a specific thought experiment illustrating the exact way in which things go bad.
        Let’s look at a hypothetical future proposal to illustrate these two points. Let’s say that someone proposes a plan that is supposed to eventually lead to the implementation of an AI that gets its preferences from billions of humans. This AI does not have the SPADI feature. Now let’s say that this proposed alignment target avoids the specific issues illustrated by all existing thought experiments. Let’s further say that no one is able to construct a specific thought experiment that illustrates exactly how this novel alignment target proposal would lead to a bad outcome. The absence of a thought experiment that illustrates the specific path to a bad outcome, would not in any way shape or form imply that the resulting AI does not want to hurt Steve, if such a proposed plan is successfully implemented. In other words: since Steve will have no meaningful influence regarding the adoption of those preferences that refer to Steve, Steve will have no reason to expect the actual resulting AI to want to help Steve, as opposed to want to hurt Steve. PCEV implied a massively worse than extinction outcome, also before the specific problem was described (and PCEV spent a lot of years as a fairly popular proposal without anyone noticing the issue).
        In yet other words: the actual AI, that is actually implied, by some proposed set of definitions, can end up wanting to hurt Steve, regardless of whether or not someone is able to construct a thought experiment that illustrates the exact mechanism by which this AI will end up wanting to hurt Steve. Which in combination with the fact that Steve does not have any meaningful influence regarding the adoption of those preferences that refer to Steve, means that Steve has no reason to expect this AI to want to help Steve, as opposed to want to hurt Steve.
        In yet other words: the SPADI feature is far from sufficient for basic safety. But it really is necessary for basic safety. Which in turn means that if a proposed AI does not have the SPADI feature, then this AI is known to be extremely bad for human individuals in expectation (if successfully implemented). This is true with or without a specific thought experiment illustrating the specific mechanism that would lead to this AI wanting to hurt individuals. And it is true regardless of what path was taken to the successful implementation of such an AI. (Just wanted to be explicit about these points. Happy to keep analysisng proposals without the SPADI feature.)
        (you might also want to take a look at this post)

ThomasCederborg comments on A problem shared by many different alignment targets

Regarding extrapolation:

Regarding the probability that extrapolation will change Bob:

Regarding successor AIs:

Regarding the last paragraph that talks about finding a clever solution:

A few tangents:

A baseline of “no superintelligence”

An alternative baseline

Utility Inverters

Pareto at Scale

The rule for determining which set of actions will be included in negotiations between delegates

Coherent Extrapolation of Equanimous Volition (CEEV)