A lot to chew on in that comment.
A baseline of “no superintelligence”
I think I finally understand, sorry for the delay. The key thing I was not grasping is that Davidad proposed this baseline:
The “random dictator” baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for “Pareto improvement” being “no superintelligence”). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.
This makes Bob’s argument very simple:
1. Creating a PPCEV AI causes a Dark Future. This is true even if the PPCEV AI no-ops, or creates a single cake. Bob can get here in many ways, as can Extrapolated-Bob.
2. The baseline is no superintelligence, so no PPCEV AI, so not a Dark Future (in the same way).
3. Option 2 is therefore better than option 1. Therefore there are no Pareto-improving proposals. Therefore the PPCEV AI no-ops. Even Bob is not happy about this, as it’s a Dark Future.
I think this is 100% correct.
An alternative baseline
Let’s update Davidad’s proposal by setting the baseline to be whatever happens if the PPCEV AI emits a no-op. This means:
Bob cannot object to a proposal because it implies the existence of a PPCEV AI. The PPCEV AI already exists in the baseline.
Bob needs to consider that if the PPCEV AI emits a no-op then whoever created it will likely try something else, or perhaps some other group will try something.
Bob cannot object to a proposal because it implies that the PPCEV AI emits something. The PPCEV AI already emits something in the baseline.
My logic is that if creating a PPCEV AI is a moral error (and perhaps it is), then by the time the PPCEV AI is considering proposals we have already made that moral error. Since we can’t reverse the past error, we should evaluate proposals by how they affect the future.
This also avoids treating a no-op outcome as a special case. A no-op output is a proposal to be considered. It is always in the set of possible proposals, since it is never worse than the baseline, because it is the baseline.
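To make the comparison concrete, here is a minimal sketch of the Pareto check as I am picturing it. This is my own toy formalisation, purely for illustration (the outcome labels and the one-line model of Bob are made up, not part of Davidad’s proposal):

```python
# Toy model: a person is a function from an outcome to a rank, and a proposal
# survives iff no person ranks it strictly below the chosen baseline.

def pareto_improvements(proposals, people, baseline):
    return [p for p in proposals
            if all(person(p) >= person(baseline) for person in people)]

NO_SUPERINTELLIGENCE = "no superintelligence"
NO_OP = "PPCEV AI exists but never acts"
CAKE = "PPCEV AI creates a single cake"

def bob(outcome):
    # Bob: any trajectory containing a PPCEV AI is a Dark Future.
    return 1 if outcome == NO_SUPERINTELLIGENCE else 0

# Original baseline: Bob vetoes everything, including the no-op.
print(pareto_improvements([NO_OP, CAKE], [bob], NO_SUPERINTELLIGENCE))  # []

# Modified baseline: the no-op is the baseline, so Bob's objection to the mere
# existence (or output) of the PPCEV AI no longer rules anything out by itself.
print(pareto_improvements([NO_OP, CAKE], [bob], NO_OP))  # both proposals survive
```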
Do you think this modified proposal would still result in a no-op output?
There are no Pareto improvements relative to the new Pareto Baseline that you propose. Bob would indeed classify a scenario with an AI that takes no action as a Dark Future. However, consider Bob2, who takes another perfectly coherent position on how to classify an AI that never acts. If something literally never takes any action, then Bob2 simply does not classify it as a person. Bob2 therefore does not consider a scenario with an AI that literally never does anything to be a Dark Future (other than this difference, Bob2 agrees with Bob about morality). This is also a perfectly reasonable ontology. A single person like Bob2 is enough to make the set of Pareto Improvements relative to your proposed Pareto Baseline empty.
(As a tangent, I just want to explicitly note here that this discussion is about Pareto Baselines. Not Negotiation Baselines. The negotiation baseline in all scenarios discussed in this exchange is still Yudkowsky’s proposed Random Dictator negotiation baseline. The Pareto Baseline is relevant to the set of actions under consideration in the Random Dictator negotiation baseline. But it is a distinct concept. I just wanted to make this explicit for the sake of any reader that is only skimming this exchange)
The real thing that you are dealing with is a set of many trillions of hard constraints, defined in billions of ontologies (including a large number of non-standard ontologies, some presumably a lot more strange than the ontologies of Bob and Bob2). The concept of a Pareto Improvement was really not designed to operate in a context like this. It seems to me that it has never been properly explored in a context like this. I doubt that anyone has ever really thought deeply about how this concept would actually behave in the AI context. Few concepts have actually been properly explored in the AI context (this is related to the fact that the Random Dictator negotiation baseline actually works perfectly fine in the context that it was originally designed for: a single individual trying to deal with Moral Uncertainty. Something similar is also true for the Condorcet Criterion. The intuition failures that seem to happen when people move concepts from CEVI style mappings to CEVH style mappings are also related. Etc, etc, etc. There simply does not seem to exist any workable alternative to actually exploring a concept in whatever AI context one wants to use it in. Simply importing concepts from other contexts just does not seem to be a reliable way of doing things. This state of affairs is extremely inconvenient).
Let’s consider the economist Erik, who claims that Erik’s Policy Modification (EPM) is a Pareto Improvement over current policy. Consider someone pointing out to Erik that some people want heretics to burn in hell, and that EPM would be bad for such people, since it would make life better for heretics in expectation. If Erik does decide to respond, he would presumably say something along the lines of: it is not the job of economic policy to satisfy people like this. He probably never explicitly decided to ignore such people. But his entire field is based on the assumption that such people do not need to be taken into consideration when outlining economic policy. When having a political argument about economic policy, such people are in fact not really an obstacle (if they do participate, they will presumably oppose EPM with arguments that do not mention hellfire). The implicit assumption that such positions can be ignored thus holds in the context of debating economic policy. But this assumption breaks when we move the concept to the AI context (where every single type of fanatic is informed, extrapolated, and actually given a very real, and absolute, veto over every single thing that is seen as important enough).
Let’s look a bit at another Pareto Baseline that might make it easier to see the problem from a different angle (this thought experiment is also relevant to some straightforward ways in which one might further modify your proposed Pareto Baseline in response to Bob2). Consider the Unpleasant Pareto Baseline (UPB). In UPB the AI implements some approximation of everyone burning in hell (specifically: the AI makes everyone experience the sensation of being on fire for as long as it can). It turns out that it only takes two people to render the set of Pareto Improvements relative to UPB empty: Gregg and Jeff from my response to Davidad’s comment. Both want to hurt heretics, but they disagree about who is a heretic. Due to incompatibilities in their respective religions, every conceivable mind is seen as a heretic by at least one of them. Improving the situation of a heretic is Not Allowed. Improving the situation of any conceivable person, in any conceivable way, is thus making things worse from the perspective of at least one of them.
Gregg and Jeff do have to be a lot more extreme than Bob or Bob2. They might for example be non-neurotypical (perhaps sharing a condition that has not yet been discovered) and raised in deeply religious environments whose respective rules they have adopted in an extremely rigid way. So they are certainly rare. But there only needs to be two people like this for the set of Pareto Improvements relative to UPB to be empty. (Presumably no one would ever consider building an AI with UPB as a Pareto Baseline. This thought experiment is not meant to illustrate any form of AI risk. It’s just a way of illustrating a point about attempting to simultaneously satisfy trillions of hard constraints, defined in billions of ontologies)
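(To spell out the UPB point with a toy model: treat an outcome as an assignment of wellbeing to each person, and let Gregg and Jeff each rank outcomes purely by how badly their respective heretics fare. The numbers and names below are made up, and the model is a deliberate simplification. It is only meant to show the structure of the problem.)

```python
# Between Gregg's and Jeff's incompatible religions, every conceivable person
# is a heretic to at least one of them.
EVERYONE = ["Dave", "Erin", "Frank", "Gregg", "Jeff"]
heretics_of_gregg = {"Dave", "Erin", "Jeff"}
heretics_of_jeff = {"Frank", "Gregg"}

UPB = {person: -10 for person in EVERYONE}  # everyone burns in the baseline

def inverter_rank(outcome, heretics):
    # Higher is better for the utility inverter: heretics doing worse is preferred.
    return -sum(outcome[h] for h in heretics)

def is_pareto_improvement(outcome):
    return (inverter_rank(outcome, heretics_of_gregg) >= inverter_rank(UPB, heretics_of_gregg)
            and inverter_rank(outcome, heretics_of_jeff) >= inverter_rank(UPB, heretics_of_jeff))

# Making anyone at all better off than UPB is a step down for at least one of them.
print(is_pareto_improvement(dict(UPB, Erin=-9)))   # False: Erin is one of Gregg's heretics
print(is_pareto_improvement(dict(UPB, Gregg=-9)))  # False: Gregg is one of Jeff's heretics
```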
(I really appreciate you engaging on this in such a thorough and well thought out manner. I don’t see this line of reasoning leading to anything along the lines of a workable patch or a usable Pareto Baseline. But I’m very happy to keep pulling on these threads, to see if one of them leads to some interesting insight. So by all means: please keep pulling on whatever loose ends you can see)
I’m much less convinced by Bob2’s objections than by the original Bob’s objections, so the modified baseline is better. I’m not saying it’s solved, but it no longer seems like the biggest problem.
I completely agree that it’s important that what we are dealing with is “a set of many trillions of hard constraints, defined in billions of ontologies”. On the other hand, the set of actions is potentially even larger, with septillions of reachable stars. My instinct is that this allows a large number of Pareto improvements, provided that the constraints are not pathological. The possibility of “utility inverters” (like Gregg and Jeff) is an example of pathological constraints.
Utility Inverters
I recently re-read What is malevolence? On the nature, measurement, and distribution of dark traits. Some findings:
Over 16% of people agree or strongly agree that they “would like to make some people suffer even if it meant that I would go to hell with them”.
Over 20% of people agree or strongly agree that they would take a punch to ensure someone they don’t like receives two punches.
Such constraints don’t guarantee that there are no Pareto improvements, but they make it very likely, I agree. So what to do? In the article you propose Self Preference Adoption Decision Influence (SPADI), defined as “meaningful influence regarding the adoption of those preferences that refer to her”. We’ve come to a similar place by another route.
There’s some benefit in coming at it from this angle: we’ve gained some focus on utility inversion as a problem. Some possible options:
1. Remove utility inverting preferences in the coherently extrapolated delegates. We could call this Coherent Extrapolation of Equanimous Volition, for example. People can prefer that Dave stop cracking his knuckles, but can’t prefer that Dave suffer. (A toy sketch of what such a filter might look like follows below.)
2. Remove utility inverting preferences when evaluating whether options are Pareto improvements. Actions cannot be rejected because they make Dave happier, but can be rejected because Dave cracking his knuckles makes others unhappier.
I predict you won’t like this because of concerns like: what if Gregg just likes to see heretics burn, not because it makes the heretics suffer, but because it’s aesthetically pleasing to Gregg? No problem: the AI can have Gregg see many burning heretics (that’s just an augmented-reality mod), and if it’s truly an aesthetic preference then Gregg will be happy with that outcome.
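Here’s a rough sketch of what I have in mind for option 1, filtering the preferences that a coherently extrapolated delegate inherits. All the names and the `inverts_utility` flag are hypothetical scaffolding; the hard part is obviously the classification itself, which this sketch simply assumes:

```python
from dataclasses import dataclass

@dataclass
class Preference:
    holder: str
    description: str
    inverts_utility: bool  # does satisfying it consist of making someone worse off?

preferences = [
    Preference("Carol", "Dave stops cracking his knuckles", inverts_utility=False),
    Preference("Gregg", "Dave suffers for his heresy", inverts_utility=True),
]

def equanimous(prefs):
    """Preferences a coherently extrapolated delegate is allowed to keep."""
    return [p for p in prefs if not p.inverts_utility]

print([p.description for p in equanimous(preferences)])
# ['Dave stops cracking his knuckles']
```

Option 2 would apply the same kind of filter at the point where proposals are checked against the Pareto baseline, rather than inside the delegates.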
Pareto at Scale
It seems to me that it has never been properly explored in a context like this. I doubt that anyone has ever really thought deeply about how this concept would actually behave in the AI context.
I don’t think we have to frame this as “the AI context”; I think the difference is more about scale. Would this count as Computational Social Choice? Might be interesting to do a literature search. I happened across Safe Pareto Improvements for Delegated Game Playing, which isn’t the right paper, but makes me hopeful of finding something more to the point. The paper also helped me realize that finding the result of a parliament is probably NP-hard.
The proposals described in your points 1 and 2 are very different from any of the proposals that I am analysing in the post. I consider this to be a good thing. But I wanted to note explicitly that this discussion has now moved very far away from what was discussed in the post, and is best seen as a new discussion (a discussion that starts with the proposals described in your points 1 and 2). Making this clear is important, because it means that many points made in the post (and also earlier in this thread) do not apply to the class of proposals that we are now discussing.
In particular: all alignment targets analysed in the post are Group AIs. But the alignment target described in your point 1: Coherent Extrapolation of Equanimous Volition (CEEV), is not a Group AI. Given that the primary focus of the post is to analyse the Group AI idea, the analysis of CEEV below is best seen as starting a completely new discussion. Among other things, this means that many arguments from the post about Group AIs will probably not apply to CEEV. (CEEV is still very bad for individuals. Because it is still the case that no individual has any meaningful influence regarding the way in which CEEV adopts those preferences that refer to her. One specific issue is that some CEEV delegates will still prefer outcomes where heretics are punished, because some delegates will still have an aversion to unethical AIs. The issue is described in detail in the last section of this comment).
The rule for deciding which actions are available to Delegates during negotiations, described in your point 2, is also a large departure from anything discussed in the post. The described rule would accept actions, even though those actions would make things dramatically worse for some people. I think that this makes it a very different kind of rule, compared to Davidad’s proposed Pareto Improvement rule. The points that I made about Pareto Improvements in the post, and earlier in this thread, do not apply to this new class of rules. (The set of actions is still rendered empty by the rule described in your point 2, due to a large and varied set of hard constraints demanding that the AI must not be unethical. A single pair of such demands can render the set empty, by having incompatible views regarding what it means for an AI to be unethical. Some pairs of demands like this have nothing to do with utility inversion. The issue is described in detail in the next section of this comment).
It also makes sense to explicitly note here that with the rule described in your point 2, you have now started to go down the path of removing entire classes of constraints from consideration (as opposed to going down the path of looking for new Pareto Baselines). Therefore, my statement that the path that you are exploring is unlikely to result in a non-empty set no longer applies. That statement was expressing doubt about finding a usable Pareto Baseline that would result in a non-empty set. But in my view you are now doing something very different (and far more interesting) than looking for a usable Pareto Baseline that would result in a non-empty set.
I will spend most of this comment talking about the proposals described in your points 1 and 2. But let’s first try to wrap up the previous topics, starting with Bob2. Bob2 is only different from Bob in the sense that Bob2 does not see an AI that literally never acts as a person. I don’t see why Bob2’s way of looking at things would be strange or unusual. A thing that literally never acts can certainly be seen as a person. But it doesn’t have to be seen as a person. Both perspectives seem reasonable. These two different classifications are baked into a core value, related to the Dark Future concept. (In other words: Bob and Bob2 have different values. So there is no reason to think that learning new facts would make them agree on this point. Because there is no reason to think that learning new facts would change core values). In a population of billions, there will thus be plenty of people that share Bob2’s way of looking at such an AI. So if the AI is pointed at billions of humans, the set of Pareto Improvements will be rendered empty by people like Bob2 (relative to the alternative no-AI-action Pareto Baseline that you discussed here).
Now let’s turn to your point about the size of the action space. Most of my previous points probably do not apply to rules that will ignore entire classes of constraints (such as the “pathological constraints” that you mention). In that case everything depends on how one defines this class of constraints. Rules that do ignore classes of constraints are discussed in the next section of this comment. However: for rules that do not ignore any constraints, the number of actions is not necessarily relevant (in other words: while we are still talking about Pareto Improvements, the number of actions is not necessarily relevant). One can roughly describe the issue as: If one constraint demands X. And another constraint refuses X. Then the set is empty. Regardless of the number of actions.
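(A minimal illustration of this point, with made-up binary features standing in for whatever actually distinguishes actions: the number of candidate actions can be made as large as one likes, and a single incompatible pair of hard constraints still leaves nothing.)

```python
import itertools

def acceptable_actions(actions, constraints):
    return [a for a in actions if all(c(a) for c in constraints)]

demands_x = lambda action: action["X"] is True   # one constraint demands X
refuses_x = lambda action: action["X"] is False  # another constraint refuses X

# 2**16 = 65536 candidate actions, built from 16 binary features.
features = [f"f{i}" for i in range(15)] + ["X"]
actions = [dict(zip(features, bits))
           for bits in itertools.product([False, True], repeat=16)]

print(len(actions))                                              # 65536
print(len(acceptable_actions(actions, [demands_x, refuses_x])))  # 0
```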
I’m not sure whether or not there is any significant disagreement left on this issue. But I will still elaborate a bit more on how I see the original situation (the situation where pathological constraints are not ignored).
One can say that everything is short circuited by the fact that humans often have very strong opinions about who should be in charge. (And there are many different types of ontologies that are compatible with such sentiments. Which means that we can expect a great variety in terms of what this implies regarding demands about the AI). Wanting the right type of person to be in charge can be instrumental. But it does not have to be instrumental. And there is nothing unusual about demanding things that are entirely symbolic. (In other words: there is nothing unusual about Dennis, who demands that The Person in Charge must do or value things that have no connection with the material situation of Dennis).
This is not part of every ontology. But caring about who is in charge is a common human value (at least common enough for a population of billions to include a great variety of hard constraints related to this general type of sentiment). The number of actions does not help if one person rejects all trajectories where the person in charge is X. And another person rejects any trajectory unless the person in charge is X. (Combined with the classification of a trajectory that contains a non-misaligned and clever AI, that takes any action, as a trajectory where the first such AI is in charge). (I don’t know if we actually disagree on anything here. Perhaps you would classify all constraints along these lines as Pathological Constraints). (In the next section I will point out that while such incompatible pairs can be related to utility inversion. They do not have to be.)
I will first discuss the proposal described in your point 2 in the next section, and then discuss the proposal described in your point 1 in the last section (because finding the set of actions that are available to delegates happens before delegates start negotiating).
The rule for determining which set of actions will be included in negotiations between delegates
The rule described in your point 2 still results in an empty set, for the same reason that Davidad’s original Pareto Improvement rule results in an empty set. The rule described in your point 2 still does not remove the problem of Bob from the original thought experiment of the post. Because the thing that Bob objects to is an unethical AI. The issue is not about Bob wanting to hurt Dave, or about Bob wanting to believe that the AI is ethical (or that Bob might want to believe that Dave is punished. Or that Bob might want to see Dave being punished). The issue is that Bob does not want the fate of humanity to be determined by an unethical AI.
Demanded punishments also do not have to refer to Dave’s preferences. It can be the case that Gregg demands that Dave’s preferences are inverted. But it can also be the case that Gregg demands that Dave be subjected to some specific treatment (and this can be a treatment that Dave will categorically reject). There is nothing unexpected about a fanatic demanding that heretics be subjected to a specific type of treatment. It is not feasible to eliminate all “Problematic Constraints” along these lines by eliminating some specific list of constraint types (for example along the lines of: utility inverting constraints, or hostile constraints, or demands that people suffer). Which in combination with the fact that Dave still has no meaningful influence over those constraints that are about Dave, means that there is still nothing preventing someone from demanding that things happen to Dave, that Dave finds completely unacceptable. A single such constraint is sufficient for rendering the action space empty (regardless of the size of the action space).
When analysing this type of rule it might actually be best to switch to a new type of person, that has not been part of my past thought experiments. Specifically: the issue with the rule described in your point 2 can also be illustrated using a thought experiment that does not involve any preferences that in any way refer to any human. The basic situation is that two people have incompatible demands regarding how an AI must interact with a specific sacred place or object, in order for the AI to be considered acceptable.
Let’s take ancient Egyptian religion as an example in order to avoid contemporary politics. Consider Intef, who was named after the Pharaoh who founded Middle Kingdom Egypt, and Ahmose, who was named after the Pharaoh who founded New Kingdom Egypt. They both consider it to be a moral imperative to restore temples to their rightful state (if one has the power to do so). But they disagree about which version of Egyptian religion was the right one, and therefore disagree on what the AI must do to avoid being classified as unethical (in the sense of the Dark Future concept).
Specifically: a Middle Kingdom temple was destroyed and the stones were used to build a New Kingdom temple. Later that temple was also destroyed. Intef considers it to be a moral imperative to use the stones to rebuild the older temple (if one has the ability to do so). And Ahmose considers it to be a moral imperative to use the same stones to rebuild the newer temple (if one has the ability to do so). Neither of them thinks that an unethical AI is acceptable (after the AI is classified as unethical the rest of the story follows the same path as the examples with Bob or Bob2). So the set would still be empty, even if a rule simply ignores every constraint that in any way refers to any human.
Neither of these demands is in any way hostile (or vicious, or based in hate, or associated with malevolent people, or belligerent, or anything else along such lines). Neither of these demands is on its own problematic or unreasonable. On its own, either of these demands is in fact trivial to satisfy (the vast majority of people would presumably be perfectly ok with either option). And neither of these demands looks dangerous (nor would they result in an advantage in Parliamentarian Negotiations). Very few people would watch the world burn rather than let Intef use the original stones to rebuild his preferred temple. But it only takes one person like Ahmose to make the set of actions empty.
Let’s go through another iteration and consider AI47 who uses a rule that ignores some additional constraints. When calculating whether or not an action can be used in delegate negotiations, AI47 ignores all preferences that (i): refer to AI47 (thus completely ignoring all demands that AI47 not be unethical), or (ii): refer to any human, or (iii): are dangerous, or (iv): are based on hate / bitterness / spite / ego / etc / etc, or (v): make demands that are unreasonable or difficult to satisfy. Let’s say that in the baseline trajectory that alternative trajectories are compared to, AI47 never acts. If AI47 never acts, then this would lead to someone eventually launching a misaligned AI that would destroy the temple stones (and also kill everyone).
Intef and Ahmose both think that if a misaligned AI destroys the stones, then this counts as the stones being destroyed in an accident (comparable from a moral standpoint to the case where the stones are destroyed by an unpreventable natural disaster). Conditioned on a trajectory where the stones are not used to restore the right temple, both prefer a trajectory where the stones are destroyed by accident. (In addition to caring about the ethics of the AI that is in charge, they also care about the stones themselves.) And there is no way for a non-misaligned, clever AI (like AI47) to destroy the stones by accident (in a sense that they would consider to be equivalent to an unpreventable natural disaster). So the set is still empty.
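(The situation that remains after AI47’s screening rule has been applied can be summarised with a small toy encoding. The outcome labels are mine, and the key assumption, argued for above, is that nothing a clever non-misaligned AI deliberately does counts as the stones being destroyed by accident.)

```python
# What survives the screen is each person's hard constraint about the stones
# themselves (neither constraint refers to AI47, to any human, etc).
intef_ok = {"rebuild the Middle Kingdom temple", "stones destroyed by accident"}
ahmose_ok = {"rebuild the New Kingdom temple", "stones destroyed by accident"}

# Outcomes that AI47 can actually bring about. An engineered "accident" is not
# among them, in the sense that Intef and Ahmose care about.
ai47_reachable = {
    "rebuild the Middle Kingdom temple",
    "rebuild the New Kingdom temple",
    "preserve the stones untouched",
    "no-op",
}

print({o for o in ai47_reachable if o in intef_ok and o in ahmose_ok})  # set()
```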
In other words: even though this is no longer an attempt to find a usable Pareto Baseline that simultaneously satisfies many trillions of hard constraints, a single pair of constraints can still make the set empty. And it is still an attempt to deal with a large set of hard constraints, defined in a great variety of ontologies. It is also still true that (in addition to constraints coming from people like Intef and Bob2) this set will also include constraints defined in many ontologies that we will not be able to foresee (including the ontologies of a great variety of non-neurotypical individuals who have been exposed to a great variety of value systems during childhood). This is an unusual feature of the AI context (compared to other contexts that deal with human preferences). A preference defined in an ontology that no one ever imagined might exist has no impact on debates about economic policy. But unless one simply states that a rule should ignore any preference that was not considered by the designers, the quest to find a rule that actually implies a non-empty set must deal with this highly unusual feature of the AI context.
(Intef and Ahmose pose a lot more problems in this step, than they pose in the step where delegates are negotiating. In that later step, their delegates have no problematic advantage. Their delegates are also not trying to implement anything worse than extinction. This is probably why this type of person has not been part of any of my past thought experiments. I have not thought deeply about people like Intef and Ahmose)
(There exist several contemporary examples of this general type of disagreement over sacred locations or objects. Even the specific example of reusing temple stones was a common behaviour in many different times and places. But the ancient Egyptians are the undisputed champions of temple stone reuse. And people nowadays don’t really have strong opinions regarding which version of ancient Egyptian religion is the right version. Which is why I think it makes sense to use this example)
(I’m happy to keep exploring this issue. I would not be surprised if this line of inquiry leads to some interesting insight)
(if you are looking for related literature, you might want to take a look at the Sen “Paradox” (depending on how one defines “pathological preferences”, they may or may not be related to “nosy preferences”))
(Technical note: this discussion makes a series of very optimistic assumptions in order to focus on problems that remain despite these assumptions. For example assuming away a large number of very severe definitional issues. Reasoning from such assumptions does not make sense if one is arguing that a given proposal would work. But it does make sense when one is showing that a given proposal fails, even if one makes such optimistic assumptions. This point also applies to the next section)
Coherent Extrapolation of Equanimous Volition (CEEV)
Summary: In the CEEV proposal described in your point 1, many different types of fanatics would still be represented by delegates that want outcomes where heretics are punished. For example fanatics that would see a non-punishing AI as unethical. Which means that CEEV still suffers from the problem that was illustrated by the original PCEV thought experiment. In other words: having utility inverting preferences is one possible reason to want an outcome where heretics are punished. Such preferences would not be present in CEEV delegates. But another reason to want an outcome where heretics are punished is a general aversion to unethical AIs. Removing utility inverting preferences from CEEV delegates would not remove their aversion to unethical AIs. Yet another type of sentiment that would be passed on to CEEV delegates, is the case where someone would want heretics to be subjected to some specific type of treatment (simply because, all else equal, it would be sort of nice if the universe ended up like this). There are many other types of sentiments along these lines that would also be passed on to CEEV delegates (including a great variety of sentiments that we have no hope of comprehensively cataloguing). Which means that many different types of CEEV delegates would still want an outcome where heretics are hurt. All of those delegates would still have a very dramatic advantage in CEEV negotiations.
Let’s start by noting that fanatics can gain a very dramatic negotiation advantage in delegate negotiations, without being nearly as determined as Gregg or Bob. Unlike the situation discussed in the previous section, in delegate negotiations people just need to weakly prefer an outcome where heretics are subjected to some very unpleasant treatment. In other words: people can gain a very dramatic negotiation advantage simply because they feel that (all else equal) it would be sort of nice to have some type of outcome, that for some reason involves bad things happening to heretics.
There exists a great variety of reasons for why someone might have such sentiments. In other words: some types of fanatics might lose their negotiation advantage in CEEV. But many types of fanatics would retain their advantage (due to a great variety of preferences defined in a great variety of ontologies). Which in turn means that CEEV suffers from the same basic problem that PCEV suffers from.
You mention the possibility that an AI might lie to a fanatic regarding what is happening. But a proposed outcome along such lines would change nothing. CEEV delegates representing fanatics that have an aversion to unethical AIs would for example have no reason to accept such an outcome. Because the preferences of the fanatics in question are not about their beliefs regarding unethical AIs. Their preferences are about unethical AIs.
In addition to fanatics with an aversion to unethical AIs, we can also look at George, who wants heretics to be punished as a direct preference (without any involvement of preferences related to unethical AIs). George might for example want all heretics to be subjected to some specific treatment (demands that heretics be subjected to some specific treatment are not unusual). No need for anything complicated or deeply felt. George might simply feel that it would be sort of nice if the universe would be organised like this (all else equal).
George could also want the details of the treatment to be worked out by a clever AI (without referring to any form of utility inversion or suffering. Or even referring in any way to any heretic, when specifying the details of the treatment). George might for example want all heretics to be put in whatever situation that would make George feel the greatest amount of regret. In other words: this type of demand does not have to be related to any form of utility inversion. The details of the treatment that George would like heretics to be subjected to do not even need to be determined by any form of reference to any heretic. In yet other words: there are many ways for fanatics along the lines of George to gain a very large negotiation advantage in CEEV. (The proposal that CEEV might lie to George about what is happening to heretics would change nothing. Because George’s preference is not about George’s beliefs.)
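(A small continuation of the toy filter idea from earlier in this thread makes the George case concrete. The keyword classifier below is deliberately crude and entirely my own stand-in; the point is that George’s demand is not phrased in terms of the heretics’ utility, suffering, or preferences at all, so a filter defined as a list of red-flag preference types has nothing to catch.)

```python
def looks_utility_inverting(description: str) -> bool:
    # Stand-in for whatever classifier implements "remove utility inverting
    # preferences"; it can only key on how the demand is actually defined.
    return any(word in description for word in ("suffer", "worse off", "invert"))

demands = {
    "Gregg": "heretics suffer for their heresy",
    "George": "all heretics are put in whatever situation a clever AI predicts "
              "would make George feel the greatest amount of regret",
}

surviving = {who: d for who, d in demands.items() if not looks_utility_inverting(d)}
print(surviving)  # Gregg's demand is caught; George's sails through to his delegate
```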
The type of scenario that you describe, where George might want to see Dave being hurt, is not actually an issue here. Let’s look more generally at George’s preferences regarding George’s experiences, George’s beliefs, George’s world model, etc. None of those pose a problem in original PCEV (because they do not result in a negotiation advantage for George’s delegate). (We might not have any actual disagreement regarding these types of preferences. I just wanted to be clear about this point).
From the perspective of Steve, the underlying issue with CEEV is that Steve still has no meaningful control over the way in which CEEV adopts those preferences that refer to Steve. Which in turn means that Steve still has no reason to think that CEEV will want to help Steve, as opposed to want to hurt Steve. This point would remain true even if one were to remove additional types of preferences from delegates.
Eliminating some specific list of preference types (for example along the lines of: utility inverting preferences, or hostile preferences, or preferences that people suffer, etc) does not qualitatively change this situation. Because eliminating such a list of preference types does not result in Steve gaining meaningful influence regarding the adoption of those preferences that refer to Steve. Which in the case of Parliamentarian Negotiations means that delegates will still want to hurt Steve, for a great variety of reasons (for example due to sentiments along the lines of an aversion to unethical AIs. And also due to a long and varied list of other types of sentiments, that we have no hope of exhaustively cataloguing).
In other words: all those delegates that (for reasons related to a great variety of sentiments) still want outcomes where people are subjected to horrific forms of treatment, will still have a very large negotiation advantage in CEEV. And such delegates will also have a very large negotiation advantage in any other proposal without the SPADI feature, that is based on the idea of eliminating some other specific list of preference types from delegates.
Since this discussion is exploring hypotheticals (as a way of reaching new insights), I’m happy to keep looking at proposals without the SPADI feature. But given the stakes, I do want to make a tangential point regarding plans that are supposed to end with a successfully implemented AI without the SPADI feature (presumably as the end point of some larger plan that includes things along the lines of: an AI pause, augmented humans, an initial Limited AI, etc, etc).
In other words: I am happy to keep analysing proposals without the SPADI feature. Because it is hard to predict what one will find when one is pulling on threads like this. And because analysing a dangerous proposal reduces the probability of it being implemented. But I also want to go on a tangent and explain why successfully implementing any AI without the SPADI feature would be extremely bad. And explicitly note that this is true regardless of which specific path one takes to such an AI. And also explicitly note that this is true, regardless of whether or not anyone manages to construct a specific thought experiment illustrating the exact way in which things go bad.
Let’s look at a hypothetical future proposal to illustrate these two points. Let’s say that someone proposes a plan that is supposed to eventually lead to the implementation of an AI that gets its preferences from billions of humans. This AI does not have the SPADI feature. Now let’s say that this proposed alignment target avoids the specific issues illustrated by all existing thought experiments. Let’s further say that no one is able to construct a specific thought experiment that illustrates exactly how this novel alignment target proposal would lead to a bad outcome. The absence of a thought experiment that illustrates the specific path to a bad outcome, would not in any way shape or form imply that the resulting AI does not want to hurt Steve, if such a proposed plan is successfully implemented. In other words: since Steve will have no meaningful influence regarding the adoption of those preferences that refer to Steve, Steve will have no reason to expect the actual resulting AI to want to help Steve, as opposed to want to hurt Steve. PCEV implied a massively worse than extinction outcome, also before the specific problem was described (and PCEV spent a lot of years as a fairly popular proposal without anyone noticing the issue).
In yet other words: the actual AI, that is actually implied, by some proposed set of definitions, can end up wanting to hurt Steve, regardless of whether or not someone is able to construct a thought experiment that illustrates the exact mechanism by which this AI will end up wanting to hurt Steve. Which in combination with the fact that Steve does not have any meaningful influence regarding the adoption of those preferences that refer to Steve, means that Steve has no reason to expect this AI to want to help Steve, as opposed to want to hurt Steve.
In yet other words: the SPADI feature is far from sufficient for basic safety. But it really is necessary for basic safety. Which in turn means that if a proposed AI does not have the SPADI feature, then this AI is known to be extremely bad for human individuals in expectation (if successfully implemented). This is true with or without a specific thought experiment illustrating the specific mechanism that would lead to this AI wanting to hurt individuals. And it is true regardless of what path was taken to the successful implementation of such an AI. (Just wanted to be explicit about these points. Happy to keep analysing proposals without the SPADI feature.)
A lot to chew on in that comment.
A baseline of “no superintelligence”
I think I finally understand, sorry for the delay. The key thing I was not grasping is that Davidad proposed this baseline:
This makes Bob’s argument very simple:
Creating a PPCEV AI causes a Dark Future. This is true even if the PPCEV AI no-ops, or creates a single cake. Bob can get here in many ways, as can Extrapolated-Bob.
The baseline is no superintelligence, so no PPCEV AI, so not a Dark Future (in the same way).
Option 2 is therefore better than option 1. Therefore there are no Pareto-improving proposals. Therefore the PPCEV AI no-ops. Even Bob is not happy about this, as it’s a Dark Future.
I think this is 100% correct.
An alternative baseline
Let’s update Davidad’s proposal by setting the baseline to be whatever happens if the PPCEV AI emits a no-op. This means:
Bob cannot object to a proposal because it implies the existence of PPCEV AI. The PPCEV AI already exists in the baseline.
Bob needs to consider that if the PPCEV AI emits a no-op then whoever created it will likely try something else, or perhaps some other group will try something.
Bob cannot object to a proposal because it implies that the PPCEV emits something. The PPCEV already emits something in the baseline.
My logic is that if creating a PPCEV AI is a moral error (and perhaps it is) then at the point where the PPCEV AI is considering proposals then we already made that moral error. Since we can’t reverse the past error, we should consider proposals as they affect the future.
This also avoids treating a no-op outcome as a special case. A no-op output is a proposal to be considered. It is always in the set of possible proposals, since it is never worse than the baseline, because it is the baseline.
Do you think this modified proposal would still result in a no-op output?
There are no Pareto improvements relative to the new Pareto Baseline that you propose. Bob would indeed classify a scenario with an AI that takes no action as a Dark Future. However, consider Bob2, who takes another perfectly coherent position on how to classify an AI that never acts. If something literally never takes any action, then Bob2 simply does not classify it as a person. Bob2 therefore does not consider a scenario with an AI that literally never does anything to be a Dark Future (other than this difference, Bob2 agrees with Bob about morality). This is also a perfectly reasonable ontology. A single person like Bob2 is enough to make the set of Pareto Improvements relative to your proposed Pareto Baseline empty.
(As a tangent, I just want to explicitly note here that this discussion is about Pareto Baselines. Not Negotiation Baselines. The negotiation baseline in all scenarios discussed in this exchange is still Yudkowsky’s proposed Random Dictator negotiation baseline. The Pareto Baseline is relevant to the set of actions under consideration in the Random Dictator negotiation baseline. But it is a distinct concept. I just wanted to make this explicit for the sake of any reader that is only skimming this exchange)
The real thing that you are dealing with is a set of many trillions of hard constraints, defined in billions of ontologies (including a large number of non-standard ontologies. Some presumably a lot more strange than the ontologies of Bob and Bob2). The concept of a Pareto Improvement was really not designed to operate in a context like this. It seems to me that it has never been properly explored in a context like this. I doubt that anyone has ever really thought deeply about how this concept would actually behave in the AI context. Few concepts have actually been properly explored in the AI context (this is related to the fact that the Random Dictator negotiation baseline actually works perfectly fine in the context that it was originally designed for: a single individual trying to deal with Moral Uncertainty. Something similar is also true for the Condorcet Criterion. The intuition failures that seem to happen when people move concepts from CEVI style mappings to CEVH style mappings is also related. Etc, etc, etc. It simply does not seem to exist a workable alternative, to actually exploring a concept, in whatever AI context that one wants to use it in. Simply importing concepts from other contexts, just does not seem to be a reliable way of doing things. This state of affairs is extremely inconvenient).
Let’s consider the economist Erik, who claims that Erik’s Policy Modification (EPM) is a Pareto Improvement over current policy. Consider someone pointing out to Erik that some people want heretics to burn in hell, and that EPM would be bad for such people, since it would make life better for heretics in expectation. If Erik does decide to respond, he would presumably say something along the lines of: it is not the job of economic policy to satisfy people like this. He probably never explicitly decided to ignore such people. But his entire field is based on the assumption that such people do not need to be taken into consideration when outlining economic policy. When having a political argument about economic policy, such people are in fact not really an obstacle (if they do participate, they will presumably oppose EPM with arguments that do not mention hellfire). The implicit assumption that such positions can be ignored thus holds in the context of debating economic policy. But this assumption breaks when we move the concept to the AI context (where every single type of fanatic is informed, extrapolated, and actually given a very real, and absolute, veto over every single thing that is seen as important enough).
Let’s look a bit at another Pareto Baseline that might make it easier to see the problem from a different angle (this thought experiment is also relevant to some straightforward ways in which one might further modify your proposed Pareto Baseline in response to Bob2). Consider the Unpleasant Pareto Baseline (UPB). In UPB the AI implements some approximation of everyone burning in hell (specifically: the AI makes everyone experience the sensation of being on fire for as long as it can). It turns out that it only takes two people to render the set of Pareto Improvements relative to UPB empty: Gregg and Jeff from my response to Davidad’s comment. Both want to hurt heretics, but they disagree about who is a heretic. Due to incompatibilities in their respective religions, every conceivable mind is seen as a heretic by at least one of them. Improving the situation of a heretic is Not Allowed. Improving the situation of any conceivable person, in any conceivable way, is thus making things worse from the perspective of at least one of them.
Gregg and Jeff do have to be a lot more extreme than Bob or Bob2. They might for example be non-neurotypical (for example sharing a condition that has not yet been discovered). And raised in deeply religious environments, whose respective rules they have adopted in an extremely rigid way. So they are certainly rare. But there only needs to be two people like this for the set of Pareto Improvements relative to UPB to be empty. (presumably no one would ever consider building an AI with UPB as a Pareto Baseline. This thought experiment is not meant to illustrate any form of AI risk. It’s just a way of illustrating a point about attempting to simultaneously satisfy trillions of hard constraints, defined in billions of ontologies)
(I really appreciate you engaging on this in such a thorough and well thought out manner. I don’t see this line of reasoning leading to anything along the lines of a workable patch or a usable Pareto Baseline. But I’m very happy to keep pulling on these threads, to see if one of them leads to some interesting insight. So by all means: please keep pulling on whatever loose ends you can see)
I’m much less convinced by Bob2′s objections than by Bob1′s objections, so the modified baseline is better. I’m not saying it’s solved, but it no longer seems like the biggest problem.
I completely agree that it’s important that “you are dealing with is a set of many trillions of hard constraints, defined in billions of ontologies”. On the other hand, the set of actions is potentially even larger, with septillions of reachable stars. My instinct is that this allows a large number of Pareto improvements, provided that the constraints are not pathological. The possibility of “utility inverters” (like Gregg and Jeff) is an example of pathological constraints.
Utility Inverters
I recently re-read What is malevolence? On the nature, measurement, and distribution of dark traits. Some findings:
Such constraints don’t guarantee that there are no Pareto improvements, but they make it very likely, I agree. So what to do? In the article you propose Self Preference Adoption Decision Influence (SPADI), defined as “meaningful influence regarding the adoption of those preferences that refer to her”. We’ve come to a similar place by another route.
There’s some benefit in coming from this angle, we’ve gained some focus on utility inversion as a problem. Some possible options:
Remove utility inverting preferences in the coherently extrapolated delegates. We could call this Coherent Extrapolation of Equanimous Volition, for example. People can prefer that Dave stop cracking his knuckles, but can’t prefer that Dave suffer.
Remove utility inverting preferences when evaluating whether options are pareto improvements. Actions cannot be rejected because they make Dave happier, but can be rejected because Dave cracking his knuckles makes others unhappier.
I predict you won’t like this because of concerns like: what if Gregg just likes to see heretics burn, not because it makes the heretics suffer, but because it’s aesthetically pleasing to Gregg? No problem, the AI can have Gregg see many burning heretics, that’s just an augmented-reality mod, and if it’s truly an aesthetic preference then Gregg will be happy with that outcome.
Pareto at Scale
I don’t think we have to frame this as “the AI context”, I think the difference is more about scale. Would this count as Computational Social Choice? Might be interesting to do a literature search. I happened across Safe Pareto Improvements for Delegated Game Playing, which isn’t the right paper, but makes me hopeful of finding something more to the point. The paper also helped me realize that finding the result of a parliament is probably NP-hard.
The proposals described in your points 1 and 2 are very different from any of the proposals that I am analysing in the post. I consider this to be a good thing. But I wanted to note explicitly that this discussion has now moved very far away from what was discussed in the post, and is best seen as a new discussion (a discussion that starts with the proposals described in your points 1 and 2). Making this clear is important, because it means that many points made in the post (and also earlier in this thread) do not apply to the class of proposals that we are now discussing.
In particular: all alignment targets analysed in the post are Group AIs. But the alignment target described in your point 1: Coherent Extrapolation of Equanimous Volition (CEEV), is not a Group AI. Given that the primary focus of the post is to analyse the Group AI idea, the analysis of CEEV below is best seen as starting a completely new discussion. Among other things, this means that many arguments from the post about Group AIs will probably not apply to CEEV. (CEEV is still very bad for individuals. Because it is still the case that no individual has any meaningful influence regarding the way in which CEEV adopts those preferences that refer to her. One specific issue is that some CEEV delegates will still prefer outcomes where heretics are punished, because some delegates will still have an aversion to unethical AIs. The issue is described in detail in the last section of this comment).
The rule for deciding which actions are available to Delegates during negotiations, described in your point 2, is also a large departure from anything discussed in the post. The described rule would accept actions, even though those actions would make things dramatically worse for some people. I think that this makes it a very different kind of rule, compared to Davidad’s proposed Pareto Improvement rule. The points that I made about Pareto Improvements in the post, and earlier in this thread, do not apply to this new class of rules. (The set of actions is still rendered empty by the rule described in your point 2, due to a large and varied set of hard constraints demanding that the AI must not be unethical. A single pair of such demands can render the set empty, by having incompatible views regarding what it means for an AI to be unethical. Some pairs of demands like this have nothing to do with utility inversion. The issue is described in detail in the next section of this comment).
It also makes sense to explicitly note here that with the rule described in your point 2, you have now started to go down the path of removing entire classes of constraints from consideration (as opposed to going down the path of looking for new Pareto Baselines). Therefore, my statement that the path that you are exploring is unlikely to result in a non-empty set no longer applies. That statement was expressing doubt about finding a usable Pareto Baseline that would result in a non-empty set. But in my view you are now doing something very different (and far more interesting) than looking for a usable Pareto Baseline that would result in a non-empty set.
I will spend most of this comment talking about the proposals described in your points 1 and 2. But let’s first try to wrap up the previous topics, starting with Bob2. Bob2 is only different from Bob in the sense that Bob2 does not see an AI that literally never acts as a person. I don’t see why Bob2′s way of looking at things would be strange or unusual. A thing that literally never acts can certainly be seen as a person. But it doesn’t have to be seen as a person. Both perspectives seem reasonable. These two different classifications are baked into a core value, related to the Dark Future concept. (In other words: Bob and Bob2 have different values. So there is no reason to think that learning new facts would make them agree on this point. Because there is no reason to think that learning new facts would change core values). In a population of billions, there will thus be plenty of people that share Bob2′s way of looking at such an AI. So if the AI is pointed at billions of humans, the set of Pareto Improvements will be rendered empty by people like Bob2 (relative to the alternative no-AI-action Pareto Baseline that you discussed here).
Now let’s turn to your point about the size of the action space. Most of my previous points probably do not apply to rules that will ignore entire classes of constraints (such as the “pathological constraints” that you mention). In that case everything depends on how one defines this class of constraints. Rules that do ignore classes of constraints are discussed in the next section of this comment. However: for rules that do not ignore any constraints, the number of actions is not necessarily relevant (in other words: while we are still talking about Pareto Improvements, the number of actions is not necessarily relevant). One can roughly describe the issue as: If one constraint demands X. And another constraint refuses X. Then the set is empty. Regardless of the number of actions.
I’m not sure whether or not there is any significant disagreement left on this issue. But I will still elaborate a bit more on how I see the original situation (the situation where pathological constraints are not ignored).
One can say that everything is short circuited by the fact that humans often have very strong opinions about who should be in charge. (And there are many different types of ontologies that are compatible with such sentiments. Which means that we can expect a great variety in terms of what this implies regarding demands about the AI). Wanting the right type of person to be in charge can be instrumental. But it does not have to be instrumental. And there is nothing unusual about demanding things that are entirely symbolic. (In other words: there is nothing unusual about Dennis, who demands that The Person in Charge must do or value things that have no connection with the material situation of Dennis).
This is not part of every ontology. But caring about who is in charge is a common human value (at least common enough for a population of billions to include a great variety of hard constraints related to this general type of sentiment). The number of actions does not help if one person rejects all trajectories where the person in charge is X. And another person rejects any trajectory unless the person in charge is X. (Combined with the classification of a trajectory that contains a non-misaligned and clever AI, that takes any action, as a trajectory where the first such AI is in charge). (I don’t know if we actually disagree on anything here. Perhaps you would classify all constraints along these lines as Pathological Constraints). (In the next section I will point out that while such incompatible pairs can be related to utility inversion. They do not have to be.)
I will first discuss the proposal described in your point 2 in the next section, and then discuss the proposal described in your point 1 in the last section (because finding the set of actions that are available to delegates happens before delegates start negotiating).
The rule for determining which set of actions will be included in negotiations between delegates
The rule described in your point 2 still results in an empty set, for the same reason that Davidad’s original Pareto Improvement rule results in an empty set. The rule described in your point 2 still does not remove the problem of Bob from the original thought experiment of the post. Because the thing that Bob objects to is an unethical AI. The issue is not about Bob wanting to hurt Dave, or about Bob wanting to believe that the AI is ethical (or that Bob might want to believe that Dave is punished. Or that Bob might want to see Dave being punished). The issue is that Bob does not want the fate of humanity to be determined by an unethical AI.
Demanded punishments also do not have to refer to Dave’s preferences. It can be the case that Gregg demands that Dave’s preferences are inverted. But it can also be the case that Gregg demands that Dave be subjected to some specific treatment (and this can be a treatment that Dave will categorically reject). There is nothing unexpected about a fanatic demanding that heretics be subjected to a specific type of treatment. It is not feasible to eliminate all “Problematic Constraints” along these lines by eliminating some specific list of constraint types (for example along the lines of: utility inverting constraints, or hostile constraints, or demands that people suffer). Which in combination with the fact that Dave still has no meaningful influence over those constraints that are about Dave, means that there is still nothing preventing someone from demanding that things happen to Dave, that Dave finds completely unacceptable. A single such constraint is sufficient for rendering the action space empty (regardless of the size of the action space).
When analysing this type of rule it might actually be best to switch to a new type of person, that has not been part of my past thought experiments. Specifically: the issue with the rule described in your point 2 can also be illustrated using a thought experiment that does not involve any preferences that in any way refer to any human. The basic situation is that two people have incompatible demands regarding how an AI must interact with a specific sacred place or object, in order for the AI to be considered acceptable.
Let’s take ancient Egyptian religion as an example, in order to avoid contemporary politics. Consider Intef, who was named after the Theban pharaohs whose dynasty founded Middle Kingdom Egypt, and Ahmose, who was named after the Pharaoh who founded New Kingdom Egypt. They both consider it to be a moral imperative to restore temples to their rightful state (if one has the power to do so). But they disagree about which era of Egyptian religion had it right, and therefore disagree on what the AI must do to avoid being classified as unethical (in the sense of the Dark Future concept).
Specifically: a Middle Kingdom temple was destroyed and the stones were used to build a New Kingdom temple. Later that temple was also destroyed. Intef considers it to be a moral imperative to use the stones to rebuild the older temple (if one has the ability to do so). And Ahmose considers it to be a moral imperative to use the same stones to rebuild the newer temple (if one has the ability to do so). Neither of them thinks that an unethical AI is acceptable (after the AI is classified as unethical the rest of the story follows the same path as the examples with Bob or Bob2). So the set would still be empty, even if a rule simply ignores every constraint that in any way refers to any human.
Neither of these demands is in any way hostile (or vicious, or based in hate, or associated with malevolent people, or belligerent, or anything else along such lines). Neither of these demands is on its own problematic or unreasonable. On its own, either of these demands is in fact trivial to satisfy (the vast majority of people would presumably be perfectly ok with either option). And neither of these demands looks dangerous (nor would they result in an advantage in Parliamentarian Negotiations). Very few people would watch the world burn rather than let Intef use the original stones to rebuild his preferred temple. But it only takes one person like Ahmose to make the set of actions empty.
Let’s go through another iteration and consider AI47, which uses a rule that ignores some additional constraints. When calculating whether or not an action can be used in delegate negotiations, AI47 ignores all preferences that (i) refer to AI47 (thus completely ignoring all demands that AI47 not be unethical), (ii) refer to any human, (iii) are dangerous, (iv) are based on hate / bitterness / spite / ego / etc, or (v) make demands that are unreasonable or difficult to satisfy. Let’s say that in the baseline trajectory that alternative trajectories are compared to, AI47 never acts. If AI47 never acts, then someone eventually launches a misaligned AI that destroys the temple stones (and also kills everyone).
Intef and Ahmose both think that if a misaligned AI destroys the stones, then this counts as the stones being destroyed in an accident (comparable from a moral standpoint to the case where the stones are destroyed by an unpreventable natural disaster). Conditioned on a trajectory where the stones are not used to restore the right temple, both prefer a trajectory where the stones are destroyed by accident. (In addition to caring about the ethics of the AI that is in charge, they also care about the stones themselves.) And there is no way for a non-misaligned, clever AI (like AI47) to destroy the stones by accident (in a sense that they would consider to be equivalent to an unpreventable natural disaster). So the set is still empty.
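To keep the structure of this step visible, here is a minimal sketch (all names, scores, and categories are hypothetical illustrations, and the rule from your point 2 is only approximated here as “no surviving constraint-holder may end up worse off than in the baseline”). Even after AI47’s filter removes every constraint in categories (i) through (v), the remaining temple-stone preferences of Intef and Ahmose reject every action that a clever, non-misaligned AI could actually take:

```python
# Minimal illustrative sketch (hypothetical names, scores, and categories).
# Assumption: an action is usable only if it leaves every surviving
# constraint-holder at least as well off as the baseline trajectory.

BASELINE = "no_op_then_misaligned_AI_destroys_stones"  # stones count as destroyed by accident

def intef_score(trajectory):
    # Intef: the stones must be used to rebuild the Middle Kingdom temple.
    if trajectory == "rebuild_middle_kingdom_temple":
        return 2
    if trajectory == BASELINE:
        return 1  # accidental destruction is morally acceptable
    return 0      # a clever, non-misaligned AI acted, and the wrong temple (or no temple) stands

def ahmose_score(trajectory):
    # Ahmose: the same stones must be used to rebuild the New Kingdom temple.
    if trajectory == "rebuild_new_kingdom_temple":
        return 2
    if trajectory == BASELINE:
        return 1
    return 0

# Intef's and Ahmose's preferences refer only to stones and temples, so they
# are not caught by AI47's ignored categories (i) through (v).
surviving_preferences = [intef_score, ahmose_score]

# Actions actually available to a clever, non-misaligned AI. Nothing such an AI
# does can count as the stones being destroyed by accident.
candidate_actions = ["rebuild_middle_kingdom_temple",
                     "rebuild_new_kingdom_temple",
                     "build_two_new_temples_from_new_stones"]

def acceptable_to_everyone(action):
    return all(score(action) >= score(BASELINE) for score in surviving_preferences)

usable = [a for a in candidate_actions if acceptable_to_everyone(a)]
print(usable)  # [] -- the set of actions handed to the delegates is still empty
```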
In other words: even though this is no longer an attempt to find a usable Pareto Baseline that simultaneously satisfies many trillions of hard constraints, a single pair of constraints can still make the set empty. And it is still an attempt to deal with a large set of hard constraints, defined in a great variety of ontologies. It is also still true that (in addition to constraints coming from people like Intef and Bob2) this set will include constraints defined in many ontologies that we will not be able to foresee (including the ontologies of a great variety of non-neurotypical individuals who have been exposed to a great variety of value systems during childhood). This is an unusual feature of the AI context (compared to other contexts that deal with human preferences). A preference defined in an ontology that no one ever imagined might exist has no impact on debates about economic policy. But unless one simply states that a rule should ignore any preference that was not considered by the designers, the quest to find a rule that actually implies a non-empty set must deal with this highly unusual feature of the AI context.
(Intef and Ahmose pose a lot more problems in this step than they pose in the step where delegates are negotiating. In that later step, their delegates have no problematic advantage. Their delegates are also not trying to implement anything worse than extinction. This is probably why this type of person has not been part of any of my past thought experiments. I have not thought deeply about people like Intef and Ahmose)
(There exist several contemporary examples of this general type of disagreement over sacred locations or objects. Even the specific example of reusing temple stones was a common behaviour in many different times and places. But the ancient Egyptians are the undisputed champions of temple stone reuse. And people nowadays don’t really have strong opinions regarding which version of ancient Egyptian religion is the right version. Which is why I think it makes sense to use this example)
(I’m happy to keep exploring this issue. I would not be surprised if this line of inquiry leads to some interesting insight)
(if you are looking for related literature, you might want to take a look at Sen’s “Paradox” of the Paretian liberal (depending on how one defines “pathological preferences”, they may or may not be related to “nosy preferences”))
(Technical note: this discussion makes a series of very optimistic assumptions in order to focus on problems that remain despite these assumptions. For example assuming away a large number of very severe definitional issues. Reasoning from such assumptions does not make sense if one is arguing that a given proposal would work. But it does make sense when one is showing that a given proposal fails, even if one makes such optimistic assumptions. This point also applies to the next section)
Coherent Extrapolation of Equanimous Volition (CEEV)
Summary: In the CEEV proposal described in your point 1, many different types of fanatics would still be represented by delegates that want outcomes where heretics are punished. For example, fanatics that would see a non-punishing AI as unethical. Which means that CEEV still suffers from the problem that was illustrated by the original PCEV thought experiment. In other words: having utility inverting preferences is one possible reason to want an outcome where heretics are punished. Such preferences would not be present in CEEV delegates. But another reason to want an outcome where heretics are punished is a general aversion to unethical AIs. Removing utility inverting preferences from CEEV delegates would not remove their aversion to unethical AIs. Yet another type of sentiment that would be passed on to CEEV delegates is the case where someone would want heretics to be subjected to some specific type of treatment (simply because, all else equal, it would be sort of nice if the universe ended up like this). There are many other types of sentiments along these lines that would also be passed on to CEEV delegates (including a great variety of sentiments that we have no hope of comprehensively cataloguing). Which means that many different types of CEEV delegates would still want an outcome where heretics are hurt. All of those delegates would still have a very dramatic advantage in CEEV negotiations.
Let’s start by noting that fanatics can gain a very dramatic negotiation advantage in delegate negotiations, without being nearly as determined as Gregg or Bob. Unlike the situation discussed in the previous section, in delegate negotiations people just need to weakly prefer an outcome where heretics are subjected to some very unpleasant treatment. In other words: people can gain a very dramatic negotiation advantage simply because they feel that (all else equal) it would be sort of nice to have some type of outcome that for some reason involves bad things happening to heretics.
There exists a great variety of reasons why someone might have such sentiments. In other words: some types of fanatics might lose their negotiation advantage in CEEV. But many types of fanatics would retain their advantage (due to a great variety of preferences defined in a great variety of ontologies). Which in turn means that CEEV suffers from the same basic problem that PCEV suffers from.
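As a toy illustration of why removing one specific list of preference types does not remove the negotiation advantage, consider the sketch below (the tags and preferences are hypothetical, and the real space of human sentiments obviously cannot be enumerated and tagged like this). Filtering out utility inverting preferences leaves the “unethical AI” aversion and the “specific treatment” sentiment untouched, and a delegate holding either of those still prefers outcomes where heretics are hurt:

```python
# Toy sketch (hypothetical tags and preferences; the real space of human
# sentiments cannot actually be enumerated and tagged like this).

# Preferences that a fanatic's delegate might bring to the negotiation,
# tagged by type so that a filtering rule can remove some of them.
fanatic_preferences = [
    {"type": "utility_inverting",
     "description": "the preferences of heretics should be inverted"},
    {"type": "aversion_to_unethical_AI",
     "description": "an AI that does not punish heretics is unethical, and an unethical AI in charge means a Dark Future"},
    {"type": "specific_treatment",
     "description": "all else equal, it would be sort of nice if heretics were subjected to treatment T"},
]

# A CEEV-style filter that removes utility inverting (and similar) preference types.
REMOVED_TYPES = {"utility_inverting", "hostile", "wants_suffering"}

delegate_preferences = [p for p in fanatic_preferences
                        if p["type"] not in REMOVED_TYPES]

for p in delegate_preferences:
    print(p["type"], "->", p["description"])
# Both surviving preferences still make the delegate prefer outcomes where
# heretics are hurt, so the delegate keeps its negotiation advantage.
```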
You mention the possibility that an AI might lie to a fanatic regarding what is happening. But a proposed outcome along such lines would change nothing. CEEV delegates representing fanatics that have an aversion to unethical AIs would, for example, have no reason to accept such an outcome, because the preferences of the fanatics in question are not about their beliefs regarding unethical AIs. Their preferences are about unethical AIs.
In addition to fanatics with an aversion to unethical AIs, we can also look at George, who wants heretics to be punished as a direct preference (without any involvement of preferences related to unethical AIs). George might for example want all heretics to be subjected to some specific treatment (demands that heretics be subjected to some specific treatment are not unusual). No need for anything complicated or deeply felt. George might simply feel that it would be sort of nice if the universe were organised like this (all else equal).
George could also want the details of the treatment to be worked out by a clever AI (without referring to any form of utility inversion or suffering, or even referring in any way to any heretic when specifying the details of the treatment). George might for example want all heretics to be put in whatever situation would make George feel the greatest amount of regret. In other words: this type of demand does not have to be related to any form of utility inversion. The details of the treatment that George would like heretics to be subjected to do not even need to be determined by any form of reference to any heretic. In yet other words: there are many ways for fanatics along the lines of George to gain a very large negotiation advantage in CEEV. (The proposal that CEEV might lie to George about what is happening to heretics would change nothing, because George’s preference is not about George’s beliefs.)
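As a small sketch of the structure of this kind of demand (the function names are hypothetical, and this is only meant to illustrate the shape of the specification): the treatment George demands can be defined entirely in terms of George’s own reaction, so no heretic’s preferences appear anywhere in the definition and nothing is inverted.

```python
# Hypothetical sketch: the demanded treatment is specified entirely in terms of
# George's own regret. No heretic's utility function appears anywhere in the
# specification, and no preference is inverted.
def georges_demanded_treatment(candidate_treatments, georges_regret_if_applied):
    # Pick whichever treatment George would feel the greatest regret about.
    return max(candidate_treatments, key=georges_regret_if_applied)
```

The demand still applies to heretics, but the content of the treatment is fixed without referring to them, so a filter aimed at utility inversion, suffering, or treatment definitions that reference the victim has nothing to catch.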
The type of scenario that you describe, where George might want to see Dave being hurt, is not actually an issue here. Let’s look more generally at George’s preferences regarding George’s experiences, George’s beliefs, George’s world model, etc. None of those pose a problem in the original PCEV proposal (because they do not result in a negotiation advantage for George’s delegate). (We might not have any actual disagreement regarding these types of preferences. I just wanted to be clear about this point).
From the perspective of Steve, the underlying issue with CEEV is that Steve still has no meaningful control over the way in which CEEV adopts those preferences that refer to Steve. Which in turn means that Steve still has no reason to think that CEEV will want to help Steve, as opposed to want to hurt Steve. This point would remain true even if one were to remove additional types of preferences from delegates.
Eliminating some specific list of preference types (for example: utility inverting preferences, hostile preferences, or preferences that people suffer) does not qualitatively change this situation, because eliminating such a list of preference types does not result in Steve gaining meaningful influence regarding the adoption of those preferences that refer to Steve. Which in the case of Parliamentarian Negotiations means that delegates will still want to hurt Steve, for a great variety of reasons (for example due to sentiments along the lines of an aversion to unethical AIs, and also due to a long and varied list of other types of sentiments that we have no hope of exhaustively cataloguing).
In other words: all those delegates that (for reasons related to a great variety of sentiments) still want outcomes where people are subjected to horrific forms of treatment will still have a very large negotiation advantage in CEEV. And such delegates will also have a very large negotiation advantage in any other proposal without the SPADI feature that is based on the idea of eliminating some other specific list of preference types from delegates.
Since this discussion is exploring hypotheticals (as a way of reaching new insights), I’m happy to keep looking at proposals without the SPADI feature. But given the stakes, I do want to make a tangential point regarding plans that are supposed to end with a successfully implemented AI without the SPADI feature (presumably as the end point of some larger plan that includes things along the lines of: an AI pause, augmented humans, an initial Limited AI, etc, etc).
In other words: I am happy to keep analysing proposals without the SPADI feature, because it is hard to predict what one will find when one is pulling on threads like this, and because analysing a dangerous proposal reduces the probability of it being implemented. But I also want to go on a tangent and explain why successfully implementing any AI without the SPADI feature would be extremely bad. And explicitly note that this is true regardless of which specific path one takes to such an AI. And also explicitly note that this is true regardless of whether or not anyone manages to construct a specific thought experiment illustrating the exact way in which things go bad.
Let’s look at a hypothetical future proposal to illustrate these two points. Let’s say that someone proposes a plan that is supposed to eventually lead to the implementation of an AI that gets its preferences from billions of humans. This AI does not have the SPADI feature. Now let’s say that this proposed alignment target avoids the specific issues illustrated by all existing thought experiments. Let’s further say that no one is able to construct a specific thought experiment that illustrates exactly how this novel alignment target proposal would lead to a bad outcome. The absence of a thought experiment that illustrates the specific path to a bad outcome would not in any way, shape, or form imply that the resulting AI does not want to hurt Steve, if such a proposed plan is successfully implemented. In other words: since Steve will have no meaningful influence regarding the adoption of those preferences that refer to Steve, Steve will have no reason to expect the actual resulting AI to want to help Steve, as opposed to want to hurt Steve. PCEV implied a massively-worse-than-extinction outcome even before the specific problem was described (and PCEV spent a lot of years as a fairly popular proposal without anyone noticing the issue).
In yet other words: the actual AI that is actually implied by some proposed set of definitions can end up wanting to hurt Steve, regardless of whether or not someone is able to construct a thought experiment that illustrates the exact mechanism by which this AI will end up wanting to hurt Steve. Which, in combination with the fact that Steve does not have any meaningful influence regarding the adoption of those preferences that refer to Steve, means that Steve has no reason to expect this AI to want to help Steve, as opposed to want to hurt Steve.
In yet other words: the SPADI feature is far from sufficient for basic safety. But it really is necessary for basic safety. Which in turn means that if a proposed AI does not have the SPADI feature, then this AI is known to be extremely bad for human individuals in expectation (if successfully implemented). This is true with or without a specific thought experiment illustrating the specific mechanism that would lead to this AI wanting to hurt individuals. And it is true regardless of what path was taken to the successful implementation of such an AI. (Just wanted to be explicit about these points. Happy to keep analysing proposals without the SPADI feature.)
(you might also want to take a look at this post)