My research focus is Alignment Target Analysis (ATA). I noticed that the most recently published version of CEV (Parliamentarian CEV, or PCEV) gives a large amount of extra influence to people who intrinsically value hurting other individuals. For Yudkowsky’s description of the issue you can search the CEV Arbital page for ADDED 2023.
The fact that no one noticed this issue for over a decade shows that ATA is difficult. If PCEV had been successfully implemented, the outcome would have been massively worse than extinction. I think that this illustrates that scenarios where someone successfully hits a bad alignment target pose a serious risk. I also think that it illustrates that ATA can reduce these risks (noticing the issue reduced the probability of PCEV getting successfully implemented). The reason that more ATA is needed is that PCEV is not the only bad alignment target that might end up getting implemented. ATA is however very neglected. There does not exist a single research project dedicated to ATA. In other words: the reason that I am doing ATA is that it is a tractable and neglected way of reducing risks.
I am currently looking for collaborators. I am also looking for a grant or a position that would allow me to focus entirely on ATA for an extended period of time. Please don’t hesitate to get in touch if you are curious and would like to have a chat, or if you have any feedback, comments, or questions. You can for example PM me here, or PM me on the EA Forum, or email me at thomascederborgsemail@gmail.com (that really is my email address; it’s a Gavagai / Word and Object joke from my grad student days).
My background is physics as an undergrad and then AI research. Links to some papers: P1 P2 P3 P4 P5 P6 P7 P8. (no connection to any form of deep learning). My CV.
(I don’t use LLMs. I have never used an LLM for any part of my writing process)
The problem illustrated by the thought experiment in the post would remain, even if PCEV were to be pointed at a group of humans that does not include any form of religion whatsoever. PCEV would still give the largest influence bonus to whatever individuals in that group are the most determined to hurt others. Given that this feature of PCEV is now known, it is presumably very unlikely that PCEV will ever be implemented in any form. However, the thought experiment in the post can also be used to illustrate a deeper, more foundational problem that is also present in a wide range of other proposals. This more foundational problem is for example also present in many potential modified versions of PCEV, as well as in some related proposals that might still end up successfully implemented. I will briefly describe this more foundational problem below, and I will also link to some more recent posts and comments that include more thorough definitions and arguments. In other words: below I will describe a more foundational problem that is present in PCEV under a wide range of different hypotheses regarding the current human population, and in various potential modified versions of PCEV, and also in many other Sovereign AI proposals (some of which still pose a risk). I will finally describe a more general danger: the danger from situations where a Sovereign AI proposal with some form of hidden problem ends up successfully implemented, before anyone notices this hidden problem.
In yet other words: any successfully implemented PCEV will always give an advantage to any individual that intrinsically values hurting other humans. And any successfully implemented PCEV will always give a larger advantage to individuals that want to inflict more severe harm, compared to individuals that want to inflict less severe harm. This feature of the PCEV proposal will always remain the same, regardless of what group you point PCEV at. (Which is basically equivalent to saying that you should expect this feature to remain the same regardless of what you believe about the current human population). Pointing PCEV at different populations could certainly change the details of the outcome. Many of the details would for example presumably be very different in a scenario where an ASI ends up in the hands of sadistic psychopaths, or in the hands of people with a reflectively stable version of this type of compulsion, compared to a scenario where an ASI ends up in the hands of the religious fanatics described in the post. But this influence bonus feature of PCEV is there no matter what. PCEV will always grant the most influence to whatever subgroup turns out to have the most insatiable, and the most reflectively stable, determination to hurt others.
The fact that PCEV has this influence bonus feature is a sign that PCEV suffers from an underlying problem (a problem that is more general than the specific negotiation rules used in PCEV, and also more general than the type of religious fanatic discussed in the post. More below). And the fact that this feature of PCEV went unnoticed for so many years is a sign of a more general danger (a structural danger that is not strongly related to the specifics of any particular Sovereign AI proposal. Similar to how there exists a structural misalignment-danger connected to race-dynamics, that is not strongly related to the specifics of any particular AI company. More below). Now that this feature of PCEV is known, the danger from scenarios where PCEV ends up successfully implemented has presumably been mostly removed (which shows that reducing this more general danger is actually a tractable research project. More below).
In yet other words: the fact that wanting to hurt others leads to an influence bonus in PCEV shows that something is deeply wrong with this specific Sovereign AI proposal. And the fact that this very problematic feature of PCEV went unnoticed for so many years indicates that there exists a more general structural danger: the danger that a Sovereign AI proposal with a hidden problem might eventually end up getting successfully implemented. And the fact that the specific danger from scenarios where PCEV ends up successfully implemented has been reduced shows that it is possible to mitigate this general class of dangers. Below I will go into a bit more detail on these issues, and also link to some more recent posts and comments (for example arguing that there might not be a lot of time to look for problems with Sovereign AI proposals, and arguing that we might not get a lot of help with this from earlier, non-Sovereign, AIs).
(More generally I think that these topics are genuinely unintuitive and very difficult to reason about. And I think that this difficulty implies a very under-appreciated danger. The fact that this feature of PCEV went unnoticed for so many years is just one incident that makes me think this. This incident is easy to describe and it really should be sufficient for proving the point. But I also have other reasons for thinking that these topics are genuinely unintuitive. Which, combined with the high stakes involved, makes me think that many people are being very reckless about dangers related to successfully implemented Sovereign AI proposals (in a way that is distinct from, and not straightforwardly related to, the way in which some people are being extremely reckless about misalignment-related dangers)).
This post presents a longer argument for the conclusion that the problem with PCEV is just one specific instance of a deeper, underlying, problem. And it also presents a longer argument for the conclusion that this underlying problem is far more general than the specific details of PCEV (or the specific details of any particular type of individual). I include a brief overview below.
Consider the class of scenarios where a successfully implemented Sovereign AI proposal derives its goal entirely from a set of billions of humans. Now consider Steve, a human individual in this set who does not receive any special treatment. If the proposal in question is PCEV, then PCEV will adopt preferences that refer to Steve. And PCEV will do so, using a mechanism that Steve will have no meaningful influence over. Since Steve had no meaningful influence regarding the adoption of those preferences that refer to Steve, there is no particular reason for Steve to think that the resulting AI will want to help him, as opposed to want to hurt him. (In particular, it means that if Steve already knows that he will have no meaningful influence regarding the way that Steve-referring preferences are adopted by PCEV, then there is no rational reason for Steve to be surprised, when he discovers that a successfully implemented PCEV would be extremely bad for him). The post mentioned above argues that this lack of influence over the adoption of self-referring preferences is the actual, underlying, problem that is causing Steve to be hurt. And it argues that this underlying problem is not just present in PCEV.
It’s important to note that this underlying problem is not inherent in the class of scenarios mentioned in the previous paragraph (where a successfully implemented AI Sovereign gets its goal entirely from a set of billions of humans that includes Steve, using a procedure that does not grant Steve any special treatment). There is simply no reason for this setup to result in an AI that adopts Steve-referring preferences in a way that Steve has no control over. Which means that if Steve does face such an AI, then this is a result of a deliberate and optional design choice. Which in turn means that if Steve (and by extension everything that Steve cares about) is subjected to the risks associated with an ASI that has adopted Steve-referring preferences in a way that Steve had no control over, then these risks are the result of specific and optional design choices. In other words: Steve demanding meaningful influence regarding the way in which the AI adopts those preferences that refer to Steve is not a demand for special treatment. Because one would not introduce any contradictions by specifying that every person in the group that the AI derives its goal from has meaningful influence regarding the way that the AI adopts this type of self-preferences. Which in turn means that all risks implied by the lack of such influence are risks that the designers have chosen to introduce, as the predictable side effect of an optional design choice.
In yet other words: Steve being granted meaningful influence regarding the way in which an AI adopts those preferences that refer to Steve, does not imply that Steve has been granted any form of special treatment. A Sovereign AI proposal denying Steve meaningful influence regarding the adoption of those preferences that refer to Steve, is thus an optional and deliberate design choice (even for designers that are restricted to the class of scenarios mentioned above). If the designers were trying to construct a Sovereign AI proposal that is good from the perspective of human individuals, then this seems to me to be a very strange design choice to make. Even before one looks at any further details, this is the type of design choice that one should expect to generate extreme danger to everything that human individuals tend to care about (especially when the designers are operating in a context that is genuinely unintuitive, and where a mistake could lead to very dangerous situations (including situations where an AI Sovereign ends up dominated by whatever subgroup of people turns out to have the most reflectively stable and insatiable determination to demand that the AI hurt other people as much as possible)). The post mentioned above digs into the details, and argues that this design choice would in fact be a very serious mistake from the perspective of essentially any human individual (which is basically equivalent to saying that essentially any human individual that is reading this, should reject proposals that make this design choice). It also argues that common safety-implying intuitions are in fact based on common confusions related to the relationship between groups and individuals.
The post mentioned above does not need to assume much about Steve to show that this design choice would be a serious mistake from Steve’s perspective. The argument for example works if Steve only cares about other humans, and it works if Steve only cares about himself, and it works if Steve cares equally about all sentient lifeforms, etc, etc, etc. (Because an ASI trying to hurt Steve as much as possible would be very good at coming up with clever ways of getting to whatever it is that Steve cares about). The argument against this design choice should thus be compelling for essentially any reader. In other words: if some assumption about Steve is false for a given reader, then the argument might not be compelling for that reader. But the argument really does not need to assume much about Steve, to show that from Steve’s perspective it would be bad if an AI Sovereign were to adopt preferences that refer to Steve from billions of humans, using a mechanism that Steve had no meaningful influence over (partly because an ASI that wants to hurt Steve would be very good at finding clever ways of targeting whatever it is that Steve cares about, even if that ASI is operating under various constraints related to simultaneously trying to do many other things).
We can view this from a different angle. Let’s instead consider Sovereign AI proposals that do in fact grant meaningful influence regarding the way that these types of self-preferences are adopted, to every individual in the set that the AI gets its goal from. To be able to refer to such proposals I have provisionally introduced the rather clumsy acronym: Self Preference Adoption Decision Influence (SPADI), for a Sovereign AI proposal feature. In other words: a given proposal is said to have the SPADI feature iff each person in the group that the AI derives its goal from is given meaningful influence regarding the adoption of these types of self-preferences. The post mentioned above argues that for a Sovereign AI proposal, the SPADI feature is a necessary (but far from sufficient) feature.
(To very briefly give a very rough but slightly more fleshed out explanation of what I mean when I say that the SPADI feature is a necessary Sovereign AI proposal feature: for essentially any human individual that is not granted any special treatment, the fact that a Sovereign AI proposal is known to lack the SPADI feature, is a sufficient reason to reject this proposal. This is not true for all possible minds. If Dave is confident that there is no way for an ASI to hurt anything that Dave cares about, then Dave might not find any of this compelling. If Jeff is extremely focused on inflicting extreme suffering on other individuals at essentially any cost, then including the SPADI feature might be a bad idea from Jeff’s perspective. So the SPADI feature is not in any sense objectively correct. The successful implementation of an AI Sovereign that lacks the SPADI feature is not in any sense objectively bad. Using a disagreement resolution method that is incompatible with the SPADI feature is not in any sense objectively wrong. And the SPADI feature is not a necessary feature from the perspective of all conceivable minds. But from the perspective of essentially any human individual, the SPADI feature is a necessary feature of a Sovereign AI proposal).
The necessity of the SPADI feature has implications when analysing potential versions of PCEV that for example exclude fanatics, or that disregard fanatical sentiments, or that use an extrapolation dynamic that always results in non-fanatical delegates. Or when analysing any other proposal along similar lines. (Or when analysing a scenario where PCEV is pointed at a large group of humans that for example does not contain any fanatics, or that does not contain any hate. Or when analysing any other scenario along similar lines). (Or when reasoning from the assumption that the current human population contains no reflectively stable fanatics. Or when reasoning from any other assumption along similar lines). Even if one were to, for example, remove every single trace of any form of religion from the set of people that PCEV gets its goal from, it would still be rational for Steve to oppose the successful implementation of PCEV. Because the result would still be an AI Sovereign that derives Steve-referring preferences from billions of humans in a way that Steve has no meaningful influence over (which would be extremely dangerous for essentially anything that Steve might care about, due to the ability of an ASI to find clever ways of targeting whatever it is that Steve cares about).
In other words: the various proposals and scenarios mentioned in the previous paragraph all include an AI Sovereign that lacks the necessary SPADI feature. Which, as argued in the post mentioned above, is a sufficient reason for rejecting a proposal, even without a specific thought experiment showing exactly how things would go wrong. Because things can go wrong in reality even if no one is able to describe a specific thought experiment before implementation. Which in turn (in combination with the fact that Steve still has no meaningful influence regarding the adoption of Steve-referring preferences, and the fact that these topics are genuinely unintuitive) means that Steve still has no rational reason to expect that the resulting AI Sovereign will actually want to help Steve, as opposed to want to hurt Steve (which would be very bad from Steve’s perspective, pretty much regardless of what he cares about. Because an ASI trying to hurt Steve would be very good at coming up with clever ways of targeting whatever it is that Steve cares about).
These findings are useful in some cases, but it’s worth noting that there are some important limitations that reduce how useful they actually are in practice. There are two aspects that reduce the usefulness of establishing that the SPADI feature is necessary. The first aspect is that it will often be unclear whether or not a given proposal is describable as having the SPADI feature. The second aspect is that, since the SPADI feature has been identified as necessary but not sufficient, talking about the SPADI feature would not be very useful for someone that is trying to come up with a valid argument in favour of some specific proposed AI project being safe to pursue. Due to these two aspects, the fact that the SPADI feature has been established as necessary is basically only relevant in cases where it is clear that a given proposal does not have the SPADI feature. However, in cases where it actually is clear that a given Sovereign AI proposal does not have the SPADI feature, knowing that the SPADI feature is necessary is actually very informative. It is for example clear that the PCEV proposal does not have the SPADI feature (by any reasonable set of definitions). Which in turn means that these conclusions (about the necessity of the SPADI feature) allow us to reject PCEV (or any modified version of PCEV that still clearly lacks the SPADI feature), even without any specific thought experiment. And being able to reject dangerous proposals is useful for the task of risk mitigation (which is the goal of the research project that these posts are examples of: Alignment Target Analysis (ATA)).
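The asymmetry described above (a necessary but not sufficient feature can only ever be used to reject a proposal, never to approve one) can be sketched in a few lines of Python. This is only a rough illustration of the logic, not a formalisation of SPADI: the names `FeatureStatus` and `verdict_from_necessary_feature` are made up for this sketch, and the two "hypothetical proposal" entries are placeholders rather than real proposals. The only classification taken from the text is that PCEV clearly lacks the SPADI feature.

```python
from enum import Enum


class FeatureStatus(Enum):
    """How a proposal relates to a necessary feature (e.g. SPADI)."""
    CLEARLY_HAS = "clearly has"
    CLEARLY_LACKS = "clearly lacks"
    UNCLEAR = "unclear"


def verdict_from_necessary_feature(status: FeatureStatus) -> str:
    """Return what a necessary-but-not-sufficient feature lets us conclude.

    Clearly lacking the feature is sufficient for rejection.
    Having the feature, or being unclear, tells us nothing about safety.
    """
    if status is FeatureStatus.CLEARLY_LACKS:
        return "reject"
    return "no conclusion"


# Only the PCEV entry comes from the text; the other two are
# hypothetical placeholders used to show the remaining cases.
proposals = {
    "PCEV": FeatureStatus.CLEARLY_LACKS,
    "hypothetical proposal A": FeatureStatus.UNCLEAR,
    "hypothetical proposal B": FeatureStatus.CLEARLY_HAS,
}

verdicts = {
    name: verdict_from_necessary_feature(status)
    for name, status in proposals.items()
}

for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

Note that the function never returns anything like "safe": even a proposal that clearly has the feature only gets "no conclusion", which is the sense in which the feature is necessary but far from sufficient.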
It is also worth noting that PCEV is not the only Sovereign AI proposal that clearly lacks the SPADI feature. The post mentioned above explains why no version of Yudkowsky’s proposal to build an AI Sovereign that implements the CEV of Humanity (CEVH) is describable as having the SPADI feature (the post describes a thought experiment which shows that the core concept of the CEVH proposal is incompatible with the SPADI feature).
Below I will go off on a bit of a tangent, because there are some related points that I would like to make here. This post contains a longer explanation of why it can be useful to discover that a feature is necessary, even if that feature is very far from being sufficient, and even if it is often very unclear whether or not a given proposal has the feature in question. (In brief: Sometimes it is in fact clear that a given proposal definitely does lack a specific feature. And in those cases it is very useful to know that the feature in question is necessary, especially from a risk mitigation perspective). That explanation is outlined in the context of a necessary Membrane formalism feature, but the argument is general enough to also apply to necessary Sovereign AI proposal features. Since the goal of Alignment Target Analysis (ATA) is risk mitigation, identifying necessary but not sufficient features is very valuable (because it reduces the risk from every proposal that clearly lacks a feature that has been identified as necessary. Which in turn means that identifying such features can reduce the risk from potential future Sovereign AI proposals that no one has yet come up with. For example in cases where CEVH is modified in such a way that the new proposal still clearly lacks the SPADI feature).
I will finally say a bit more here about the more general class of dangers that the PCEV thought experiment outlined in the post can be used to illustrate. The danger that PCEV (or some other version of CEVH, or some other Sovereign AI proposal without the SPADI feature) might have ended up successfully implemented is just one example of a much more general danger: the danger from scenarios where we end up with a successfully implemented Sovereign AI proposal that has never been properly analysed. This is a far more general danger, and it can be reduced by pursuing the Alignment Target Analysis (ATA) research direction mentioned above. This post both defines ATA, and also argues that ATA needs to be pursued now. It for example argues that we might not have a lot of time for ATA later. It also argues that we might not get a lot of help with ATA from things along the lines of: non-Sovereign ASIs, non-superintelligent AGIs, non-AGI LLMs, Augmented Humans, etc, etc. See also this comment for an argument that analysing Sovereign AI proposals would be useful from a wide range of perspectives, including from the perspective of someone who is categorically against ever building any form of AI Sovereign. And see this comment for another attempt to clarify the scope of ATA.
This post argues that we might end up with a successfully implemented Sovereign AI proposal without ever having had a lot of time for ATA. This post, this comment, and this comment all contain some discussion that is related to the argument that a non-Sovereign ASI might not be a lot of help with ATA. This comment contains some discussion related to the argument that there might not be a lot of time for ATA, and also some discussion related to the argument that we might not get a lot of help with ATA.