ThomasCederborg

Karma: 187

My research focus is Alignment Target Analysis (ATA). I noticed that the most recently published version of CEV (Parliamentarian CEV, or PCEV) gives a large amount of extra influence to people that intrinsically value hurting other individuals. For Yudkowsky’s description of the issue you can search the CEV arbital page for ADDED 2023.

The fact that no one noticed this issue for over a decade shows that ATA is difficult. If PCEV had been successfully implemented, the outcome would have been massively worse than extinction. I think that this illustrates that scenarios where someone successfully hits a bad alignment target pose a serious risk. I also think that it illustrates that ATA can reduce these risks (noticing the issue reduced the probability of PCEV getting successfully implemented). The reason that more ATA is needed is that PCEV is not the only bad alignment target that might end up getting implemented. ATA is however very neglected. There does not exist a single research project dedicated to ATA. In other words: the reason that I am doing ATA is that it is a tractable and neglected way of reducing risks.

I am currently looking for collaborators. I am also looking for a grant or a position that would allow me to focus entirely on ATA for an extended period of time. Please don’t hesitate to get in touch if you are curious and would like to have a chat, or if you have any feedback, comments, or questions. You can for example PM me here, or PM me on the EA Forum, or email me at thomascederborgsemail@gmail.com (that really is my email address. It’s a Gavagai / Word and Object joke from my grad student days)

My background is physics as an undergrad and then AI research. Links to some papers: P1 P2 P3 P4 P5 P6 P7 P8. (no connection to any form of deep learning)

ThomasCederborg 22 May 2025 5:05 UTC
1 point
0
in reply to: Seth Herd’s comment on: A problem shared by many different alignment targets
It seems to me that we are going in circles and talking past each other to some degree in the discussion above. So I will just briefly summarise my position on the main topics that you raise (I’ve argued for these positions above. Here I’m just summarising). And then I will give a short outline of the argument for analysing Sovereign AI proposals now.
Regarding the relative priority of different research efforts:
The type of analysis that I am doing in the post is designed to reduce one of the serious AI risks that we face. This risk is due to a combination of the fact that (i): we might end up with a successfully implemented Sovereign AI proposal that has not been analysed properly, and the fact that (ii): the successful implementation of a reasonable sounding Sovereign AI proposal might lead to a massively worse than extinction outcome. In other words: reducing the risk of a massively worse than extinction outcome is a tractable research project (specifically: this risk can be reduced by analysing the types of alignment targets that the post is analysing). This research project is currently not being pursued. Other efforts are needed to reduce other types of risks. And it is certainly possible for reasonable people to disagree substantially on how attention would best be allocated. But it still seems very clear to me that the current situation is a serious mistake.
I don’t actually know what the optimal allocation of attention would be. But I have been in contact with a lot of people during the last few years. And I have never gotten any form of pushback when I say that there currently exists exactly zero people in the world dedicated to the type of analysis that I am talking about. So whatever the optimal ratio is, I am confident that the type of analysis that I am advocating for deserves more attention. (It might of course be perfectly reasonable for a given AI safety researcher to decide to not personally pursue this type of analysis. But I am confident that the overall situation is not reasonable. It simply cannot be reasonable to have zero people dedicated to a tractable research project, that reduces the probability of a massively worse than extinction outcome).
Regarding the type of Instruction Following AGI (IFAGI) that you mention:
The successful implementation of such an IFAGI would not reliably prevent a Sovereign AI proposal from being successfully implemented later. And this Sovereign AI proposal might be implemented before it has been properly analysed. Which means that the IFAGI idea does not remove the need for the type of risk-mitigation focused research project that the post is an example of. In other words: Such an IFAGI might not result in a lot of time to analyse Sovereign AI proposals. And such an IFAGI might not be a lot of help when analysing Sovereign AI proposals. So even if we assume that an IFAGI will be successfully implemented, then this would still not remove the need for the type of analysis that I am talking about. (Conditioned on such an IFAGI being successfully implemented, we might get a lot of time. And we might get a lot of help with analysis. But we might also end up in a situation where we do not have much time, and where the IFAGI does not dramatically increase our ability to analyse Sovereign AI proposals)
Regarding perfect solutions and provably safe AI:
I am not trying to do anything along the lines of proving safety. What I am trying to do is better described as trying to prove un-safety. I look at some specific proposed AI project plan. (For example an AI project plan along the lines of: first humans are augmented. Then those augmented humans builds some form of non-Sovereign AI. And then they use that non-Sovereign AI to build an AI Sovereign, that implements the CEV of Humanity). And then I explain why the success of this project would be worse than extinction (in expectation. From the perspective of a human individual. For the reasons outlined in the post). So I am in some sense looking for definitive answers. But more along the lines of provable catastrophe than provable safety. What I am trying to do is a bit like attempting to conclusively determine that a specific location contains a landmine (where a specific AI project plan being successfully implemented, is analogous to a plan that ends with someone standing on the location of a specific landmine). It is very different from attempting to conclusively determine that a specific path is safe. (Just wanted to make sure that this is clear).
A very brief outline of the argument for analysing Sovereign AI proposals now:
Claim 1: We might end up with a successfully implemented AI Sovereign. Even if the first clever thing created is not an AI Sovereign, an AI Sovereign might be developed later. Augmented humans, non-Sovereign AIs, etc, might be followed by an AI Sovereign. (See for example the proposed path to an AI Sovereign described on the CEV arbital page).
Claim 2: In some scenarios that end in a successfully implemented AI Sovereign, we will not get a lot of time to analyse Sovereign AI proposals. (For example due to Internal Time Pressure. See also this subsection for an explanation of why shutting down competing AI projects might not buy a lot of time. See also the last section of this comment, which outlines one specific scenario where a tool-AI successfully shuts down all unauthorised AI projects, but does not buy a lot of time).
Claim 3: In some scenarios that end in a successfully implemented AI Sovereign, we will not get a lot of help with analysis of Sovereign AI proposals. (Partly because asking an AI for a good Sovereign AI proposal is like asking an AI what goal it should have. See also this subsection on the idea of having AI assistants helping with analysis. This subsection and this section argues that augmented humans might turn out to be good at hitting alignment targets, but not good at analysing alignment targets).
Claim 4: A reasonable sounding Sovereign AI proposal might lead to a massively worse than extinction outcome. (See for example the PCEV thought experiment).
Claim 5: Noticing such issues is not guaranteed. (For example illustrated by the fact that the problem with PCEV went unnoticed for many years).
Claim 6: Reducing the probability of such outcomes is possible. Reducing this probability is a tractable research project, because risk can be reduced without finding any good Sovereign AI proposals. (For example illustrated by the present post, or the PCEV thought experiment).
Claim 7: There exists exactly zero people in the world dedicated to this tractable way of reducing the probability of a massively worse than extinction outcome. (It is difficult to prove the non-existence of something. But I have been saying this for quite a while now, while talking to a lot of different people. And I have never gotten any form of pushback on this).
Conclusion: We might end up in a worse than extinction outcome, because a successfully implemented Sovereign AI proposal has a flaw that was realistically findable. It would make sense to spend a non-tiny amount of effort on reducing the probability of this.
(People whose intuition says that this conclusion must surely be false in some way, could try to check whether or not this intuition is actually based on anything real. The most straightforward way would be to spell out the actual argument for this in public, so that the underlying logic can be checked. Acting based on the assumption that such an intuition is based on anything real, without at least trying to evaluate it first, does not sound like a good idea)

ThomasCederborg 12 May 2025 12:41 UTC
1 point
0
in reply to: Vladimir_Nesov’s comment on: Why I am not a successionist
If I understand your comment correctly, then I think that the claim about preference variation made in the second sentence is wrong. Specifically, I think that a significant fraction of people do have strong preferences, such that fulfilling those preferences would be very bad for a significant fraction of people. Some people would for example strongly prefer that earths ecosystems be destroyed (as a way of preventing animal suffering). Others would strongly prefer to protect earths ecosystems. The partitioning idea that you mention would be irrelevant to this value difference. There also exists no level of resources that would make this disagreement go away. (This seems to me to be in contradiction with the claim that is introduced in the second sentence of your comment).
I don’t think that this is an unusual type of value difference. I think that there exists a wide range of value differences such that (i): fulfilling a strong preference of a significant fraction of people leads to a very bad outcome for another significant fraction of people, (ii): partitioning is irrelevant, and (iii): no amount of resources would help.
Let’s for example examine the primary value difference discussed in the Archipelago text that you link to in your earlier comment. Specifically the issue of what religion future generations will grow up in. And what religious taboos future generations will be allowed to violate. Consider a very strict religious community that view apostasy as a really bad thing. They want to have children, and they reject non-natural ways of creating or raising children. One possibility is an outcome where children are indoctrinated to ensure that essentially all children born in this community will be forced to adapt to their very strict rules (which is what would happen if the communities of a partitioned world have autonomy). This would be very bad for everyone who cares about the well-being of those children that would suffer under such an arrangement. To prevent such harm, one would have to interfere in a way that would be very bad for most members of this community (for example by banning them from having children. Or by forcing most members of this community to live with the constant fear of loosing a child to apostasy (in the Archipelago world described in your link, this would for example happen because of the mandatory UniGov lectures that all children must attend)).
We can also consider a currently discussed, near-term, and pragmatic, ethics question: what types of experiences are acceptable for what types of artificial minds? I see no reason to think that people will agree on which features of an artificial mind, implies that this mind must be treated well (or agree on what it means for different types of artificial minds to be treated well). This is partly an empirical issue, so some disagreements might be dissolved by better understanding. But it also has a value component. And there will presumably be a wide range of positions on the value component. Partitioning is again completely irrelevant to these value differences. (And additional resources actually makes the problem worse, not better).
Extrapolated humans will presumably have value differences along these lines about a wide range of issues that have not yet been considered.
There thus seems to exist many cases where variations in preference will in fact result in bad outcomes for a significant fraction of people (in the sense that many cases exists, where fulfilling a strong preference held by a significant fraction, would be very bad for a significant fraction of people).
More straightforwardly, but less exactly, one could summarise my main point as: for many people, many types of bad outcomes cannot be compensated for by giving someone the high-tech equivalent of a big pile of gold. There is no amount of gold that will make Bob ok with the destruction of earths ecosystems. And there is no amount of gold that will make Bill ok with animals suffering indefinitely in protected ecosystems. And the way things are partitioned would rarely be relevant for these types of conflicts. And if resources do happen to matter in some way, then additional resources would usually make things worse. Even more straightforwardly: you cannot compensate for one part of the universe being filled with factory farms by filling a different part of the universe with vegans.
In conclusion, it seems to me that there exists a wide range of strong preferences (each held by a significant fraction of people), such that fulfilling this preference would be very bad for a significant fraction of people. It seems to me that this falsifies the additional claim that you introduce in the second sentence of your comment.
(Some preference differences can indeed be solved with partitioning. So it is indeed possible for a given set of minds to have preference differences without having any conflicts. Which indeed means that the existence of preference differences does not, on its own, conclusively prove the existence of conflict. I am however only discussing the additional claim that you introduce in the second sentence of this comment. In other words: I am narrowly focusing on the novel claim that you seem to be introducing in the second sentence. I am just picking out this specific claim. And pointing out that it is false)
Technical note on scope:
My present comment is strictly limited to exploring the question of how one fraction of the population would feel, if a preference held by another fraction of the population were to be satisfied. I stay within these bounds, partly because any statement regarding what would happen, if an AI were to implement The Outcome that Humanity would like an AI to implement, would be a nonsense statement (because any such statement would implicitly assume that there is some form of free floating Group entity, out there, that has an existence that is separate from any specific mapping or set of definitions (see for example this post and this comment for more on this type of “independently-existing-G-entity” confusion)).
A tangent:
I think that from the perspective of most people, humans actively choosing an outcome would be worse than this outcome happening by accident. Humans deciding to destroy ecosystems will usually be seen as worse than ecosystems being destroyed by accident (similarly to how most people would see a world with lots of murders as worse than a world with lots of unpreventable accidents). The same goes for deciding to actively protect ecosystems despite this leading to animal suffering (similarly to how most people would see a world with lots of torture as worse than a world with lots of suffering from unpreventable diseases).

ThomasCederborg 22 Apr 2025 14:53 UTC
1 point
0
in reply to: Noosphere89’s comment on: A problem shared by many different alignment targets
The successful implementation of an instruction following AI would not remove the possibility that an AI Sovereign will be implemented later. For example: the path to an AI that implements the CEV of Humanity outlined on the CEV arbital page starts with an initial non-Sovereign AI (and this initial AI could be the type of instruction following AI that you mention). In other words: the successful implementation of an instruction following AI does not prevent the later implementation of a Group AI. It is in fact one step on the classical path to a Group AI. So even if we assume that the first clever AI will be an instruction following AI, then this does not remove the need to analyse Sovereign AI proposals. In some scenarios where we end up with a successfully implemented AI Sovereign, we will not have a lot of time for Alignment Target Analysis (ATA), and we will not get a lot of help with ATA. This is dangerous. Because it is currently not possible to reliably tell if a reasonable sounding proposal implies a massively worse than extinction outcome. The last section of this comment goes into more detail about this, and links to previous posts and comments. But first I will respond to your comment about CEV, and try to clarify what type of proposals the present post is analysing.
Regarding your comment about CEV:
One can propose to Coherently Extrapolate the Volition of all sorts of things for all sorts of reasons. The acronym CEV can thus be used as a shorthand for all sorts of AI proposals. Including proposals to build an AI that effectively does whatever one single individual wants that AI to do. And if such a proposal is explained in an unclear way then readers might get the mistaken impression that the proposed AI is supposed to do what Humanity wants that AI to do (in some sense). But such proposals are out of scope of the present post.
None of the alignment targets that I am analysing in the present post are like this. I am instead analysing proposals along the lines of Yudkowsky’s proposal to build an AI that implements the Coherent Extrapolated Volition of Humanity. (To me, it seems that this text is explicitly and unambiguously proposing to build an AI that genuinely does what Humanity wants done (in some sense)). In other words: each proposal analysed in the post is genuinely a proposal to create an AI that is built on top of a CEVH mapping. None of them is a proposal to create an AI that is effectively built on top of a CEVI mapping. In yet other words: they are all, genuinely, Group AI proposals. My point is that such Group AI proposals are either confused (effectively proposing to build an AI that implements the will of a non-existing free floating G entity, with an existence that is implicitly assumed to be separate from any specific mapping or set of definitions). Or such proposals are extremely bad for individuals (which should not be surprising. Because there is no reason to expect that doing what one type of thing wants, would be good for a completely different type of thing).
In other words: all proposals analysed in the post are genuinely Group AI proposals. What I am trying to do is describe a feature shared by all Group AI proposals: that the successful implementation of a Group AI would be extremely bad for individuals. (Since all my readers are individuals, I don’t have to show that the proposals are objectively bad. A given proposal might be very good for some specific Group entity. But all my readers are individuals. So the fact that the AI in question might be good for some arbitrarily defined abstract G entity is not relevant).
I think that it is often possible to differentiate between a proposal based on a CEVH mapping on the one hand, and a misleadingly described proposal based on a CEVI mapping on the other hand. In other words: I think that it is often possible to separate a genuine Group AI proposal, from a proposed AI that effectively does what one individual wants that AI to do (but that is described in such a way that readers might get the mistaken impression that the AI will do what Humanity wants it to do). Let’s say that Dave is proposing to build an AI: Dave’s proposed AI (DAI), that will implement the Coherent Extrapolated Volition of something. And let’s say that we know that Dave is either proposing a genuine Group AI, or else Dave is using a misleading description of an AI that is effectively doing whatever one person wants that AI to do. One way to see what is actually being proposed, is to examine how Dave responds to a certain type of criticism of DAI.
Let’s say that Steve claims that a successfully implemented DAI would be bad for individuals. If Dave is at any point responding to such criticism by implying that this type of criticism is somehow self defeating. Or if Dave ever implies that if the resulting AI is bad for individuals, then this means that it is by definition not successfully implemented. Or if Dave ever implies that if a successfully implemented DAI is bad for individuals, then there must be some sort of definitional issue involved. Then Dave is presumably not thinking about anything along the lines of a CEVI project. Instead, Dave is presumably thinking about a genuine Group AI project. As explained in the post: if Dave defends a Group AI proposal in this way, then Dave is making a mistake. But if Dave defends a CEVI AI proposal in this way, then Dave is making a far more obvious mistake. So if we assume some basic levels of honesty and competence. Then this type of defence of DAI indicates that Dave is proposing an AI that is genuinely meant to do what a group wants that AI to do (in some sense).
A person proposing a CEVI project would presumably never claim that an AI that is bad for individuals, is by definition not successfully implemented (since this would be very obvious nonsense as a response to criticism of a CEVI project). If Steve says that a given AI is a bad idea, then a person proposing a CEVI project would never imply that this is somehow self-defeating or wrong per definition. It is presumably obvious to most people that a successfully implemented CEVI project can be catastrophically bad for individuals (in a way that is not due to any specific detail or definition).
In other words: It is presumably obvious to most people that doing what one individual wants can be bad for other individuals. But it seems to be more difficult to see that doing what an arbitrarily defined abstract Group entity wants, can also be bad for individuals. With this in mind, let’s say that we know that DAI will get its goal entirely from some set of humans. If Dave says that a successfully implemented DAI would per definition be good for individuals, then Dave is presumably not proposing an AI that effectively does what one person would want (because defending such an AI in this way would be obvious nonsense. Because it would be equivalent to claiming that doing what one individual wants can never be bad for other individuals).
In yet other words: consider the case where Dave is making claims along the lines of: if a successfully implemented DAI is bad for individuals, then this implies that DAI is built on top of bad definitions. This behaviour means that Dave is probably not proposing an AI that effectively does what one individual wants the AI to do. Dave’s behaviour is however consistent with Dave proposing a Group AI (and being confused in a less obvious way). It seems very unlikely that someone would claim that doing what one individual wants is per definition good for other individuals. Therefore, anyone that is in fact proposing an AI that would effectively do whatever one person wants, would presumably not make the type of arguments that Dave is making above (assuming a basic level of honesty and competence).
Let’s now look at how someone that is actually proposing a CEVI project might respond to this type of criticism. Mike is proposing to build Mike’s proposed AI (MAI), that will effectively do whatever one individual wants MAI to do. Let’s now say that Steve claims that MAI would be bad for individuals. Mike would presumably never respond by implying that such an outcome is somehow contradictory. Let’s look closer at the claim: “if a successfully implemented MAI hurts Steve, then MAI must be built on top of bad definitions”. This is not just nonsense. It is very obvious nonsense. Saying that if a successfully implemented MAI is catastrophically bad for individuals, then MAI must be built on top of some bad definitions, would be obvious nonsense. Mike would for example presumably never say anything along the lines of: “If you, Steve Smith, can see that MAI would be bad for individuals, then surely a clever AI will see this problem as well”. Mike would never say this, because that statement would be obvious nonsense (assuming a basic level of honesty and competence). It would be equivalent to claiming that doing what one individual wants cannot by definition be bad for other individuals.
My focus is not on people like Mike or proposals along the lines of MAI. My focus is on genuine Group AI proposals. When Dave is claiming that his Group AI proposal cannot by definition be bad for individuals, then Dave is also confused (as I explain at length in the post). But what Dave is saying is evidently not obvious nonsense. Which is why I thought that it made sense to explain exactly why Dave’s statements indicate confusion. Labels being used in problematic ways can make it more difficult to see this type of free-floating-G-entity confusion.
Let’s look at an analogy where the label Gavagai is being used in an inconsistent way to talk about cells and individuals (and where this usage in turn makes it more difficult to notice a type of confusion, that is analogous to the free-floating-G-entity confusion discussed in the post). There is nothing strange about an AI that does what Gregg wants, and in the process destroys every single one of Gregg’s cells (for example because Gregg wants to be uploaded, and would prefer that his cells are not left alive after uploading). This is unsurprising, because there is no reason to be surprised when doing what one type of thing wants (in this case Gregg), is bad for a completely different type of thing (in this case Gregg’s cells). This behaviour is not caused by any specific detail in the proposal to build an AI that does what Gregg wants that AI to do (in other words: the behaviour does not imply any form of implementation or definitional issue).
Claiming that a clever enough AI (that is designed to do what Gregg wants that AI to do), would surely be able to figure out that killing cells is a bad outcome is obvious nonsense. It seems very unlikely that anyone would say anything along the lines of: if you, Steve Smith, can see that the outcome would be bad for cells, then surely a clever AI would be able to see this too. It would be obvious nonsense to claim that if such an AI kills cells, then it is by definition not successfully implemented. If an AI that is meant to do what Gregg wants, ends up killing cells, then there is no reason at all to think that something has gone wrong. Because there is no reason at all to be surprised when doing what one type of thing wants, is bad for a completely different type of thing.
If Bob actually does claim that doing what Gregg wants cannot by definition be bad for Gregg’s cells. Then it would presumably be obvious that Bob is confused. If Allan however uses the label Gavagai to sometimes refer to Gregg, and at other times to refer to Gregg’s cells. Then it might be more difficult to notice that Allan is confused. There are many scenarios where a given statement would be reasonable for both meanings (for example: seatbelts are important to prevent bad things from happening to Gavagai). Let’s now say that Steve claims that an AI that does what Gregg wants this AI to do, will kill every one of Greggs’ cells (if that AI is successfully implemented). Steve further claims that this AI will behave like this, for reasons that are not related to any specific detail in the definitions (in other words: Steve claims that this behaviour is not due to any definitional issues). Now let’s say that Allan claims that any AI that behaves like this, would by definition not be successfully implemented. Allan justifies this with an argument that basically boils down to some claim along the lines of: doing what Gavagai wants cannot by definition be bad for Gavagai. In this case, Allan would be just as confused as Bob. But the fact that Allan is confused might not be equally obvious. Something similar might happen if a confused person is using the label CEV in a discussion about a specific AI proposal, without noticing that the label is being used to refer to different things at different times.
In yet other words: If Dave is at any point responding to criticism of his proposed AI project by implying that criticism is somehow self defeating. Or responds by saying that if his proposed AI is bad for individuals, then this means that the AI is by definition not successfully implemented. Or responds by saying that if individuals are hurt by a successfully implemented AI, then this must be due to some form of definitional problem. Then Dave is not thinking about anything along the lines of a CEVI project. Dave is not proposing an AI that effectively does what one individual wants that AI to do. Such responses would however be expected if Dave is confused, and is proposing an AI project that is genuinely meant to lead to the implementation of a Group AI.
This confusion about abstract Group entities that I keep talking about is making it difficult to discuss a certain class of AI proposals. But the confusion is not the actual problem that I am trying to deal with. The actual problem that I am trying to deal with, is that the proposals in question imply very bad outcomes (for individuals. In expectation. If successfully implemented. For reasons that are not related to any specific implementation detail or definitional choice). This should be an entirely unsurprising discovery. For the same reason that it is entirely unsurprising to discover that an AI that does what Gregg wants this AI to do, would kill every single one of Gregg’s cells (if successfully implemented. For reasons that are not related to any specific implementation detail or definitional choice).
The fact that these proposals would be bad for individuals, is a problem for my intended audience: individuals. In other words: the proposals are not objectively bad. A given proposal can be very good for whatever arbitrarily defined abstract Group entity is implied by the chosen set of definitions. But I am addressing human individuals. So I don’t try to argue that proposals are objectively bad. I only need to show that they are bad for human individuals. The confusion is problematic because it is making it difficult to discuss the real issue: the existence of AI proposals that imply massively worse than extinction outcomes. As I have argued elsewhere, there exists many plausible paths along which a Sovereign AI proposal might end up getting successfully implemented, before much progress has been made on understanding Sovereign AI proposals. This is the topic of the next section.
Regarding your comment about what type of alignment target is most likely to be pursued first:
My claim is that it is important to start analysing Sovereign AI proposals now. A reasonable sounding proposal might lead to a massively worse than extinction outcome (if successfully implemented). Noticing this is not guaranteed (as illustrated by the fact that PCEV was a fairly famous proposal for a long time, without anyone noticing that PCEV implies such an outcome). The relevant question for evaluating the claim that Sovereign AI proposals need to be analysed now, is whether or not a Sovereign AI proposal might at some point be successfully implemented. The prior implementation of some other AI is not necessarily relevant. Such an earlier AI is only relevant if that AI removes the need to analyse Sovereign AI proposals now. For example because it is assumed that such an AI would buy sufficient time. Or because it is assumed that such an AI would be very useful when analysing Sovereign AI proposals.
I have argued elsewhere that the successful implementation of a Limited AI (for example an instruction following AI) might not actually result in sufficient time (for example in this post, this post, and the last section of this comment). I have also argued that a successfully implemented Limited AI might not be very helpful when analysing Sovereign AI proposals (for example in my reply to the Seth Herd comment that you mention, and in this post). (One issue is that trying to ask a clever AI to find a good Sovereign AI proposal, is equivalent to trying to ask an AI what goal it should have). Relatedly, this post argues that augmented humans might not be very helpful when analysing Sovereign AI proposals (in brief: even if the augmentation process results in a mind that is better than baseline humans at hitting alignment targets, this does not mean that the resulting mind will be good at analysing alignment targets).
Analysing Sovereign AI proposals now might end up having no impact at all on the outcome. One unsurprising scenario is that a misaligned AI kills everyone for reasons that are completely unrelated to what the designers were trying to do. Another possibility is that an augmented human (or an AI assistant) will propose some new way of looking at things, that renders all past work by non-augmented humans irrelevant (and that is not building on such past work). But it is not safe to assume that anything along these lines will actually happen (and the position that analysing Sovereign AI proposals now is not needed, is implicitly built on top of such an assumption).
The basic problem is that it is clearly not currently possible to reliably determine whether or not a reasonable sounding proposal implies a massively worse than extinction outcome (as for example illustrated by the PCEV incident). This seems like a really bad place to be. It is also true that making incremental progress on risk reduction is a tractable research goal (as for example illustrated by the present post). In other words: we might end up with a massively worse than extinction outcome as a result of a successfully implemented AI Sovereign. The probability of this can be reduced by analysing Sovereign AI proposals. (See also this comment for an attempt to clarify the type of risk reduction that ATA is trying to accomplish).
A summary and an analogy: There exists many plausible paths that ends in a successfully implemented AI Sovereign. Even if the first AI is not an AI Sovereign, we might eventually end up with a successfully implemented AI Sovereign anyway. And we might not have a lot of time to analyse such proposals. And we might not get a lot of help with such analysis. Thus, analysing Sovereign AI proposals now reduces the probability of a massively worse than extinction outcome. This remains true despite the fact that ATA might end up having no impact on the outcome. Let’s take an analogy with Bill who finds himself in a war-zone and decides to look for a good bulletproof vest. The vest will not help if Bill is shot in the head. It is also possible that Bill will get shot in the stomach with a high caliber weapon that no vest can stop. But vests are still very popular amongst people that find themselves in war-zones. Because if you do get shot in the stomach, then it really is very nice to be wearing a vest (and the quality of that vest might make a huge difference).
Getting shot in the stomach is here analogous to a situation where (i): it becomes possible to successfully implement a Sovereign AI (for example with the help of something along the lines of: an instruction following superintelligent AI, augmented humans, a non-superintelligent AI assistant, etc), (ii): there is a time crunch (for example due to Internal Time Pressure), (iii): the ability to do ATA has not increased dramatically, and (iv): there exists a Sovereign AI proposal that implies an outcome massively worse than extinction (for example along the lines of the outcome implied by PCEV. But due to a problem that is more difficult to notice). It could be that this problem is simply not realistically findable (analogous to the high caliber weapon). But it could also be the case that the problem in question is realistically findable in time, iff ATA progress has advanced to a level that is realistically achievable (analogous to a bullet that is possible to stop with a realistically findable vest). This is what makes me think that the current situation (where there exists exactly zero people in the word dedicated to ATA) is a serious mistake.

ThomasCederborg 3 Mar 2025 17:45 UTC
1 point
0
in reply to: Martin Randall’s comment on: A problem shared by many different alignment targets
The proposals described in your points 1 and 2 are very different from any of the proposals that I am analysing in the post. I consider this to be a good thing. But I wanted to note explicitly that this discussion has now moved very far away from what was discussed in the post, and is best seen as a new discussion (a discussion that starts with the proposals described in your points 1 and 2). Making this clear is important, because it means that many points made in the post (and also earlier in this thread) do not apply to the class of proposals that we are now discussing.
In particular: all alignment targets analysed in the post are Group AIs. But the alignment target described in your point 1: Coherent Extrapolation of Equanimous Volition (CEEV), is not a Group AI. Given that the primary focus of the post is to analyse the Group AI idea, the analysis of CEEV below is best seen as starting a completely new discussion. Among other things, this means that many arguments from the post about Group AIs will probably not apply to CEEV. (CEEV is still very bad for individuals. Because it is still the case that no individual has any meaningful influence regarding the way in which CEEV adopts those preferences that refer to her. One specific issue is that some CEEV delegates will still prefer outcomes where heretics are punished, because some delegates will still have an aversion to unethical AIs. The issue is described in detail in the last section of this comment).
The rule for deciding which actions are available to Delegates during negotiations, described in your point 2, is also a large departure from anything discussed in the post. The described rule would accept actions, even though those actions would make things dramatically worse for some people. I think that this makes it a very different kind of rule, compared to Davidad’s proposed Pareto Improvement rule. The points that I made about Pareto Improvements in the post, and earlier in this thread, do not apply to this new class of rules. (The set of actions is still rendered empty by the rule described in your point 2, due to a large and varied set of hard constraints demanding that the AI must not be unethical. A single pair of such demands can render the set empty, by having incompatible views regarding what it means for an AI to be unethical. Some pairs of demands like this have nothing to do with utility inversion. The issue is described in detail in the next section of this comment).
It also makes sense to explicitly note here that with the rule described in your point 2, you have now started to go down the path of removing entire classes of constraints from consideration (as opposed to going down the path of looking for new Pareto Baselines). Therefore, my statement that the path that you are exploring is unlikely to result in a non-empty set no longer applies. That statement was expressing doubt about finding a usable Pareto Baseline that would result in a non-empty set. But in my view you are now doing something very different (and far more interesting) than looking for a usable Pareto Baseline that would result in a non-empty set.
I will spend most of this comment talking about the proposals described in your points 1 and 2. But let’s first try to wrap up the previous topics, starting with Bob2. Bob2 is only different from Bob in the sense that Bob2 does not see an AI that literally never acts as a person. I don’t see why Bob2′s way of looking at things would be strange or unusual. A thing that literally never acts can certainly be seen as a person. But it doesn’t have to be seen as a person. Both perspectives seem reasonable. These two different classifications are baked into a core value, related to the Dark Future concept. (In other words: Bob and Bob2 have different values. So there is no reason to think that learning new facts would make them agree on this point. Because there is no reason to think that learning new facts would change core values). In a population of billions, there will thus be plenty of people that share Bob2′s way of looking at such an AI. So if the AI is pointed at billions of humans, the set of Pareto Improvements will be rendered empty by people like Bob2 (relative to the alternative no-AI-action Pareto Baseline that you discussed here).
Now let’s turn to your point about the size of the action space. Most of my previous points probably do not apply to rules that will ignore entire classes of constraints (such as the “pathological constraints” that you mention). In that case everything depends on how one defines this class of constraints. Rules that do ignore classes of constraints are discussed in the next section of this comment. However: for rules that do not ignore any constraints, the number of actions is not necessarily relevant (in other words: while we are still talking about Pareto Improvements, the number of actions is not necessarily relevant). One can roughly describe the issue as: If one constraint demands X. And another constraint refuses X. Then the set is empty. Regardless of the number of actions.
I’m not sure whether or not there is any significant disagreement left on this issue. But I will still elaborate a bit more on how I see the original situation (the situation where pathological constraints are not ignored).
One can say that everything is short circuited by the fact that humans often have very strong opinions about who should be in charge. (And there are many different types of ontologies that are compatible with such sentiments. Which means that we can expect a great variety in terms of what this implies regarding demands about the AI). Wanting the right type of person to be in charge can be instrumental. But it does not have to be instrumental. And there is nothing unusual about demanding things that are entirely symbolic. (In other words: there is nothing unusual about Dennis, who demands that The Person in Charge must do or value things that have no connection with the material situation of Dennis).
This is not part of every ontology. But caring about who is in charge is a common human value (at least common enough for a population of billions to include a great variety of hard constraints related to this general type of sentiment). The number of actions does not help if one person rejects all trajectories where the person in charge is X. And another person rejects any trajectory unless the person in charge is X. (Combined with the classification of a trajectory that contains a non-misaligned and clever AI, that takes any action, as a trajectory where the first such AI is in charge). (I don’t know if we actually disagree on anything here. Perhaps you would classify all constraints along these lines as Pathological Constraints). (In the next section I will point out that while such incompatible pairs can be related to utility inversion. They do not have to be.)
I will first discuss the proposal described in your point 2 in the next section, and then discuss the proposal described in your point 1 in the last section (because finding the set of actions that are available to delegates happens before delegates start negotiating).
The rule for determining which set of actions will be included in negotiations between delegates
The rule described in your point 2 still results in an empty set, for the same reason that Davidad’s original Pareto Improvement rule results in an empty set. The rule described in your point 2 still does not remove the problem of Bob from the original thought experiment of the post. Because the thing that Bob objects to is an unethical AI. The issue is not about Bob wanting to hurt Dave, or about Bob wanting to believe that the AI is ethical (or that Bob might want to believe that Dave is punished. Or that Bob might want to see Dave being punished). The issue is that Bob does not want the fate of humanity to be determined by an unethical AI.
Demanded punishments also do not have to refer to Dave’s preferences. It can be the case that Gregg demands that Dave’s preferences are inverted. But it can also be the case that Gregg demands that Dave be subjected to some specific treatment (and this can be a treatment that Dave will categorically reject). There is nothing unexpected about a fanatic demanding that heretics be subjected to a specific type of treatment. It is not feasible to eliminate all “Problematic Constraints” along these lines by eliminating some specific list of constraint types (for example along the lines of: utility inverting constraints, or hostile constraints, or demands that people suffer). Which in combination with the fact that Dave still has no meaningful influence over those constraints that are about Dave, means that there is still nothing preventing someone from demanding that things happen to Dave, that Dave finds completely unacceptable. A single such constraint is sufficient for rendering the action space empty (regardless of the size of the action space).
When analysing this type of rule it might actually be best to switch to a new type of person, that has not been part of my past thought experiments. Specifically: the issue with the rule described in your point 2 can also be illustrated using a thought experiment that does not involve any preferences that in any way refer to any human. The basic situation is that two people have incompatible demands regarding how an AI must interact with a specific sacred place or object, in order for the AI to be considered acceptable.
Let’s take ancient Egyptian religion as an example in order to avoid contemporary politics. Consider Intef who was named after the Pharaoh who founded Middle Kingdom Egypt, and Ahmose who was named after the Pharaoh who founded New Kingdom Egypt. They both consider it to be a moral imperative to restore temples to their rightful state (if one has the power to do so). But they disagree on when Egyptian religion was right, and therefore disagree on what the AI must do to avoid being classified as unethical (in the sense of the Dark Future concept).
Specifically: a Middle Kingdom temple was destroyed and the stones were used to build a New Kingdom temple. Later that temple was also destroyed. Intef considers it to be a moral imperative to use the stones to rebuild the older temple (if one has the ability to do so). And Ahmose considers it to be a moral imperative to use the same stones to rebuild the newer temple (if one has the ability to do so). Neither of them thinks that an unethical AI is acceptable (after the AI is classified as unethical the rest of the story follows the same path as the examples with Bob or Bob2). So the set would still be empty, even if a rule simply ignores every constraint that in any way refers to any human.
Neither of these demands are in any way hostile (or vicious, or based in hate, or associated with malevolent people, or belligerent, or anything else along such lines). Neither of these demands is on its own problematic or unreasonable. On its own, either of these demands is in fact trivial to satisfy (the vast majority of people would presumably be perfectly ok with either option). And neither of these demands looks dangerous (nor would they result in an advantage in Parliamentarian Negotiations). Very few people would watch the world burn rather than let Intef use the original stones to rebuild his preferred temple. But it only takes one person like Ahmose to make the set of actions empty.
Let’s go through another iteration and consider AI47 who uses a rule that ignores some additional constraints. When calculating whether or not an action can be used in delegate negotiations, AI47 ignores all preferences that (i): refer to AI47 (thus completely ignoring all demands that AI47 not be unethical), or (ii): refer to any human, or (iii): are dangerous, or (iv): are based on hate / bitterness / spite / ego / etc / etc, or (v): make demands that are unreasonable or difficult to satisfy. Let’s say that in the baseline trajectory that alternative trajectories are compared to, AI47 never acts. If AI47 never acts, then this would lead to someone eventually launching a misaligned AI that would destroy the temple stones (and also kill everyone).
Intef and Ahmose both think that if a misaligned AI destroys the stones, then this counts as the stones being destroyed in an accident (comparable from a moral standpoint to the case where the stones are destroyed by an unpreventable natural disaster). Conditioned on a trajectory where the stones are not used to restore the right temple, both prefer a trajectory where the stones are destroyed by accident. (In addition to caring about the ethics of the AI that is in charge, they also care about the stones themselves.). And there is no way for a non-misaligned, clever AI (like AI47), to destroy the stones by accident (in a sense that they would consider to be equivalent to an unpreventable natural disaster). So the set is still empty.
In other words: even though this is no longer an attempt to find a usable Pareto Baseline that simultaneously satisfies many trillions of hard constraints, a single pair of constraints can still make the set empty. And it is still an attempt to deal with a large set of hard constraints, defined in a great variety of ontologies. It is also still true that (in addition to constraints coming from people like Intef and Bob2) this set will also include constraints defined in many ontologies that we will not be able to foresee (including the ontologies of a great variety of non-neurotypical individuals, that have been exposed to a great variety of value systems during childhood). This is an unusual feature of the AI context (compared to other contexts that deal with human preferences). A preference defined in an ontology that no one ever imagined might exist, has no impact on debates about economic policy. But unless one simply states that a rule should ignore any preference that was not considered by the designers, then the quest to find a rule that actually implies a non-empty set, must deal with this highly unusual feature of the AI context.
(Intef and Ahmose pose a lot more problems in this step, than they pose in the step where delegates are negotiating. In that later step, their delegates have no problematic advantage. Their delegates are also not trying to implement anything worse than extinction. This is probably why this type of person has not been part of any of my past thought experiments. I have not thought deeply about people like Intef and Ahmose)
(There exists several contemporary examples of this general type of disagreements over sacred locations or objects. Even the specific example of reusing temple stones was a common behaviour in many different times and places. But the ancient Egyptians are the undisputed champions of temple stone reuse. And people nowadays don’t really have strong opinions regarding which version of ancient Egyptian religion is the right version. Which is why I think it makes sense to use this example)
(I’m happy to keep exploring this issue. I would not be surprised if this line of inquiry leads to some interesting insight)
(if you are looking for related literature, you might want to take a look at the Sen ``Paradox″ (depending on how one defines “pathological preferences”, they may or may not be related to “nosy preferences”))
(Technical note: this discussion makes a series of very optimistic assumptions in order to focus on problems that remain despite these assumptions. For example assuming away a large number of very severe definitional issues. Reasoning from such assumptions does not make sense if one is arguing that a given proposal would work. But it does make sense when one is showing that a given proposal fails, even if one makes such optimistic assumptions. This point also applies to the next section)
Coherent Extrapolation of Equanimous Volition (CEEV)
Summary: In the CEEV proposal described in your point 1, many different types of fanatics would still be represented by delegates that want outcomes where heretics are punished. For example fanatics that would see a non-punishing AI as unethical. Which means that CEEV still suffers from the problem that was illustrated by the original PCEV thought experiment. In other words: having utility inverting preferences is one possible reason to want an outcome where heretics are punished. Such preferences would not be present in CEEV delegates. But another reason to want an outcome where heretics are punished is a general aversion to unethical AIs. Removing utility inverting preferences from CEEV delegates would not remove their aversion to unethical AIs. Yet another type of sentiment that would be passed on to CEEV delegates, is the case where someone would want heretics to be subjected to some specific type of treatment (simply because, all else equal, it would be sort of nice if the universe ended up like this). There are many other types of sentiments along these lines that would also be passed on to CEEV delegates (including a great variety of sentiments that we have no hope of comprehensively cataloguing). Which means that many different types of CEEV delegates would still want an outcome where heretics are hurt. All of those delegates would still have a very dramatic advantage in CEEV negotiations.
Let’s start by noting that fanatics can gain a very dramatic negotiation advantage in delegate negotiations, without being nearly as determined as Gregg or Bob. Unlike the situation discussed in the previous section, in delegate negotiations people just need to weakly prefer an outcome where heretics are subjected to some very unpleasant treatment. In other words: people can gain a very dramatic negotiation advantage simply because they feel that (all else equal) it would be sort of nice to have some type of outcome, that for some reason involves bad things happening to heretics.
There exists a great variety of reasons for why someone might have such sentiments. In other words: some types of fanatics might lose their negotiation advantage in CEEV. But many types of fanatics would retain their advantage (due to a great variety of preferences defined in a great variety of ontologies). Which in turn means that CEEV suffers from the same basic problem that PCEV suffers from.
You mention the possibility that an AI might lie to a fanatic regarding what is happening. But a proposed outcome along such lines would change nothing. CEEV delegates representing fanatics that have an aversion to unethical AIs would for example have no reason to accept such an outcome. Because the preferences of the fanatics in question is not about their beliefs regarding unethical AIs. Their preferences are about unethical AIs.
In addition to fanatics with an aversion to unethical AIs, we can also look at George, who wants heretics to be punished as a direct preference (without any involvement of preferences related to unethical AIs). George might for example want all heretics to be subjected to some specific treatment (demands that heretics be subjected to some specific treatment are not unusual). No need for anything complicated or deeply felt. George might simply feel that it would be sort of nice if the universe would be organised like this (all else equal).
George could also want the details of the treatment to be worked out by a clever AI (without referring to any form of utility inversion or suffering. Or even referring in any way to any heretic, when specifying the details of the treatment). George might for example want all heretics to be put in whatever situation, that would make George feel the greatest amount of regret. In other words: this type of demand does not have to be related to any form of utility inversion. The details of the treatment that George would like heretics to be subjected to, does not even need to be determined by any form of reference to any heretic. In yet other words: there are many ways for fanatics along the lines of George to gain a very large negotiation advantage in CEEV. (The proposal that CEEV might lie to George about what is happening to heretics would change nothing. Because George’s preference is not about George’s beliefs.)
The type of scenario that you describe, where George might want to see Dave being hurt, is not actually an issue here. Let’s look more generally at George’s preferences regarding George’s experiences, George’s beliefs, George’s world model, etc. None of those pose a problem in original PCEV (because they do not result in a negotiation advantage for George’s delegate). (We might not have any actual disagreement regarding these types of preferences. I just wanted to be clear about this point).
From the perspective of Steve, the underlying issue with CEEV is that Steve still has no meaningful control over the way in which CEEV adopts those preferences that refer to Steve. Which in turn means that Steve still has no reason to think that CEEV will want to help Steve, as opposed to want to hurt Steve. This point would remain true even if one were to remove additional types of preferences from delegates.
Eliminating some specific list of preference types (for example along the lines of: utility inverting preferences, or hostile preferences, or preferences that people suffer, etc) does not qualitatively change this situation. Because eliminating such a list of preference types does not result in Steve gaining meaningful influence regarding the adoption of those preferences that refer to Steve. Which in the case of Parliamentarian Negotiations means that delegates will still want to hurt Steve, for a great variety of reasons (for example due to sentiments along the lines of an aversion to unethical AIs. And also due to a long and varied list of other types of sentiments, that we have no hope of exhaustively cataloguing).
In other words: all those delegates that (for reasons related to a great variety of sentiments) still want outcomes where people are subjected to horrific forms of treatment, will still have a very large negotiation advantage in CEEV. And such delegates will also have a very large negotiation advantage in any other proposal without the SPADI feature, that is based on the idea of eliminating some other specific list of preference types from delegates.
Since this discussion is exploring hypotheticals (as a way of reaching new insights), I’m happy to keep looking at proposals without the SPADI feature. But given the stakes, I do want to make a tangential point regarding plans that are supposed to end with a successfully implemented AI without the SPADI feature (presumably as the end point of some larger plan that includes things along the lines of: an AI pause, augmented humans, an initial Limited AI, etc, etc).
In other words: I am happy to keep analysing proposals without the SPADI feature. Because it is hard to predict what one will find when one is pulling on threads like this. And because analysing a dangerous proposal reduces the probability of it being implemented. But I also want to go on a tangent and explain why successfully implementing any AI without the SPADI feature would be extremely bad. And explicitly note that this is true regardless of which specific path one takes to such an AI. And also explicitly note that this is true, regardless of whether or not anyone manages to construct a specific thought experiment illustrating the exact way in which things go bad.
Let’s look at a hypothetical future proposal to illustrate these two points. Let’s say that someone proposes a plan that is supposed to eventually lead to the implementation of an AI that gets its preferences from billions of humans. This AI does not have the SPADI feature. Now let’s say that this proposed alignment target avoids the specific issues illustrated by all existing thought experiments. Let’s further say that no one is able to construct a specific thought experiment that illustrates exactly how this novel alignment target proposal would lead to a bad outcome. The absence of a thought experiment that illustrates the specific path to a bad outcome, would not in any way shape or form imply that the resulting AI does not want to hurt Steve, if such a proposed plan is successfully implemented. In other words: since Steve will have no meaningful influence regarding the adoption of those preferences that refer to Steve, Steve will have no reason to expect the actual resulting AI to want to help Steve, as opposed to want to hurt Steve. PCEV implied a massively worse than extinction outcome, also before the specific problem was described (and PCEV spent a lot of years as a fairly popular proposal without anyone noticing the issue).
In yet other words: the actual AI, that is actually implied, by some proposed set of definitions, can end up wanting to hurt Steve, regardless of whether or not someone is able to construct a thought experiment that illustrates the exact mechanism by which this AI will end up wanting to hurt Steve. Which in combination with the fact that Steve does not have any meaningful influence regarding the adoption of those preferences that refer to Steve, means that Steve has no reason to expect this AI to want to help Steve, as opposed to want to hurt Steve.
In yet other words: the SPADI feature is far from sufficient for basic safety. But it really is necessary for basic safety. Which in turn means that if a proposed AI does not have the SPADI feature, then this AI is known to be extremely bad for human individuals in expectation (if successfully implemented). This is true with or without a specific thought experiment illustrating the specific mechanism that would lead to this AI wanting to hurt individuals. And it is true regardless of what path was taken to the successful implementation of such an AI. (Just wanted to be explicit about these points. Happy to keep analysisng proposals without the SPADI feature.)
(you might also want to take a look at this post)

ThomasCederborg 9 Feb 2025 20:23 UTC
3 points
0
in reply to: Martin Randall’s comment on: A problem shared by many different alignment targets
There are no Pareto improvements relative to the new Pareto Baseline that you propose. Bob would indeed classify a scenario with an AI that takes no action as a Dark Future. However, consider Bob2, who takes another perfectly coherent position on how to classify an AI that never acts. If something literally never takes any action, then Bob2 simply does not classify it as a person. Bob2 therefore does not consider a scenario with an AI that literally never does anything to be a Dark Future (other than this difference, Bob2 agrees with Bob about morality). This is also a perfectly reasonable ontology. A single person like Bob2 is enough to make the set of Pareto Improvements relative to your proposed Pareto Baseline empty.
(As a tangent, I just want to explicitly note here that this discussion is about Pareto Baselines. Not Negotiation Baselines. The negotiation baseline in all scenarios discussed in this exchange is still Yudkowsky’s proposed Random Dictator negotiation baseline. The Pareto Baseline is relevant to the set of actions under consideration in the Random Dictator negotiation baseline. But it is a distinct concept. I just wanted to make this explicit for the sake of any reader that is only skimming this exchange)
The real thing that you are dealing with is a set of many trillions of hard constraints, defined in billions of ontologies (including a large number of non-standard ontologies. Some presumably a lot more strange than the ontologies of Bob and Bob2). The concept of a Pareto Improvement was really not designed to operate in a context like this. It seems to me that it has never been properly explored in a context like this. I doubt that anyone has ever really thought deeply about how this concept would actually behave in the AI context. Few concepts have actually been properly explored in the AI context (this is related to the fact that the Random Dictator negotiation baseline actually works perfectly fine in the context that it was originally designed for: a single individual trying to deal with Moral Uncertainty. Something similar is also true for the Condorcet Criterion. The intuition failures that seem to happen when people move concepts from CEVI style mappings to CEVH style mappings is also related. Etc, etc, etc. It simply does not seem to exist a workable alternative, to actually exploring a concept, in whatever AI context that one wants to use it in. Simply importing concepts from other contexts, just does not seem to be a reliable way of doing things. This state of affairs is extremely inconvenient).
Let’s consider the economist Erik, who claims that Erik’s Policy Modification (EPM) is a Pareto Improvement over current policy. Consider someone pointing out to Erik that some people want heretics to burn in hell, and that EPM would be bad for such people, since it would make life better for heretics in expectation. If Erik does decide to respond, he would presumably say something along the lines of: it is not the job of economic policy to satisfy people like this. He probably never explicitly decided to ignore such people. But his entire field is based on the assumption that such people do not need to be taken into consideration when outlining economic policy. When having a political argument about economic policy, such people are in fact not really an obstacle (if they do participate, they will presumably oppose EPM with arguments that do not mention hellfire). The implicit assumption that such positions can be ignored thus holds in the context of debating economic policy. But this assumption breaks when we move the concept to the AI context (where every single type of fanatic is informed, extrapolated, and actually given a very real, and absolute, veto over every single thing that is seen as important enough).
Let’s look a bit at another Pareto Baseline that might make it easier to see the problem from a different angle (this thought experiment is also relevant to some straightforward ways in which one might further modify your proposed Pareto Baseline in response to Bob2). Consider the Unpleasant Pareto Baseline (UPB). In UPB the AI implements some approximation of everyone burning in hell (specifically: the AI makes everyone experience the sensation of being on fire for as long as it can). It turns out that it only takes two people to render the set of Pareto Improvements relative to UPB empty: Gregg and Jeff from my response to Davidad’s comment. Both want to hurt heretics, but they disagree about who is a heretic. Due to incompatibilities in their respective religions, every conceivable mind is seen as a heretic by at least one of them. Improving the situation of a heretic is Not Allowed. Improving the situation of any conceivable person, in any conceivable way, is thus making things worse from the perspective of at least one of them.
Gregg and Jeff do have to be a lot more extreme than Bob or Bob2. They might for example be non-neurotypical (for example sharing a condition that has not yet been discovered). And raised in deeply religious environments, whose respective rules they have adopted in an extremely rigid way. So they are certainly rare. But there only needs to be two people like this for the set of Pareto Improvements relative to UPB to be empty. (presumably no one would ever consider building an AI with UPB as a Pareto Baseline. This thought experiment is not meant to illustrate any form of AI risk. It’s just a way of illustrating a point about attempting to simultaneously satisfy trillions of hard constraints, defined in billions of ontologies)
(I really appreciate you engaging on this in such a thorough and well thought out manner. I don’t see this line of reasoning leading to anything along the lines of a workable patch or a usable Pareto Baseline. But I’m very happy to keep pulling on these threads, to see if one of them leads to some interesting insight. So by all means: please keep pulling on whatever loose ends you can see)

ThomasCederborg 9 Feb 2025 19:03 UTC
3 points
0
in reply to: Seth Herd’s comment on: A problem shared by many different alignment targets
Given that you agreed with most of what I said in my reply, it seems like you should also agree that it is important to analyse these types of alignment targets. But in your original comment you said that you do not think that it is important to analyse these types of alignment targets.
Let’s write Multi Person Sovereign AI Proposal (MPSAIP) for an alignment target proposal to build an AI Sovereign that gets its goal from the global population (in other words: the type of alignment target proposals that I was analysing in the post). I followed your links and can only find one argument against the urgency of analysing MPSAIPs now: that an Instruction Following AI (IFAI) would make this unnecessary. I can see why one might expect that an IFAI would help to some degree when analysing MPSAIPs. But I don’t see how the idea of an IFAI could possibly remove the urgent need to analyse MPSAIPs now.
In your post on distinguishing value alignment from intent alignment, you define value alignment as being about all of humanity’s long term, implicit deep values. It thus seems like you are not talking about anything along the lines of building an AI that will do whatever some specific person wants that AI to do. Please correct me if I’m wrong, but your position thus seems to be built on top of the assumption that it would be safe to assume that an IFAI can be used to solve the problem of how to describe all of humanity’s long term, implicit deep values.
A brief summary of why I think that this is false: You simply cannot delegate the task of picking a goal to an AI (no matter how clever this AI is). You can define the goal indirectly and have the AI work out the details. But the task is simply not possible to delegate. For the same reason: you simply cannot delegate the task of picking a MPSAIP to an AI (no matter how clever this AI is). You can define things indirectly and have the AI work out the details. This is equivalent to fully solving the field of MPSAIP analysis. It would for example necessarily involve defining some procedure for dealing with disagreements amongst individuals that disagree on how to deal with disagreements (because individuals will not agree on which MPSAIP to pick). PCEV is one such procedure. It sounds reasonable but would lead to an outcome far worse than extinction. VarAI is another procedure that sounds reasonable but that is in fact deeply problematic. As shown in the post, this is not easy (partly because intuitions about well known concepts tend to break when transferred to the AI context). In other words: you can’t count on an IFAI to notice a bad MPSAIP, for the same reason that you can’t count on Clippy to figure out that it has the wrong goal.
How useful would an IFAI be for analysing MPSAIPs?
I can see why one might think that an IFAI would be somewhat useful. But I don’t see how one can be confident that it would be very useful (let alone be equivalent to a solution). If one does not hold this position, then the existence of an IFAI does not remove the need to analyse MPSAIPs now. (The idea that an IFAI might be counted on to buy sufficient time to analyse MPSAIPs is covered below, in the section where I answer your question about an AI pause).
The idea that an IFAI would be extremely useful for Alignment Target Analysis seems to be very common. But there is never any actual reason given for why this might be true. In other words: while I have heard similar ideas many times, I have never been able to get any actual argument in favour of the position, that an IFAI would be very useful for analysing MPSAIPs (by you or by anyone else). It is either implicit in some argument, or just flatly asserted. There seems to be two versions of this idea. One version is the delegation plan. In other words: the plan where one builds an IFAI that does know how to describe all of humanity’s long term, implicit deep values. The other version is the assistant plan. In other words: the plan where one builds an IFAI that does not know how to describe all of humanity’s long term, implicit deep values (and then uses that IFAI as an assistant while analysing MPSAIPs). I will cover them separately below.
The delegation plan: The scenario where an IFAI does know how to define all of humanity’s long term, implicit deep values
I don’t know how this plan could possibly remove the need for analysing MPSAIPs now. I don’t know why anyone would believe this (similarly to how I don’t know why anyone would believe that Clippy can be counted on to figure out that it has the wrong goal). It is clearly a common position. But as far as I am aware, there exists no positive argument for this position. Without any actual argument in favour of this position, it is a bit tricky to argue against this position. But I will do my best.
A preliminary point is that the task of picking one specific mapping, that maps from billions of humans to an entity of the type that can be said to want things, is not a technical task with a findable solution (see the post for much more on this). In yet other words: if one were to actually describe in detail the argument that one can delegate the task of analysing MPSAIPs to an IFAI, then one would run into a logical problem (if one tried to actually spell out the details step by step, one would be unable to do so). The problem one would run into, would be the same problem that one would run into if one were to try to argue that Clippy will figure out that it has the wrong goal (if one tried to actually spell out the details step by step, one would be unable to do so). Neither finding the correct goal nor analysing MPSAIPs is a technical task with a findable solution. Thus, neither task can be delegated to an AI, no matter how clever it is.
Let’s say that we have an IFAI that is able to give an answer, when you ask it how to describe all of humanity’s long term, implicit deep values. This is equivalent to the IFAI having already picked a specific MPSAIP.
I see only two ways of arriving at such an IFAI. One is that something has gone wrong, and the IFAI has settled on an answer by following some process that the designers did not intend it to follow. This is a catastrophic implementation failure. In other words: unless the plan was for the IFAI to choose an MPSAIP using some unknown procedure, the project has not gone according to plan. In this case I see no particular reason to think that the outcome would be any better than the horrors implied by PCEV.
The only other option that I see is that the designers have already fully solved the problem of how to define all of humanity’s long term, implicit deep values (presumably indirectly, by defining a process that leads to such a definition). In other words: if one plans to build an IFAI like this, then one has to fully solve the entire field of analysing MPSAIPs, before one builds the IFAI. In yet other words: if this is the plan, then this plan is an argument in favour of the urgent need to analyse MPSAIPs.
The assistant plan: The scenario where an IFAI does not know how to define all of humanity’s long term, implicit deep values
To conclude that analysing MPSAIPs now is not urgent, one must assume that this type of IFAI assistant is guaranteed to have a very dramatic positive effect (a somewhat useful IFAI assistant would not remove the urgent need for analysing MPSAIPs now). It seems to be common to simply assume that an IFAI assistant will basically render prior work on analysing MPSAIPs redundant (the terminology differs. And it is often only implicit in some argument or plan. But the assumption is common). I have however never seen any detailed plan for how this would actually be done. (The situation is similar to how the delegation plan is never actually spelled out). I think that as soon as one were to lay out the details of how this would work, one would realise that one has a plan that is built on top of an incorrect assumption (similar to the type of incorrect implicit assumption that one would find, if one were to spell out the details of why exactly Clippy can be counted on to realise that it has the wrong goal).
It is difficult to argue against this position directly, since I don’t know how this IFAI is supposed to be used (let alone why this would be guaranteed to have a large positive effect). But I will try to at least point to some difficulties that one would run into.
Let’s say that Allan is asking the IFAI questions, as a step in the process of analysing MPSAIPs. Every question Allan asks of an IFAI like this would pose a very dramatic risk. Allan is leaning heavily on a set of definitions, for example definitions of concepts like Explanation and Understanding. Even if those definitions have held up while the IFAI was used to do other things (such as shutting down competing AI projects), those definitions could easily break when discussing MPSAIPs. Since the IFAI does not know what a bad MPSAIP is, the IFAI has no way of noticing that it is steering Allan towards a catastrophically bad MPSAIP. Regardless of how clever the IFAI is, there is simply no chance of it noticing this. Just as there is no chance of Clippy discovering that it has the wrong goal.
In other words: if a definition of Explanation breaks during a discussion with an IFAI, and Allan ends up convinced that he must implement PCEV, then we will end up with the horrors implied by PCEV. (If you think that the IFAI will recognise the outcome implied by PCEV as a bad outcome, then you are imagining the type of IFAI that was treated in the previous subsection (and such an IFAI can only be built after the field of analysing MPSAIPs have been fully solved)). This was previously discussed here and here (using different terminology).
(To be clear: this subsection is not arguing against the plan of building an IFAI of this type. And it is not arguing against the idea that this type of IFAI might be somewhat useful. It is not even arguing against the idea that it might be possible to use an IFAI like this in a way that dramatically increases the ability to analyse MPSAIPs. It is simply arguing against the idea that one can be sure that an IFAI like this will in fact be used in a way that will dramatically increase the ability to analyse MPSAIPs. This is enough to show that the IFAI idea does not remove the urgent need to analyse MPSAIPs now).
Regarding the probability of a pause
The probability of a politically enforced pause is not important for any argument that I am trying to make. Not much changes if we replace a politically enforced pause with an IFAI. Some group of humans will still decide what type of Sovereign AI will eventually be built. If they successfully implement a bad Sovereign AI proposal, then the outcome could be massively worse than extinction. So it makes sense to reduce the probability of that. One tractable way of reducing this probability is by analysing MPSAIPs.
In other words: if you achieve a pause by doing something other than building an AI Sovereign (for example by implementing a politically enforced pause, or by using an IFAI). Then the decision of what AI Sovereign to eventually build will remain in human hands. So then you will still need progress on analysing MPSAIPs to avoid bad Sovereign AI proposals. There is no way of knowing how long it will take to achieve the needed level of such progress. And there is no way of knowing how much time a pause will actually result in. So, even if we did know exactly what method will be used to shut down competing projects. And we also knew exactly who will make decisions regarding Sovereign AI. Then there is still no way of knowing that there will be sufficient time to analyse MPSAIPs. Therefore, such analysis should start now. (And as illustrated by my post, such progress is tractable).
One point that should be made here, is that you can end up with a multipolar world even if there is a single IFAI that flawlessly shuts down all unauthorised AI projects. If a single IFAI is under the control of some set of existing political power structures, then this would be a multipolar world. Regardless of who is in control (for example the UN Security Council (UNSC), the UN general assembly, or some other formalisation of global power structures), it is still possible for some ordinary political movement to gain power over the IFAI, by ordinary political means. Elected governments can be voted out. Governments along the lines of the USSR can evidently also be brought down by ordinary forms of political movements. So there is in general nothing strange about someone being in control of an IFAI, but finding themselves in a situation where they must either act quickly and decisively, or risk permanently losing control to people with different values. This means that shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure.
Let’s consider the scenario where a UNSC resolution is needed to ask the IFAI a question, or to ask the IFAI to do something (such as shutting down competing AI projects, or launching an AI Sovereign). There is currently an agreement of what AI Sovereign to build. But there is also an agreement that it would be good to first analyse this proposal a bit more, to make sure there is no hidden problem with it. In this case, losing control of any of the five countries with a veto would remove the ability to launch an AI Sovereign (if control is lost to a novel and growing political movement, then control could be lost permanently. The result of losing control of one permanent UNSC member could mean that a deadlock will persist until the new movement eventually controls all five). So, the people currently in control would basically have to either act quickly or risk permanently losing power to people with different values. If they decide to aim at their preferred MPSAIP, then it would be very nice if the field of analysing MPSAIPs had progressed to the point where it is possible to notice that this MPSAIP implies an outcome worse than extinction (for example along the lines of the outcome implied by PCEV. But presumably due to a harder-to-notice problem).
I used the UNSC as an example in the preceding paragraph, because it seems to me like the only legal way of taking the actions that would be necessary to robustly shut down all competing AI projects (being the only legal option, and thus a sort of default option, makes it difficult to rule out this scenario). But the same type of Internal Time Pressure might also arise in other arrangements. This comment outlines a scenario where a global electorate is in charge (which seems like another reasonable candidate for how to define what it means to do the default thing). This post outlines a scenario where a group of augmented humans are in charge (in that scenario buying time is achieved by uploading. Not by shutting down competing AI projects. This seems like something that someone might do if they don’t feel comfortable with using force. But simultaneously don’t feel ready to take the decision to give up control to some specific political process).
The reason that I keep going on about the need for Alignment Target Analysis (ATA) is that there seems to currently exist exactly zero people in the world devoted to doing ATA full time. Making enough ATA progress to reduce the probability of bad outcomes is also tractable (trying to solve ATA would be a completely different thing. But there still exists a lot of low hanging fruit in terms of ATA progress that reduces the probability of bad outcomes). It thus seems entirely possible to me that we will end up with a PCEV style catastrophe that could have been easily prevented. Reducing the probability of that seems like a reasonable thing to do. But it is not being done.
An attempt to summarise how I view the situation
At our current level of ATA progress it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction. I simply don’t see how one can think that it is safe to stay at this level of progress. Intuitively this seems like a dangerous situation. The fact that there exists no research project dedicated to improving this situation seems like a mistake (as illustrated by my post, reducing the probability of bad outcomes is a tractable research project). It seems like many people do have some reason for thinking that the current state of affairs is acceptable. As far as I can tell however, these reasons are not made public. This is why I think that it makes sense to spend time on trying to figure out what you believe to be true, and why you believe it to be true (and this is also why I appreciate you engaging on this).
In other words: arguing that ATA should be a small percentage of AI safety work would be one type of argument. Arguing that the current situation is reasonable would be a fundamentally different type of argument. It is clearly the case that plenty of people are convinced that it is reasonable to stay at the current level of ATA progress (in other words: many of people are acting in a way that I can only explain if I assume that they feel very confident, that it is safe to stay at our current level of ATA progress). I think that they are wrong about this. But since no argument in favour of this position is ever outlined in detail, there is no real way of addressing this directly.
PS:
I’m fine with continuing this discussion here. But it probably makes sense to at least note that it would have fitted better under this post (which makes the case for analysing this type of alignment targets. And actually discusses the specific topic of why various types of Non-Sovereign-AIs would not replace doing this now). As a tangent, the end of that post actually explicitly asked people to outline their reasons for thinking that ATA now is not needed. Your response here seems to be an example of this. So I very much appreciate your engagement on this. In other words: I don’t think you are the only one that have ideas along these lines. I think that there are plenty of people with similar ways of looking at things. And I really wish that those people would clearly outline their reasons for thinking that the current situation is reasonable. Because I think that those reasons will fall apart if they are outlined in any actual detail. So I really appreciate that you are engaging on this. And I really wish that more people would do the same.

ThomasCederborg 30 Jan 2025 9:16 UTC
3 points
0
in reply to: Martin Randall’s comment on: A problem shared by many different alignment targets
I’m sorry if the list below looks like nitpicking. But I really do think that these distinctions are important.
Bob holds 1 as a value. Not as a belief.
Bob does not hold 2 as a belief or as a value. Bob thinks that someone as powerful as the AI has an obligation to punish someone like Dave. But that is not the same as 2.
Bob does not hold 3 as a belief or as a value. Bob thinks that for someone as powerful as the AI, the specific moral outrage in question renders the AI unethical. But that is not the same as 3.
Bob does hold 4 as a value. But it is worth noting that 4 does not describe anything load-bearing. The thought experiment would still work even if Bob did not think that the act of creating an unethical agent that determines the fate of the world is morally forbidden. The load-bearing part is that Bob really does not want the fate of the world to be determined by an unethical AI (and thus prefers the scenario where this does not happen).
Bob does not hold 5 as a belief or as a value. Bob prefers a scenario without an AI, to a scenario where the fate of the world was determined by an unethical AI. But that is not the same as 5. The description I gave of Bob does not in any way conflict with Bob thinking that most morally forbidden acts can be compensated for by expressing sincere regret at some later point in time. The description of Bob would even be consistent with Bob thinking that almost all morally forbidden acts can be compensated for by writing a big enough check. He just thinks that the specific moral outrage in question, directly means that the AI committing it is unethical. In other words: other actions are simply not taken into consideration, when going from this specific moral outrage, to the classification of the AI as unethical. (He also thinks that a scenario where the fate of the world is determined by an unethical AI is really bad. This opinion is also not taking any other aspects of the scenario into account. Perhaps this is what you were getting at with point 5).
I insist on these distinctions because the moral framework that I was trying to describe, is importantly different from what is described by these points. The general type of moral sentiment that I was trying to describe is actually a very common, and also a very simple, type of moral sentiment. In other words: Bob’s morality is (i): far more common, (ii): far simpler, and (iii): far more stable, compared to the morality described by these points. Bob’s general type of moral sentiment can be described as: a specific moral outrage renders the person committing it unethical in a direct way. Not in a secondary way (meaning that there is for example no summing of any kind going on. There is no sense in which the moral outrage in question is in any way compared to any other set of actions. There is no sense in which any other action plays any part whatsoever when Bob classifies the AI as unethical).
In yet other words: the link from this specific moral outrage to classification as unethical is direct. The AI doing nice things later is thus simply not related in any way to this classification. Plenty of classifications work like this. Allan will remain a murderer, no matter what he does after committing a murder. John will remain a military veteran, no matter what he does after his military service. Jeff will remain an Olympic gold winner, no matter what he does after winning that medal. Just as for Allan, John, and Jeff, the classification used to determine that the AI is unethical is simply not taking other actions into account.
The classification is also not the result of any real chain of reasoning. There is no sense in which Bob first concludes that the moral outrage in question should be classified as morally forbidden, followed by Bob then deciding to adhere to a rule which states that all morally forbidden things should lead to the unethical classification (and Bob has no such a rule).
This general type of moral sentiment is not universal. But it is quite common. Lots of people can think of at least one specific moral outrage that leads directly to them viewing a person committing it as unethical (at least when committed deliberately by a grownup that is informed, sober, mentally stable, etc). In other words: lots of people would be able to identify at least one specific moral outrage (perhaps out of a very large set of other moral outrages). And say that this specific moral outrage directly implies that the person is unethical. Different people obviously do not agree on which subset of all moral outrages should be treated like this (even people that agree on what should count as a moral outrage can feel differently about this). But the general sentiment where some specific moral outrage simply means that the person committing it is unethical is common.
The main reason that I insist on the distinction is that this type of sentiment would be far more stable under reflection. There are no moving parts. There are no conditionals or calculations. Just a single, viscerally felt, implication. Attached directly to a specific moral outrage. For Bob, the specific moral outrage in question is a failure to adhere to the moral imperative to punish people like Dave.
Strong objections to the fate of the world being determined by someone unethical are not universal. But this is neither complex nor particularly rare. Let’s add some details to make Bob’s values a bit easier to visualise. Bob has a concept that we can call a Dark Future. It is basically referring to scenarios where Bad People win The Power Struggle and manage to get enough power to choose the path of humanity (powerful anxieties along these lines seem quite common. And for a given individual it would not be at all surprising if something along these lines eventually turn into a deeply rooted, simple, and stable, intrinsic value).
A scenario where the fate of the world is determined by an unethical AI is classified as a Dark Future (again in a direct way). For Bob, the case with no AI does not classify as a Dark Future. And Bob would really like to avoid a Dark Future. People who thinks that it is more important to prevent bad people from winning than to prevent the world from burning might not be very common. But there is nothing complex or incoherent about this position. And the general type of sentiment (that it matters a lot who gets to determine the fate of the word) seems to be very common. Not wanting Them to win can obviously be entirely instrumental. An intrinsic value might also be overpowered by survival instinct when things get real. But there is nothing surprising about something like this eventually solidifying into a deeply held intrinsic value. Bob does sound unusually bitter and inflexible. But there only needs to be one person like Bob in a population of billions.
To summarise: a non punishing AI is directly classified as unethical. Additional details are simply not related in any way to this classification. A trajectory where an unethical AI determines the fate of humanity is classified as a Dark Future (again in a direct way). Bob finds a Dark Future to be worse than the no AI scenario. If someone were to specifically ask him, Bob might say that he would rather see the world to burn than see Them win. But if left alone to think about this, the world burning in the non-AI scenario is simply not the type of thing that is relevant to the choice (when the alternative is a Dark Future).
Regarding the probability that extrapolation will change Bob:
First I just want to again emphasise that the question is not if extrapolation will change one specific individual named Bob. The question is whether or not extrapolation will change everyone with these types of values. Some people might indeed change due to extrapolation.
My main issue with the point about moral realism is that I don’t see why it would change anything (even if we only consider one specific individual, and also assume moral realism). I don’t see why discovering that The Objectively Correct Morality disagrees with Bob’s values would change anything (I strongly doubt that this sentence means anything. But for the rest of this paragraph I will reason from the assumption that it both does mean something, and that it is true). Unless Bob has very strong meta preferences related to this, the only difference would presumably be to rephrase everything in the terminology of Bob’s values. For example: extrapolated Bob would then really not want the fate of the world to be determined by an AI that is in strong conflict with Bob’s values (not punishing Dave directly implies a strong value conflict. The fate of the world being determined by someone with a strong value conflict directly implies a Dark Future. And nothing has changed regarding Bob’s attitude towards a Dark Future). As long as this is stronger than any meta preferences Bob might have regarding The Objectively Correct Morality, nothing important changes (Bob might end up needing a new word for someone that is in strong conflict with Bob’s values. But I don’t see why this would change Bob’s opinion regarding the relative desirability of a scenario that contains a non-punishing AI, compared to the scenario where there is no AI).
I’m not sure what role coherence arguments would play here.
Regarding successor AIs:
It is the AI creating these successor AIs that is the problem for Bob (not the successor AIs themselves). The act of creating a successor AI that is unable to punish is morally equivalent to not punishing. It does not change anything. Similarly: the act of creating a lot of human level AIs is in itself determining the fate of the world (even if these successor AIs do not have the ability to determine the fate of the world).
Regarding the last paragraph that talks about finding a clever solution:
I’m not sure I understand this paragraph. I agree that if the set is not empty, then a clever AI will presumably find an action that is a Pareto Improvement. I am not saying that there exists an action that is a Pareto Improvement, but that this action is difficult to find. I am saying that at least one person will demand X and that at least one person will refuse X. Which means that a clever AI will just use its cleverness to confirm that the set is indeed empty.
I’m not sure that the following is actually responding to something that you are saying (since I don’t know if I understand what you mean). But it seems relevant to point out that the Pareto constraint is part of the AIs goal definition. Which in turn means that before determining the members of the set of Pareto Improvements, there is no sense in which there exists a clever AI that is trying to make things work out well. In other words: there does not exist any clever AI, that has the goal of making the set non-empty. No one has, for example, an incentive to tweak the extrapolation definitions to make the set non-empty.
Also: in the proposal in question, extrapolated delegates are presented with a set. Their role is then supposed to be to negotiate about actions in this set. I am saying that they will be presented with an empty set (produced by an AI that has no motivation to bend rules to make this set non-empty). If various coalitions of delegates are able to expand this set with clever tricks, then this would be a very different proposal (or a failure to implement the proposal in question). This alternative proposal would for example lack the protections for individuals, that the Pareto constraint is supposed to provide. Because the delegates of various types of fanatics could then also use clever tricks to expand the set of actions under consideration. The delegates of various factions of fanatics could then find clever ways of adding various ways of punishing heretics into the set of actions that are on the table during negotiations (which brings us back to the horrors implied by PCEV). Successful implementation of Pareto PCEV implies that the delegates are forced to abide by the various rules governing their negotiations (similar to how successful implementation of classical PCEV implies that the delegates have successfully been kept in the dark regarding how votes are actually settled).
A few tangents:
This last section is not a direct response to anything that you wrote. In particular, the points below are not meant as arguments against things that you have been advocating for. I just thought that this would be a good place to make a few points, that are related to the general topics that we are discussing in this thread (there is no post dedicated to Pareto PCEV, so this is a reasonable place to elaborate on some points related specifically to PPCEV).
I think that if one only takes into account the opinions of a group that is small enough for a Pareto Improvement to exist, then the outcome would be completely dominated by people that are sort of like Bob, but that are just barely possible to bribe (for the same reason that PCEV is dominated by such people). The bribe would not primarily be about resources, but about what conditions various people should live under. I think that such an outcome would be worse than extinction from the perspective of many people that are not part of the group being taken into consideration (including from the perspective of people like Bob. But also from the perspective of people like Dave). And it would just barely be better than extinction for many in that group.
I similarly think that if one takes the full population, but bend the rules until one gets a non-empty set of things that sort of looks close to Pareto Improvements, then the outcome will also be dominated by people like Bob (for the same reason that PCEV is dominated by people like Bob). Which in turn implies a worse-than-extinction outcome (in expectation, from the perspective of most individuals).
In other words: I think that if one goes looking for coherent proposals that are sort of adjacent to this idea, then one would tend to find proposals that implies very bad outcomes. For the same reasons that proposals along the lines of PCEV implies very bad outcomes. A brief explanation of why I think this: if one tweaks this proposal until it refers to something coherent, then Steve has no meaningful influence regarding the adoption of those preferences that refer to Steve. Because when one is transforming this into something coherent, then Steve cannot retain influence over everything that he cares about strongly enough (as this would result in overlap). And there is nothing in this proposal that gives Steve any special influence regarding the adoption of those preferences that refer to Steve. Thus, in adjacent-but-coherent proposals, Steve will have no reason to expect that the resulting AI will want to help Steve, as opposed to want to hurt Steve.
It might also be useful to zoom out a bit from the specific conflict between what Bob wants and what Dave wants. I think that it would be useful to view the Pareto constraint as many individual constraints. This set of constraints would include many hard constraints. In particular, it would include many trillions of hard individual-to-individual constraints (including constraints coming from a significant percentage of the global population, that have non-negotiable opinions regarding the fates of billions of other individuals). It is an equivalent but more useful way of representing the same thing. (In addition to being quite large, this set would also be very diverse. It would include hard constraints from many different kinds of non-standard minds. With many different kinds of non-standard ways of looking at things. And many different kinds of non-standard ontologies. Including many types of non-standard ontologies that the designers never considered). We can now describe alternative proposals where Steve gets a say regarding those constraints that only refer to Steve. If one is determined to start from Pareto PCEV, then I think that this is a much more promising path to explore (as opposed to exploring different ways of bending the rules until every single hard constraint can be simultaneously satisfied).
I also think that it would be a very bad idea to go looking for an extrapolation dynamic that re-writes Bob’s values in a way that makes Bob stop wanting Dave to be punished (or that makes Bob bribable). I think that extrapolating Bob in an honest way, followed by giving Dave a say regarding those constraints that refer to Dave, is a more promising place to start looking for ways of keeping Dave safe from people like Bob. I for example think that this is less likely to result in unforeseen side effects (extrapolation is problematic enough without this type of added complexity. The option of designing different extrapolation dynamics for different groups of people is a bad option. The option of tweaking an extrapolation dynamic that will be used on everyone, with the intent of finding some mapping that will turn Bob into a safe person, is also a bad option).

ThomasCederborg 20 Jan 2025 3:12 UTC
3 points
0
in reply to: Martin Randall’s comment on: A problem shared by many different alignment targets
Bob really does not want the fate of the world to be determined by an unethical AI. There is no reason for such a position to be instrumental. For Bob, this would be worse than the scenario with no AI (in the Davidad proposal, this is the baseline that is used to determine whether or not something is a Pareto-improvement). Both scenarios contain non-punished heretics. But only one scenario contains an unethical AI. Bob prefers the scenario without an unethical AI (for non-instrumental reasons).
Regarding extrapolation:
The question is whether or not at least one person will continue to view a non-punishing AI as unethical after extrapolation. (When determining whether or not something is a Pareto-improvement, the average fanatic is not necessarily relevant).
Many people would indeed presumably change their minds regarding the morality of at least some things (for example when learning new facts). For the set of Pareto-improvements to be empty however, you only need two people: a single fanatic and a single heretic.
In other words: for the set to be empty it is enough that a single person continues to view a single other person (that we can call Dave), as being deserving of punishment (in the sense that an AI has a moral obligation to punish Dave). The only missing component is then that Dave has to object strongly to being punished for being a heretic (this objection can actually also be entirely based on moral principles). Just two people out of billions need to take these moral positions for the set to be empty. And the building blocks that make up Bob’s morality are not actually particularly rare.
The first building block of Bob’s morality is that of a moral imperative (the AI is seen as unethical for failing to fulfill its moral obligation to punish heretics). In other words: if someone finds themselves in a particular situation, then they are viewed as having a moral obligation to act in a certain way. Moral instincts along the lines of moral imperatives are fairly common. A trained firefighter might be seen as having important moral obligations if encountering a burning building with people inside. An armed police officer might be seen as having important moral obligations if encountering an active shooter. Similarly for soldiers, doctors, etc. Failing to fulfill an important moral obligation is fairly commonly seen as very bad.
Let’s take Allan, who witnesses a crime being committed by Gregg. If the crime is very serious, and calling the police is risk free for Allan, then failing to call the police can be seen as a very serious moral outrage. If Allan does not fulfill this moral obligation, it would not be particularly unusual for someone to view Allan as deeply unethical. This general form of moral outrage is not rare. Not every form of morality includes contingent moral imperatives. But moralities that do include such imperatives are fairly common. There is obviously a lot of disagreements regarding who has what moral obligations. Just as there are disagreements regarding what should count as a crime. But the general moral instinct (that someone like Allan can be deeply unethical) is not exotic or strange.
The obligation to punish bad people is also not particularly rare. Considering someone to be unethical because they get along with a bad person is not an exotic or rare type of moral instinct. It is not universal. But it is very common.
And the specific moral position that heretics deserve to burn in hell is actually quite commonly expressed. We can argue about what percentage of people saying this actually means it. But surely we can agree that there exist at least some people that actually mean what they say.
The final building block in Bob’s morality is objecting to having the fate of the world be determined by someone unethical. I don’t think that this is a particularly unusual thing to object to (on entirely non-instrumental grounds). Many people care deeply about how a given outcome is achieved.
Some people that express positions along the lines of Bob might indeed back down if things get real. I think that for some people, survival instinct would in fact override any moral outrage. Especially if the non-AI scenario is really bad. Some fanatics would surely blink when coming face to face with any real danger. (And some people will probably abandon their entire moral framework in a heartbeat, the second someone offers them a really nice cake). But for at least some people, morality is genuinely important. And you only need one person like Bob, out of billions, for the set to be empty.
So, if Bob is deeply attached to his moral framework. And the moral obligation to punish heretics is a core aspect of his morality. And this aspect of his morality is entirely built from ordinary and common types of moral instincts. Then an extrapolated version of Bob would only accept a non-punishing AI, if this extrapolation method has completely rewritten Bob’s entire moral framework (in ways that Bob would find horrific).

ThomasCederborg 19 Jan 2025 4:41 UTC
3 points
0
in reply to: Martin Randall’s comment on: A problem shared by many different alignment targets
Consider Bob, who takes morality very seriously. Bob thinks that any scenario where the fate of the world is determined by an unethical AI, is worse than the scenario with no AI. Bob sticks with this moral position, regardless of how much stuff Bob would get in a scenario with an unethical AI. For a mind as powerful as an AI, Bob considers it to be a moral imperative to ensure that heretics do not escape punishment. If a group contains at least one person like Bob (and at least one person that would strongly object to being punished), then the set of Pareto-improvements is empty. In a population of billions, there will always exist at least some people with Bob’s type of morality (and plenty of people that would strongly object to being punished). Which in turn means that for humanity, there exist no powerful AI, such that creating this AI would be a Pareto-improvement.

ThomasCederborg 16 Jan 2025 22:28 UTC
1 point
0
in reply to: Seth Herd’s comment on: A problem shared by many different alignment targets
I do think that it’s important to analyse alignment targets like these. Given the severe problems that all of these alignment targets suffer from, I certainly hope that you are right about them being unlikely. I certainly hope that nothing along the lines of a Group AI will ever be successfully implemented. But I do not think that it is safe to assume this. The successful implementation of an instruction following AI would not remove the possibility that an AI Sovereign will be implemented later. The CEV arbital page actually assumes that the path to a Group AI goes through an initial limited AI (referred to as a Task AI). In other words: the classical proposed path to an AI that implements the CEV of Humanity actually starts with an initial AI that is not an AI Sovereign (and such an AI could for example be the type of instruction following AI that you mention). In yet other words: your proposed AI is not an alternative to a Group AI. Its successful implementation does not prevent the later implementation of a Group AI. Your proposed AI is in fact one step in the classical (and still fairly popular) proposed path to a Group AI.
I actually have two previous posts that were devoted to making the case for analysing the types of alignment targets that the present post is focusing on. The present post is instead focusing on doing such analysis. This previous post outlined a comprehensive argument in favour of analysing these types of alignment targets. Another previous post specifically focused on illustrating that Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure. See also this comment where I discuss the difference between proposing solutions on the one hand, and pointing out problems on the other hand.
Charbel-Raphaël responded to my post by arguing that no Sovereign AI should ever be created. My reply pointed out that this is mostly irrelevant to the question at hand. The only relevant question is whether or not a Sovereign AI might be successfully implemented eventually. If that is the case, then one can reduce the probability of some very bad outcomes by doing the type of Alignment Target Analysis that my previous two posts were arguing for (and that the present post is an example of). The second half of this reply (later in the same thread) includes a description of an additional scenario where an initial limited AI is followed by a Sovereign AI (and this Sovereign AI is implemented without significant time spent on analysing the specific proposal, due to Internal Time Pressure).
Regarding Corrigibility as a Singular Target:
I don’t think that one can rely on this idea to prevent the outcome where a dangerous Sovereign AI proposal is successfully implemented at some later time (for example after an initial AI has been used to buy time). One issue is the difficulty of defining critical concepts such as Explanation and Understanding. I previously discussed this with Max Harms here, and with Nathan Helm-Burger here. Both of those comments are discussing attempts to make an AI pursue Corrigibility as a Singular Target (which should not be confused with my post on Corrigibility, which discussed a different type of Corrigibility).
Regarding what the designers might want:
The people actually building the stuff might not be the ones deciding what should be built. For example: if a messy coalition of governments enforces a global AI pause, then this coalition might be able to decide what will eventually be built. If a coalition is capable of successfully enforcing a global AI pause, then I don’t think that we can rule out the possibility that they will be able to enforce a decision to build a specific type of AI Sovereign (they could for example do this as a second step, after first managing to gain effective control over an initial instruction following AI). If that is the case, then the proposal to build something along the lines of a Group AI might very well be one of the politically feasible options (this was previously discussed in this post and in this comment).

A problem shared by many different alignment targets

ThomasCederborg15 Jan 2025 14:22 UTC

13 points

18 comments36 min readLW link

ThomasCederborg 23 Oct 2024 5:28 UTC
1 point
0
in reply to: Roko’s comment on: The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind
I thought that your Cosmic Block proposal would only block information regarding things going on inside a given Utopia. I did not think that the Cosmic Block would subject every person to forced memory deletion. As far as I can tell, this would mean removing a large portion of all memories (details below). I think that memory deletion on the implied scale would seriously complicate attempts to define an extrapolation dynamic. It also does not seem to me that it would actually patch the security hole illustrated by the thought experiment in my original comment (details below).
The first section argues that (unless Bob’s basic moral framework has been dramatically changed by the memory deletion) no level of memory deletion will prevent BPA from wanting to find and hurt Steve. In brief: BPA will still be subject to the same moral imperative to find and hurt any existing heretics (including Steve).
The second section argues that BPA is likely to find Steve. In brief: BPA is a clever AI and the memory deletion is a human constructed barrier (the Advocates are extrapolations of people that has already been subjected to these memory wipes. So Advocates cannot be involved when negotiating the rules governing these memory wipes). BPA would still have access to a lot of different information sources that it can use to find Steve.
The third section argues that if BPA finds Steve, then BPA would be able to hurt Steve. In brief: creating OldSteve is still not prevented by any rule or constraint that you have mentioned so far.
The fourth section argues that the side effects of memory deletion would be severe. In brief: memories of every conversation about any deleted person would also be deleted. Besides all direct memories involving any deleted person, many indirect memories would also be deleted. This would seriously complicate extrapolation. (Extrapolation is already a very tricky definitional problem. And this definitional problem cannot be delegated to extrapolated Advocates, since they are the result of the extrapolation dynamic being defined).
The last section deals with your proposed resource destruction mechanism. In brief: in one thought experiment (that did not involve threats) almost half the population would very strongly prefer to destroy ELYSIUM. This disproved your claim that such a situation would not arise. Also: no thought experiment involved anyone trying to gain resources. And no thought experiment involved anyone issuing any form of threat that would not be followed up on. People burned resources that they did not need. They did this to either hurt people directly. Or to issue genuinely credible threats.
BPA will still want to find and hurt heretics
It would not be surprising if a comprehensive, human defined, memory deletion operation would completely re write someone’s basic moral framework. But I’m guessing that your proposed memory deletion is not supposed to be done in a way that changes a persons basic moral framework. So let’s reason from the assumption that it does not.
This means that Bob still considers BPA to be bound by a non negotiable moral imperative. So BPA still wants to find and punish any heretic that might exist.
In other words: the Cosmic Block is not the type of mechanism that might prevent BPA from wanting to find and hurt Steve. It is instead a practical obstacle that BPA needs to overcome (which is something very different). It is a human constructed practical barrier, that is supposed to protect Steve from a clever AI that wants to find and hurt Steve.
BPA will likely be able to find Steve
Unless the details of your proposed Cosmic Block are constructed by an AI that prevents All Bad Things, these rules must come from somewhere else. AI assisted negotiations cannot be done by the Advocates. Advocates are the result of extrapolating memory wiped people (otherwise the whole point of the Cosmic Bloc is lost). So the Advocates cannot be involved in defining the memory wipe rules.
In other words: unless the memory wipe rules are negotiated by a completely separate set of (previously unmentioned) AIs, the memory wipe rules will be human defined.
This means that a human constructed barrier must hold against a clever AI trying to get around it. Even if we were to know that a human defined barrier has no humanly-findable security holes, this does not mean that it will actually hold against a clever AI. A clever AI can find security holes that are not humanly-findable.
The specific situation that BPA will find itself in does not seem to be described in sufficient detail for it to be possible to outline a specific path along which BPA finds Steve. But from the currently specified rules, we do know that BPA has access to several ways of gathering information about Steve.
People can pool resources (as described in your original proposal). So Advocates can presumably ask other Advocates about potential partners for cohabitation. Consider the case where BPA is negotiating with other Advocates regarding who will be included in a potential shared environment. This decision will presumably involve information about potential candidates. Whether or not a given person is accepted, would presumably depend on detailed personal information.
Advocates can also engage in mutual resource destruction to prevent computations happening within other Utopias. You describe this mechanism as involving negotiations between Advocates, regarding computations happening within other people’s Utopias. Such negotiations would primarily be between the Advocates of people that have very different values. This is another potential information source about Steve.
Steve would also have left a lot of effects on the world, besides effects on peoples memories. Steve might for example have had a direct impact on what type of person someone else has turned into. Deleting this impact would be even more dramatic than deleting memories.
Steve might have also have had a significant impact on various group dynamics (for example: his family, the friend groups that he has been a part of, different sets of coworkers and classmates, online communities, etc). Unless all memories regarding the general group dynamics of every group that Steve has been a part of is deleted, Steve’s life would have left behind many visible effects.
The situation is thus that a clever AI is trying to find and hurt Steve. There are many different types of information sources that can be combined in clever ways to find Steve. The rules of all barriers between this AI and Steve are human constructed. Even with perfect enforcement of all barriers, this still sounds like a scenario where BPA will find Steve (for the same reason that a clever AI is likely to find its way out of a human constructed box, or around a human constructed Membrane).
There is still nothing protecting Steve from BPA
If BPA locates Steve, then there is nothing preventing BPA from using OldSteve to hurt Steve. What is happening to OldSteve is still not prevented by any currently specified rule. The suffering of OldSteve is entirely caused by internal dynamics. OldSteve never lacks any form of information. And the harm inflicted on OldSteve is not in any sense marginal.
I do not see any strong connections between the OldSteve thought experiment and your Scott Alexander quote (which is concerned with the question of what options and information should be provided by a government run by humans. To children raised by other humans). More generally: scenarios that include a clever AI that is specifically trying to hurt someone, has a lot of unique properties (important properties that are not present in scenarios that lack such an AI). I think that these scenarios are dangerous. And I think that they should be avoided (as opposed to first created and then mitigated). (Avoiding such scenarios is a necessary, but definitely not sufficient, feature of an alignment target).
Memory wipes would complicate extrapolation
All deleted memories must be so thoroughly wiped that a clever AI will be unable to reconstruct them (otherwise the whole point of the Cosmic Block is negated). Deleting all memories of a single important negative interpersonal relationship would be a huge modification. Even just deleting all memories of one famous person that served as a role model would be significant.
Thoroughly deleting your memory of a person, would also impact your memory of every conversation that you have ever had about this person. Including conversations with people that are not deleted. Most long term social relationships involves a lot of discussions of other people (one person describing past experiences to the other, discussions of people that both know personally, arguments over politicians or celebrities, etc, etc). Thus, the memory deletion would significantly alter the memories of essentially all significant social relationships. This is not a minor thing to do to a person. (That every person would be subjected to this is not obviously implied by the text in The ELYSIUM Proposal.)
In other words: even memories of non deleted people would be severely modified. For example: every discussion or argument about a deleted person would be deleted. Two people (that do not delete each other) might suddenly have no idea why they almost cut all contact a few years ago, and why their interactions has been so different for the last few years. Either their Advocates can reconstruct the relevant information (in which case the deletion does not serve its purpose). Or their Advocates must try to extrapolate them while lacking a lot of information.
Getting the definitions involved in extrapolation right, seems like it will be very difficult even under ordinary circumstances. Wide ranging and very thorough memory deletion would presumably make extrapolation even more tricky. This is a major issue.
Your proposed resource destruction mechanism
No one in any of my thought experiments was trying to get more resources. The 55 percent majority (and the group of 10 people) have a lot of resources that they do not care much about. They want to create some form of existence for themselves. This only takes a fraction of available resources to set up. They can then burn the rest of their resources on actions within the resource destruction mechanism. They either burn these resources to directly hurt people. Or they risk these resources by making threats that are completely credible. In the thought experiments where someone does issue a threat, the threat is issued because: a person giving in > burning resources to hurt someone who refuses > leaving someone that refuses alone. They are perfectly ok with an outcome where resources are spent on hurting someone that refuses to comply (they are not self modifying as a negotiation strategy. They just think that this is a perfectly ok outcome).
Preventing this type of threats would be difficult because (i): negotiations are allowed, and (ii): in any scenario where threats are prevented, the threatened action would simply be taken (for non strategic reasons). There is no difference in behaviour between scenarios where threats are prevented, and scenarios where threats are ignored.
The thought experiment where a majority burns resources to hurt a minority was a simple example scenario where almost half of the population would very strongly prefer to destroy ELYSIUM (or strongly prefer that ELYSIUM was never created). It was a response to your claim that your resource destruction mechanisms would prevent such a scenario. This thought experiment did not involve any form of threat or negotiation.
Let’s call a rule that prevents the majority from hurting the minority a Minority Protection Rule (MPR). There are at least two problems with your claim that a pre-AI majority would prevent the creation of a version of ELYSIUM that has an MPR.
First: without an added MPR, the post-AI majority is able to hurt the minority without giving up anything that they care about (they burn resources they don’t need). So there is no reason to think that an extrapolated post-AI majority would want to try to prevent the creation of a version of ELYSIUM with an MPR. They would prefer the case without an MPR. This does not imply that they care enough to try to prevent the creation of a version of ELYSIUM with an MPR. Doing so would presumably be very risky, and they don’t gain anything that they care much about. When hurting the minority does not cost them anything that they care about, they do it. That does not imply that this is an important issue for the majority.
More importantly however: you are conflating, (i): a set of un-extrapolated and un-coordinated people living in a pre-AI world, with (ii): a set of clever AI Advocates representing these same people, operating in a post-AI world. There is nothing unexpected about humans opposing / supporting an AI that would be good / bad for them (from the perspective of their extrapolated Advocates). That is the whole point of having extrapolated Advocates.

ThomasCederborg 19 Oct 2024 15:23 UTC
9 points
0
in reply to: Roko’s comment on: The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind
Implementing The ELYSIUM Proposal would lead to the creation of a very large, and very diverse, set of clever AIs that wants to hurt people: the Advocates of a great variety of humans, that wants to hurt others in a wide variety of ways, for a wide variety of reasons. Protecting billions of people from this set of clever AIs would be difficult. As far as I can tell, nothing that you have mentioned so far would provide any meaningful amount of protection from a set of clever AIs like this (details below). I think that it would be better to just not create such a set of AIs in the first place (details below).
Regarding AI assisted negotiations
I don’t think that it is easy to find a negotiation baseline for AI-assisted negotiations that results in a negotiated settlement that actually deals with such a set of AIs. Negotiation baselines are non trivial. Reasonable sounding negotiation baselines can have counterintuitive implications. They can imply power imbalance issues that are not immediately obvious. For example: the random dictator negotiation baseline in PCEV gives a strong negotiation advantage to people that intrinsically values hurting other humans. This went unnoticed for a long time. (It has been suggested that it might be possible to find a negotiation baseline (a BATNA) that can be viewed as having been acausally agreed upon by everyone. However, it turns out that this is not actually possible for a group of billions of humans).
The proposal to have a simulated war that destroys resources
10 people without any large resource needs could use this mechanism to kill 9 people they don’t like at basically no cost (defining C as any computation done within the Utopia of the person they want to kill). Consider 10 people that just want to live a long life, and that do not have any particular use for most of the resources they have available. They can destroy all computational resources of 9 people without giving up anything that they care about. This also means that they can make credible threats. Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people that prefers death to compliance. Doing this costs them basically nothing.
This mechanism does not rule out scenarios where a lot of people would strongly prefer to destroy ELYSIUM. A trivial example would be a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else. And then using the remaining resources to hurt the minority. In this scenario almost half of all people would very strongly prefer to destroy ELYSIUM. Such a majority could alternatively credibly threaten the minority and force them to modify the way they live their lives. The threat would be especially credible if the majority likes the scenario where a minority is punished for refusing to conform.
In other words: this mechanism seems to be incompatible with your description of personalised Utopias as the best possible place to be (subject only to a few non intrusive ground rules).
The Cosmic Block and a specific set of tests
This relies on a set of definitions. And these definitions would have to hold up against a set of clever AIs trying to break them. None of the rules that you have proposed so far would prevent the strategy used by BPA to punish Steve, outlined in my initial comment. OldSteve is hurt in a way that is not actually prevented by any rule that you have described so far. For example: the ``is torture happening here″ test would not trigger for what is happening to OldSteve. So even if Steve does in principle have the ability to stop this by using some resources destroying mechanism, Steve will not be able to do so. Because Steve will never become aware of what Bob is doing to OldSteve. Steve considers OldSteve to be himself in a relevant sense. So, according to Steve’s worldview, Steve will experience a lot of very unpleasant things. But the only version of Steve that would be able to pay resources to stop this, would not be able to do so.
So the security hole pointed out by me in my original thought experiment is still not patched. And patching this security hole would not be enough. To protect Steve, one would need to find a set of rules that preemptively patches every single security hole that one of these clever AIs could ever find.
I think that it would be better to just not create such a set of AIs
Let’s reason from the assumption that Bob’s Personal Advocate (BPA) is a clever AI that will be creating Bob’s Personalised Utopia. Let’s now again take the perspective of ordinary human individual Steve, that gets no special treatment. I think the main question that determines Steve’s safety in this scenario, is how BPA is adopting Steve-referring-preferences. I think this is far more important for Steve’s safety, than the question of what set of rules will govern Bob’s Personalised Utopia. The question of what BPA wants to do to Steve, seems to me to be far more important for Steve’s safety, than the question of what set of rules will constrain the actions of BPA.
Another way to look at this is to think in terms of avoiding contradictions. And in terms of making coherent proposals. A proposal that effectively says that everyone should be given everything that they want (or effectively says that everyone’s values should be respected) is not a coherent proposal. These things are necessarily defined in some form of outcome or action space. Trying to give everyone overlapping control over everything that they care about in such spaces introduces contradictions.
This can be contrasted with giving each individual influence over the adoption (by any clever AI) of those preferences that refer to her. Since this is defined in preference adoption space, it cannot guarantee that everyone will get everything that they want. But it also means that it does not imply contradictions (see this post for a discussion of these issues in the context of Membrane formalisms). Giving everyone such influence is a coherent proposal.
It also happens to be the case that if one wants to protect Steve from a far superior intellect, then preference adoption space seems to be a lot more relevant than any form of outcome or action space. Because if a superior intellect wants to hurt Steve, then one has to defeat a superior opponent in every single round of a near infinite definitional game (even under the assumption of perfect enforcement, winning every round in such a definitional game against a superior opponent seems hopeless). In other words: I don’t think that the best way to approach this is to ask how one might protect Steve from a large set of clever AIs that wants to hurt Steve for a wide variety of reasons. I think a better question is to ask how one might prevent the situation where such a set of AIs wants to hurt Steve.

ThomasCederborg 18 Oct 2024 5:29 UTC
11 points
0
in reply to: Roko’s comment on: The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind
My thought experiment assumed that all rules and constraints described in the text that you linked to had been successfully implemented. Perfect enforcement was assumed. This means that there is no need to get into issues such as relative optimization power (or any other enforcement related issue). The thought experiment showed that the rules described in the linked text does not actually protect Steve from a clever AI that is trying to hurt Steve (even if these rules are successfully implemented / perfectly enforced).
If we were reasoning from the assumption that some AI will try to prevent All Bad Things, then relative power issues might have been relevant. But there is nothing in the linked text that suggests that such an AI would be present (and it contains no proposal for how one might arrive at some set of definitions that would imply such an AI).
In other words: there would be many clever AIs trying to hurt people (the Advocates of various individual humans). But the text that you link to does not suggest any mechanism, that would actually protect Steve from a clever AI trying to hurt Steve.
There is a ``Misunderstands position?″ react to the following text:
The scenario where a clever AI wants to hurt a human that is only protected by a set of human constructed rules …
In The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt individual humans (the Advocates of various individual humans). So I assume that the issue is with the protection part of this sentence. The thought experiment outlined in my comment assumes perfect enforcement (and my post that this sentence is referring to also assumes perfect enforcement). It would have been redundant, but I could have instead written:
The scenario where a clever AI wants to hurt a human that is only protected by a set of perfectly enforced human constructed rules …
I hope that this clarifies things.
The specific security hole illustrated by the thought experiment can of course be patched. But this would not help. Patching all humanly findable security holes would also not help (it would prevent the publication of further thought experiments. But it would not protect anyone from a clever AI trying to hurt her. And in The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt people). The analogy with an AI in a box is apt here. If it is important that an AI does not leave a human constructed box (analogous to: an AI hurting Steve). Then one should avoid creating a clever AI that wants to leave the box (analogous to: avoid creating a clever AI that wants to hurt Steve). In other words: Steve’s real problem is that a clever AI is adopting preferences that refer to Steve, using a process that Steve has no influence over.
(Giving each individual influence over the adoption of those preferences that refer to her would not introduce contradictions. Because such influence would be defined in preference adoption space. Not in any form of action or outcome space. In The ELYSIUM Proposal however, no individual would have any influence whatsoever, over the process by which billions of clever AIs, would adopt preferences, that refer to her)

ThomasCederborg 17 Oct 2024 4:56 UTC
11 points
2
in reply to: Nathan Helm-Burger’s comment on: The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind
Let’s optimistically assume that all rules and constraints described in The ELYSIUM Proposal are successfully implemented. Let’s also optimistically assume that every human will be represented by an Advocate that perfectly represents her interests. This will allow us to focus on a problem that remains despite these assumptions.
Let’s take the perspective of ordinary human individual Steve. Many clever and powerful AIs would now adopt preferences that refer to Steve (the Advocates of humans that have preferences that refer to Steve). Steve has no influence regarding the adoption of these Steve-Preferences. If one of these clever and powerful AIs wants to hurt Steve, then Steve is only protected by a set of human constructed rules.
The scenario where a clever AI wants to hurt a human that is only protected by a set of human constructed rules was previously discussed here. That post is about Membrane formalisms. But the argument is similar to the argument presented below. Both arguments are also similar to arguments about the difficulty of keeping a clever AI contained in a human constructed box (if it is important that an AI stays in a human constructed box. Then one should not build a clever AI that wants to leave the box. If a clever AI wants to leave the box, then plugging all human-findable security holes is not enough. Because the clever AI can find a security hole that is not humanly-findable). Very briefly: this general situation is dangerous for Steve, because the AI that wants to hurt Steve is more clever than the humans that constructed the rules that are supposed to protect Steve.
Let’s explore one specific example scenario where a clever AI finds a way around the specific rules outlined in the text of The ELYSIUM Proposal. Bob does not want to hurt anyone. Bob certainly does not want to use his Utopia as a weapon. However, it is important for Bob that Bob’s Utopia was constructed by an ethical AI. A moral imperative for such an AI is that it must punish heretics (if such heretics exist). Bob would prefer a world where no one is a heretic and no one suffers. But unfortunately Steve is a heretic. And the moral imperative to punish Steve is more important than the avoidance of suffering. So Bob’s Personal Advocate (BPA) will try to punish Steve.
Steve now faces a clever AI trying to hurt him, and his only protection against this AI is a set of human constructed rules. Even if no human is able to find a way around some specific set of human constructed rules, BPA will be able to think up strategies that no human is able to comprehend (this more serious problem would remain, even if the security hole described below is fully patched). The real problem faced by Steve is that a clever AI has adopted Steve-referring-preferences. And Steve had no influence regarding the decision of which Steve-preferences would be adopted by this clever AI. But let’s now return to discussing one specific strategy that BPA can use to hurt Steve without breaking any of the rules described in this specific text.
BPA is constrained by the requirement that all created minds must enthusiastically consent to being created. The other constraint is that BPA is not allowed to torture any created mind. The task of BPA is thus to construct a mind that (i): would enthusiastically consent to being created, and (ii): would suffer in ways that Steve would find horrific, even though no one is torturing this mind.
The details will depend on Steve’s worldview. The mind in question will be designed specifically to hurt Steve. One example mind that could be created is OldSteve. OldSteve is what Steve would turn into, if Steve were to encounter some specific set of circumstances. Steve considers OldSteve to be a version of himself in a relevant sense (if Steve did not see things in this way, then BPA would have designed some other mind). OldSteve has adopted a worldview that makes it a moral obligation to be created. So OldSteve would enthusiastically consent to being created by BPA. Another thing that is true of OldSteve, is that he would suffer horribly due to entirely internal dynamics (OldSteve was designed by a clever AI, that was specifically looking for a type of mind that would suffer due to internal dynamics).
So OldSteve is created by BPA. And OldSteve suffers in a way that Steve finds horrific. Steve does not share the moral framework of OldSteve. In particular: Steve does not think that OldSteve had any obligation to be created. In general, Steve does not see the act of creating OldSteve as a positive act in any way. So Steve is just horrified by the suffering. BPA can crate a lot of copies of OldSteve with slight variations, and keep them alive for a long time.
(This comment is an example of Alignment Target Analysis (ATA). This post argued that doing ATA now is important, because there might not be a lot of time to do ATA later (for example because Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure). There are many serious AI risks that cannot be reduced by any level of ATA progress. But ATA progress can reduce the probability of a bad alignment target getting successfully implemented. A risk reduction focused ATA project would be tractable, because risks can be reduced even if one is not able to find any good alignment target. This comment discuss which subset of AI risks can (and cannot) be reduced by ATA. This comment is focused on a different topic but it contains a discussion of a related concept (towards the end it discusses the importance of having influence over the adoption of self-referring-preferences by clever AIs).)

ThomasCederborg 11 Oct 2024 1:14 UTC
3 points
0
in reply to: johnswentworth’s comment on: A Pivotal Act AI might not buy a lot of time
I changed the title from: ``A Pivotal Act AI might not buy a lot of time″ to: ``Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure″.
As explained by Martin Randall, the statement: ``something which does not buy ample time is not a pivotal act″ is false (based on the Arbital Guarded Definition of Pivotal Act). Given your ``Agreed react″ to that comment, this issue seems to be settled. In the first section of the present comment, I explain why I still think that the old title was a mistake. The second section outlines a scenario that better illustrates that a Pivotal Act AI might not buy a lot of time.
Why the old title was a mistake
The old title implied that launching the LAI was a very positive event. With the new title, launching the LAI may or may not have been a positive event. This was the meaning that I intended.
Launching the LAI drastically increased the probability of a win by shutting down all competing AI projects. It however also increased risks from scenarios where someone successfully hits a bad alignment target. This can lead to a massively worse than extinction outcome (for example along the lines of the outcome implied by PCEV). In other words: launching LAI may or may not have been a positive event. Thus, launching the LAI may or may not have been a Pivotal Act according to the Arbital Guarded Definition (which requires the event to be very positive).
The old title does not seem to be incompatible with the actual text of the post. But it is incompatible with my intended meaning. I didn’t intend to specify whether or not LAI was a positive event. Because the argument about the need for Alignment Target Analysis (ATA) goes through regardless of whether or not launching LAI was a good idea. Regardless of whether or not launching LAI was a positive event, ATA work needs to start now to reduce risks. Because in both cases, ATA progress is needed to reduce risks. And in both cases, there is not a lot of time to do ATA later. (ATA is in fact more important in scenarios where launching the LAI was in fact a terrible mistake)
As I show in my other reply: there is a well established convention of using the term Pivotal Act as a shorthand for shutting down all competing AI projects. As can be seen by looking at the scenario in the post: this might not buy a lot of time. That is how I was using the term when I picked the old title.
A scenario that better illustrates why a Pivotal Act AI might not buy a lot of time
This section outlines a scenario where an unambiguous Pivotal Act is instantly followed by a very severe time crunch. It is possible to see that a Pivotal Act AI might not buy a lot of time by looking at the scenario in the post. But the present section will outline a scenario that better illustrates this fact. (In other words: this section outlines a scenario for which the old title would actually be a good title.) In this new scenario, a Pivotal Act dramatically reduces the probability of extinction by shutting down all unauthorised AI projects. It also completely removes the possibility of anything worse than extinction. Right after the Pivotal Act, there is a frenzied race against the clock to make enough progress on ATA before time runs out. Failure results in a significant risk of extinction.
Consider the case where Dave launches Dave’s AI (DAI). If DAI had not been launched, everyone would have almost certainly been killed by some other AI. DAI completely and permanently shuts down all competing AI projects. DAI also reliably prevents all scenarios where designers fail to hit the alignment target that they are aiming at. Due to Internal Time Pressure, a Sovereign AI must then be launched very quickly (discussions of Internal Time Pressure can be found here, and here, and here). There is very little time to decide what alignment target to aim at. (The point made in this section is not sensitive to who gave Dave permission to launch DAI. Or sensitive to who DAI will defer to for the choice of alignment target. But for the sake of concreteness, let’s say that the UN security council authorised DAI. And that DAI defers to a global electorate regarding the choice of alignment target).
By the time Dave launches DAI, work on ATA has already progressed very far. There already exist many alignment targets that would in fact lead to an unambiguous win (somehow, describing these outcomes as a win is objectively correct). Only one of the many proposed alignment targets still has an unnoticed problem. And this problem is not nearly as severe as the problem with PCEV. People take the risks of unnoticed problems very seriously. But due to severe Internal Time Pressure, there is not much they can do with this knowledge. The only option is to use their limited time to analyse all alignment targets that are being considered. (many very optimistic assumptions are made regarding both DAI and the level of ATA progress. This is partly to make sure that readers will agree that the act of launching DAI should count as a Pivotal Act. And partly to show that ATA might still be needed, despite these very optimistic assumptions).
The only alignment target that is not a clear win, is based on maximising the sum of re-normalised utility functions. The proposed AI includes a proposed way of mapping a human to a utility function. This always results in a perfect representation of what the human wants. (And there are no definitional issues with this mapping). These functions are then renormalised to have the same variance (as discussed here). Let’s write VarAI for this AI. VarAI maximises the sum of the renormalised functions. The aggregation method described above has a problem that is obvious in retrospect. If that problem is explained, then it is clear that VarAI is an unacceptable alignment target. However, in this scenario, no one has noticed this problem. The question is now whether or not anyone will notice the problem (before an alignment target needs to be settled on).
Due to the problem with this aggregation method, VarAI would create a Soft Maximisation version of a Suffering Reducing AI (SMSRAI) as a successor AI (if VarAI is successfully implemented and pointed at the population in this thought experiment). So, if the proponents of VarAI happens to win the political fight, then the result would be SMSRAI. SMSRAI would work to reduce both average suffering and population size (without doing anything drastic). The quality of most peoples lives would increase very dramatically. Many people would choose to spend centuries living a very nice life. But after some conversations with SMSRAI, few would choose to live for millennia. No one would choose to live for many millennia. Humanity would very gradually go extinct in a few dozen generations. By the time it happens, no one would find this problematic in any way. Changes in attitudes are very gradual, and no one ever feels pressured or manipulated by SMSRAI (for example: if you ask SMSRAI to leave you alone, it backs off for a couple of centuries before politely asking again if you want to talk).
If anyone finds the problem with the aggregation method in time, then VarAI will be dropped as a potential alignment target. (Once the problem is explained, it is retrospectively obvious why VarAI is completely unacceptable as an alignment target). If no one finds this problem in time however, it is possible that VarAI will win the political fight.
In order to avoid accidental interference, Dave designed DAI to not discuss ATA. (As a general strategy for reducing risks from definitional issues, Dave actually made sure that DAI only does things that are absolutely necessary for the completion of a carefully chosen set of well defined tasks). So the only way to remove the possibility of extinction, is to reach the needed level of ATA understanding before it’s too late. (If the problem with this aggregation method is not explained to them, their only path to understanding is to try doing ATA).
Despite severe Internal Time Pressure, the possibility of an unnoticed problem in an alignment target is taken very seriously. So a last judge safety add on is implemented. The last judge is extrapolated using a different extrapolation method than the one used in VarAI (in order to give the last judge a chance to veto a bad extrapolation method. The VarAI extrapolation definitions are in fact somehow objectively correct. But there is no way of verifying that fact). The result of the last judge extrapolation method turns out to be a very anxious mind. The result is a mind that is in general not comfortable with objecting to things (such as the extrapolation method that it is the output of, or the outcome implied by an alignment target). This mind is very reluctant to veto a scenario where no one is suffering, and where almost everyone are very happy with all aspects of how things turn out (SMSRAI very gradually, over many generations, ``helps people realise″ that the outcome is actually a good outcome. And people genuinely are having a very nice time, for a lot longer than most people expected). So the off switch is not triggered.
If Dave had not launched DAI, all humans would very likely have been killed very soon by some other AI. So I think a lot of people would consider Launching DAI to be a Pivotal Act. (It completely upset the game board. It drastically increased the probability of a win. It was a very positive event according to a wide range of value systems). But if someone wants humanity to go on existing (or wants to personally live a super long life), then there is not a lot of time to find the problem with VarAI (because without sufficient ATA progress, there still exists a significant probability of extinction). So, launching DAI was a Pivotal Act. And launching DAI did not result in a lot of time to work on ATA. Which demonstrates that a Pivotal Act AI might not buy a lot of time.
One can use this scenario as an argument in favour of starting ATA work now. It is one specific scenario that exemplifies a general class of scenarios: scenarios where starting ATA work now, would further reduce an already small risk of a moderately bad outcome. It is a valid argument. But it is not the argument that I was trying to make in my post. I was thinking of something a lot more dangerous. I was imagining a scenario where a bad alignment target is very likely to get successfully implemented unless ATA progresses to the needed levels of insight before it is too late. And I was imagining an alignment target that implied a massively worse than extinction outcome (for example along the lines of the outcome implied by PCEV). I think this is a stronger argument in favour of starting work on ATA now. And this interpretation was ruled out by the old title (which is why I changed the title).
(a brief tangent: if someone expects everything to turn out well. But would like to work on ATA in order to further reduce a small probability of something going moderately bad. Then I would be very happy to collaborate with such a person in a future ATA project. Having very different perspectives in an ATA project sounds like a great idea. An ATA project is very different from a technical design project where a team is trying to get something implemented that will actually work. There is really no reason for people to have similar worldviews or even compatible ontologies. It is a race against time to find a conceptual breakthrough of an unknown type. It is a search for an unnoticed implicit assumption of an unknown type. So genuinely different perspectives sounds like a great idea)
In summary: ``A Pivotal Act AI might not buy a lot of time″ is in fact a true statement. And it is possible to see this by looking at the scenario outlined in the post. But it was a mistake to use this statement as the title for this post. Because it implies things about the scenario that I did not intend to imply. So I changed the title and outlined a scenario that is better suited for illustrating that a Pivotal Act AI might not buy a lot of time.
PS:
I upvoted johnswentworth’s comment. My original title was a mistake. And the comment helped me realise my mistake. I hope that others will post similar comments on my posts in the future. The comment deserves upvotes. But I feel like I should ask about these agreement votes.
The statement: ``something which does not buy ample time is not a pivotal act″ is clearly false. Martin Randall explained why the statement is false (helpfully pulling out the relevant quotes from the texts that johnswentworth cited). And then johnswentworth did an ``Agreed reaction″ on Martin Randall’s explanation of why the statement is false. After this however, johnswentworth’s comment (with the statement that had already been determined to be false) was agree voted to plus 7. That seemed odd to me. So I wanted to ask about it. (My posts sometimes question deeply entrenched assumptions. And johnswentworth’s comment sort of looks like criticism (at least if one only skims the post and the discussion). So maybe there is no great mystery here. But I still wanted to ask about this. Mostly in case someone has noticed an object level error in my post. But I am also open to terminology feedback)
What links here?
- ThomasCederborg's comment on Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure by ThomasCederborg (EA Forum; 11 Oct 2024 1:19 UTC; 1 point)

ThomasCederborg 4 Oct 2024 15:11 UTC
4 points
0
in reply to: johnswentworth’s comment on: A Pivotal Act AI might not buy a lot of time
I will change the title.
However: you also seem to be using the term Pivotal Act as a synonym for removing all time pressure from competing AI projects (which the AI in my post does). Example 3 of the arbital page that you link to also explicitly refers to an act that removes all time pressure from competing AI projects as a Pivotal Act. This usage is also present in various comments by you, Yudkowsky, and others (see links and quotes below). And there does not seem to exist any other established term for an AI that: (i): completely removes all time pressure from competing AI projects by uploading a design team and giving them infinite time to work, (ii): keeps the designers calm, rational, sane, etc indefinitely (with all definitional issues of those terms fully solved), and (iii): removes all risks from scenarios where someone fails to hit an alignment target. What other established term exists for such an AI? I think people would generally refer to such an AI as a Pivotal Act AI. And as demonstrated in the post: such an AI might not buy a lot of time.
Maybe using the term Pivotal Act as a synonym for an act that removes all time pressure from competing AI projects is a mistake? (Maybe the scenario in my post should be seen as showing that this usage is a mistake?). But it does seem to be a very well established way of using the term. And I would like to have a title that tells readers what the post is about. I think the current title probably did tell you what the post is about, right? (that the type of AI actions that people tend to refer to as Pivotal Acts might not buy a lot of time in reality)
In the post I define new terms. But if I use a novel term in the title before defining the this term, the title will not tell you what the post is about. So I would prefer to avoid doing that.
But I can see why you might want to have Pivotal Act be a protected term for something that is actually guaranteed to buy a lot of time (which I think is what you would like to do?). And perhaps it is possible to maintain (or re-establish?) this usage. And I don’t want to interfere with your efforts to do this. So I will change the title.
If we can’t find a better solution I will change the title to: Internal Time Pressure. It does not really tell you what the post will be about. But at least it is accurate and not terminologically problematic. And even though the term is not commonly known, Internal Time Pressure is actually the main topic of the post (Internal Time Pressure is the reason that the AI mentioned above, that does all the nice things mentioned, might not actually buy a lot of time).
Regarding current usage of the term Pivotal Act:
It seems to me like you and many others are actually using the term as a shorthand for an AI that removes time pressure from competing AI projects. I can take many examples of this usage just from the discussion that faul_sname links to in the other reply to your comment.
In the second last paragraph of part 1 of the linked post, Andrew_Critch writes:
Overall, building an AGI development team with the intention to carry out a “pivotal act” of the form “forcibly shut down all other A(G)I projects” is probably going to be a rough time, I predict.
No one seems to be challenging that usage of Pivotal Act (even though many other parts of the post are challenged). And it is not just this paragraph. The tl;dr also treats a Pivotal Act as interchangeable with: shut down all other AGI projects, using safe AGI. There are other examples in the post.
In this comment on the post, it seems to me that Scott Alexander is using a Pivotal Act AI as a direct synonym for an AI capable of destroying all competing AI projects.
In this comment it seems to me like you are using Pivotal Act interchangeably with shutting down all competing AI projects. In this comment, it seems to me that you accept the premise that uploading a design team and running them very quickly would be a Pivotal Act (but you question the plan on other grounds). In this comment, it seems to me that you are equating successful AI regulation with a Pivotal Act (but you question the feasibility of regulation).
In this comment, Yudkowsky seems to me to be accepting the premise that preventing all competing AI projects would count as a Pivotal Act. He says that the described strategy for preventing all competing AI projects is not feasible. But he also says that he will change the way he speaks about Pivotal Acts if the strategy actually does work (and this strategy is to shut down competing AI projects with EMPs. The proposed strategy does nothing else to buy time, other than shutting down competing AI projects). (It is not an unequivocal case of using Pivotal Act as a direct synonym for reliably shutting down all competing AI projects. But it really does seem to me like Yudkowsky is treating Pivotal Act as a synonym for: preventing all competing AI projects. Or at least that he is assuming that preventing all competing AI projects would constitute a Pivotal Act).
Consider also example 3 in the arbital page that you link to. Removing time pressure from competing AI projects by uploading a design team is explicitly defined as an example of a Pivotal Act. And the LAI in my post does exactly this. And the LAI in my post also does a lot of other things that increase the probability of a win (such as keeping the designers sane and preventing them from missing an aimed for alignment target).
This usage points to a possible title along the lines of: The AI Actions that are Commonly Referred to as Pivotal Acts, are not Actually Pivotal Acts (or: Shutting Down all Competing AI Projects is not Actually a Pivotal Act). This is longer and less informative about what the post is about (the post is about the need to start ATA work now, because there might not be a lot time to do ATA work later, even if we assume the successful implementation of a very ambitious AI, whose purpose was to buy time). But this title would not interfere with an effort to maintain (or re-establish?) the meaning of Pivotal Act as a synonym for an act that is guaranteed to buy lots of time (which I think is what you are trying to do?). What do you think about these titles?

PS:
(I think that technically the title probably does conform to the specific text bit that you quote. It depends on what the current probability of a win is. And how one defines: drastically increase the probability of a win. But given the probability that Yudkowsky currently assigns to a win, I expect that he would agree that the launch of the described LAI would count as drastically increasing the probability of a win. (In the described scenario, there are many plausible paths along which the augmented humans actually do reach the needed levels of ATA progress in time. They are however not guaranteed to do this. The point of the post is that doing ATA now increases the probability of this happening). The statement that the title conforms to the quoted text bit is however only technically true in an uninteresting sense. And the title conflicts with your efforts to guard the usage of the term. So I will change the title as soon as a new title has been settled on. If nothing else is agreed on, I will change the title to: Internal Time Pressure)
What links here?
- ThomasCederborg's comment on Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure by ThomasCederborg (11 Oct 2024 1:14 UTC; 3 points)

Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure

ThomasCederborg3 Oct 2024 0:01 UTC

12 points

7 comments12 min readLW link

ThomasCederborg 25 Sep 2024 15:22 UTC
8 points
0
in reply to: Charbel-Raphaël’s comment on: The case for more Alignment Target Analysis (ATA)
Your comment makes me think that I might have been unclear regarding what I mean with ATA. The text below is an attempt to clarify.
Summary
Not all paths to powerful autonomous AI go through methods from the current paradigm. It seems difficult to rule out the possibility that a Sovereign AI will eventually be successfully aligned to some specific alignment target. At current levels of progress on ATA this would be very dangerous (because understanding an alignment target properly is difficult, and a seemingly-nice proposal can imply a very bad outcome). It is difficult to predict how long it would take to reach the level of understanding needed to prevent scenarios where a project successfully hits a bad alignment target. And there might not be a lot of time to do ATA later (for example because a tool-AI shuts down all unauthorised AI projects. But does not buy a lot of time due to internal time pressure). So a research effort should start now.
Therefore ATA is one of the current priorities. There are definitely very serious risks that ATA cannot help with (for example misaligned tool-AI projects resulting in extinction). There are also other important current priorities (such as preventing misuse). But ATA is one of the things that should be worked on now.
The next section outlines a few scenarios designed to clarify how I use the term ATA. The section after that outlines a scenario designed to show why I think that ATA work should start now.
What I mean with Alignment Target Analysis (ATA)
The basic idea with ATA is to try to figure out what would happen if a given AI project were to successfully align an autonomously acting AI Sovereign to a given alignment target. The way I use the term, there are very severe risks that cannot be reduced in any way, by any level of ATA progress (including some very serious misalignment and misuse risks). But there are also risks that can and should be reduced by doing ATA now. There might not be a lot of time to do ATA later, and it is not clear how long it will take to advance to the level of understanding that will be needed. So ATA should be happening now. But let’s start by clarifying the term ATA, by outlining a couple of dangerous AI projects where ATA would have nothing to say.
Consider Bill, who plans to use methods from the current paradigm to build a tool-AI. Bill plans to use this tool AI to shut down competing AI projects and then decide what to do next. ATA has nothing at all to say about this situation. Let’s say that Bill’s project plan would lead to a powerful misaligned AI that would cause extinction. No level of ATA progress would reduce this risk.
Consider Bob who also wants to build a tool-AI. But Bob’s AI would work. If the project would go ahead, then Bob would gain a lot of power. And Bob would use that power to do some very bad things. ATA has nothing to say about this project and ATA cannot help reduce this risk.
Now let’s introduce an unusual ATA scenario, just barely within the limits of what ATA can be used for (the next section will give an example of the types of scenarios that makes me think that ATA should be done now. This scenario is meant to clarify what I mean with ATA). Consider Dave who wants to use methods from the current paradigm to implement PCEV. If the project plan moves forwards, then the actual result would be a powerful misaligned AI: Dave’s Misaligned AI (DMAI). DMAI would not care at all what Dave is trying to do, and would cause extinction (for reasons that are unrelated to what Dave was aiming at). One way to reduce the extinction risk from DMAI would be to tell Dave that his plan would lead to DMAI. But it would also be valid to let Dave know that if his project were to successfully hit the alignment target that he is aiming for, then the outcome would be massively worse than extinction.
Dave assumes that he might succeed. So, when arguing against Dave’s project, it is entirely reasonable to argue from the assumption that Dave’s project will lead to PCEV. Pointing out that success would be extremely bad is a valid argument against Dave’s plan, even if success is not actually possible.
You can argue against Dave’s project by pointing out that the project will in fact fail. Or by pointing out that success would be very bad. Both of these strategies can be used to reduce the risk of extinction. And both strategies are cooperative (if Dave is a well meaning and reasonable person, then he would thank you for pointing out either of these aspects of his plan). While both strategies can prevent extinction in a fully cooperative way, they are also different in important ways. It might be the case that only one of these arguments is realistically findable in time. It might for example be the case that Dave is only willing to publish one part of his plan (meaning that there might not be sufficient public information to construct an argument about the other part of the plan). And even if valid arguments of both types are constructed in time, it might still be the case that Dave will only accept one of these arguments. (similar considerations are also relevant for less cooperative situations. For example if one is trying to convince a government to shut down Dave’s project. Or if one is trying to convince an electorate to vote no on a referendum that Dave needs to win in order to get permission to move forwards)
The audience in question (Dave, bureaucrats, voters, etc) are only considering the plan because they believe that it might result in PCEV. Therefore it is entirely valid to reason from the assumption that Dave’s plan will result in PCEV (when one is arguing against the plan). There is no logical reason why such an argument would interfere with attempts to argue that Dave’s plan would in fact result in DMAI.
Now let’s use an analogy from the 2004 CEV document to clarify what role I see an ATA project playing. In this analogy, building an AI Sovereign is analogous to taking power in a political revolution. So (in the analogy) Dave proposes a political revolution. One way a revolution can end in disaster is that the revolution leads to a destructive civil war that the revolutionaries loose (analogous to DMAI causing extinction). Another way a revolution can end in disaster is that ISIS takes power after the government is overthrown (analogous to the outcome implied by PCEV).
It is entirely valid to say to Dave: ``if you actually do manage to overthrow the government, then ISIS will seize power″ (assuming that this conditional is true). One can do this regardless of whether or not one thinks that Dave has any real chance of overthrowing the government. (Which in turn means that one can actually say this to Dave, without spending a lot of time trying to determine the probability that the revolution will in fact overthrow the government. Which in turn means that people with wildly different views on how difficult it is to overthrow the government can cooperate while formulating such an argument)
(this argument can be made separately from an argument along the lines of: ``our far larger neighbour has a huge army and would never allow the government of our country to be overthrown. Your revolution will fail even if every single soldier in our country joins you instantly. Entirely separately: the army of our county is in fact fiercely loyal to the government and you don’t have enough weapons to defeat it. In addition to these two points: you are clearly bad at strategic thinking and would be outmanoeuvred in a civil war by any semi-competent opponent″. This line of argument can also prevent a hopeless civil war. The two arguments can be made separately and there is no logical reason for them to interfere with each other)
Analysing revolutionary movements in terms of what success would mean can only help in some scenarios. It requires a non-vague description of what should happen after the government falls. In general: this type of analysis cannot reduce the probability of lost civil wars, in cases where the post revolutionary strategy is either (i): too vaguely described to analyse, or (ii): actually sound (meaning that the only problem with the revolution in question is that it has no chance of success). Conversely however: arguments based on revolutions failing to overthrow the government cannot prevent revolutions that would actually end with ISIS in charge (analogous to AI projects that would successfully hit a bad alignment target). Scenarios that end in a bad alignment target getting successfully hit is the main reason that I think that ATA should happen now (in the analogy, the main point would be to reduce the probability of ISIS gaining power). Now let’s leave the revolution analogy and outline one such scenario.
A tool-AI capable of shutting down all unauthorised AI projects might not buy a lot of time
It is difficult to predict who might end up controlling a tool-AI. But one obvious compromise would be to put it under the control of some group of voters (for example a global electorate). Let’s say that the tool-AI is designed such that one needs a two thirds majority in a referendum, to be allowed to launch a Sovereign AI. There exists a Sovereign AI proposal that a large majority thinks sounds nice. A small minority would however prefer a different proposal.
In order to prevent inadvertent manipulation risks, the tool AI was designed to only discuss topics that are absolutely necessary for the process of shutting down unauthorised AI projects. Someone figures out how to make the tool-AI explain how to implement Sovereign AI proposals (and Explanation / Manipulation related definitions happens to hold for such discussions). But no one figures out how to get it to discuss any topic along the lines of ATA. The original plan was to take an extended period of time to work on ATA before implementing a Sovereign AI.
Both alignment targets use the same method for extrapolating people and for resolving disagreements. The difference is in terms of who is part of the initial group. The two proposals have different rules with respect to things like: animals, people in cryo, foetuses, artificial minds, etc. It doesn’t actually matter which proposal gets implemented: the aggregation method leads to the same horrific outcome in both cases (due to an issue along the lines of the issue that PCEV suffers from. But more subtle and difficult to notice). (All proposed alignment targets along the lines of ``build an AI Sovereign that would do whatever some specific individual wants it to do″ are rejected out of hand by almost everyone).
In order to avoid making the present post political, let’s say that political debates center around what to do with ecosystems. One side cares about nature and wants to protect ecosystems. The other side wants to prevent animal suffering (even if the cost of such prevention is the total destruction of every ecosystem on earth). It is widely assumed that including animals in the original group will lead to an outcome where animal suffering is prevented at the expense of ecosystems. (in order to make the following scenario more intuitive, readers that have an opinion regarding what should be done with ecosystems, can imagine that the majority shares this opinion)
The majority has enough support to launch their Sovereign AI. But the minority is rapidly and steadily gaining followers due to ordinary political dynamics (sometimes attitudes on a given issue changes steadily in a predictable direction). So the ability to get the preferred alignment target implemented can disappear permanently at any moment (the exact number of people that would actually vote yes in a referendum is difficult to estimate. But it is clearly shrinking rapidly). In this case the majority might act before they loose the ability to act. Part of the majority would however hesitate if the flaw with the aggregation method is noticed in time.
After the tool-AI was implemented, a large number of people started to work on ATA. There are also AI assistants that contribute to conceptual progress (they are tolerated by the tool-AI because they are not smarter than humans. And they are useful because they contribute a set of unique non-human perspectives). However, it turns out that ATA progress works sort of like math progress. It can be sped up significantly by lots of people working on it in parallel. But the main determinant of progress is how long people have been working on it. In other words: it turns out that there is a limit to how much the underlying conceptual progress can be sped up by throwing large numbers of people at ATA. So the question of whether or not the issue with the Sovereign AI proposal is noticed in time, is to a large degree determined by how long a serious ATA research project has been going on at the time that the tool-AI is launched (in other words: doing ATA now reduces the risk of a bad alignment target ending up getting successfully hit in this scenario).
(the idea is not that this exact scenario will play out as described. The point of this section was to give a detailed description of one specific scenario. For example: the world will presumably not actually be engulfed by debates about the Prime Directive from Star Trek. And a tool-AI controlled by a messy coalition of governments might lead to a time crunch due to dynamics that are more related to Realpolitik than any form of ideology. This specific scenario is just one example of a large set of similar scenarios)
PS:
On a common sense level I simply don’t see how one can think that it is safe to stay at our current level of ATA progress (where it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction). The fact that there exists no research project dedicated to improving this situation seems like a mistake. Intuitively this seems like a dangerous situation. At the very least it seems like some form of positive argument would be needed before concluding that this is safe. And it seems like such an argument should be published so that it can be checked for flaws before one starts acting based on the assumption that the current situation is safe. Please don’t hesitate to contact me with theories / questions / thoughts / observations / etc regarding what people actually believe about this.
What links here?

ThomasCederborg 21 Sep 2024 14:40 UTC
3 points
2
in reply to: Richard_Ngo’s comment on: The case for more Alignment Target Analysis (ATA)
I interpret your comment as a prediction regarding where new alignment target proposals will come from. Is this correct?

I also have a couple of questions about the linked text:
How do you define the difference between explaining something and trying to change someone’s mind? Consider the case where Bob is asking a factual question. An objectively correct straightforward answer would radically change Bob’s entire system of morality, in ways that the AI can predict. A slightly obfuscated answer would result in far less dramatic changes. But those changes would be in a completely different direction (compared to the straightforward answer). Refusing to answer, while being honest about the reason for refusal, would send Bob into a tailspin. How certain are you that you can find a definition of Acceptable Forms of Explanation that holds up in a large number of messy situations along these lines? See also this.
And if you cannot define such things in a solid way, how do you plan to define ``benefit humanity″? PCEV was an effort to define ``benefit humanity″. And PCEV has been found to suffer from at least one difficult-to-notice problem. How certain are you that you can find a definition of ``benefit humanity″ that does not suffer from some difficult-to-notice problem?
PS:
Speculation regarding where novel alignment target proposals are likely to come from are very welcome. It is a prediction of things that will probably be fairly observable fairly soon. And it is directly relevant to my work. So I am always happy to hear this type of speculation.

ThomasCederborg

A very brief outline of the argument for analysing Sovereign AI proposals now:

Technical note on scope:

A tangent:

Regarding your comment about CEV:

Regarding your comment about what type of alignment target is most likely to be pursued first:

The rule for determining which set of actions will be included in negotiations between delegates

Coherent Extrapolation of Equanimous Volition (CEEV)

How useful would an IFAI be for analysing MPSAIPs?

The delegation plan: The scenario where an IFAI does know how to define all of humanity’s long term, implicit deep values

The assistant plan: The scenario where an IFAI does not know how to define all of humanity’s long term, implicit deep values

Regarding the probability of a pause

An attempt to summarise how I view the situation

PS:

Regarding the probability that extrapolation will change Bob:

Regarding successor AIs:

Regarding the last paragraph that talks about finding a clever solution:

A few tangents:

Regarding extrapolation:

Regarding Corrigibility as a Singular Target:

Regarding what the designers might want:

A problem shared by many different alignment targets

BPA will still want to find and hurt heretics

BPA will likely be able to find Steve

There is still nothing protecting Steve from BPA

Memory wipes would complicate extrapolation

Your proposed resource destruction mechanism

Regarding AI assisted negotiations

The proposal to have a simulated war that destroys resources

The Cosmic Block and a specific set of tests

I think that it would be better to just not create such a set of AIs

Why the old title was a mistake

A scenario that better illustrates why a Pivotal Act AI might not buy a lot of time

PS:

Regarding current usage of the term Pivotal Act:

Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure

Summary

What I mean with Alignment Target Analysis (ATA)

A tool-AI capable of shutting down all unauthorised AI projects might not buy a lot of time

PS:

ThomasCederborg

A very brief outline of the argument for analysing Sovereign AI proposals now:

Technical note on scope:

A tangent:

Regarding your comment about CEV:

Regarding your comment about what type of alignment target is most likely to be pursued first:

The rule for determining which set of actions will be included in negotiations between delegates

Coherent Extrapolation of Equanimous Volition (CEEV)

How useful would an IFAI be for analysing MPSAIPs?

The delegation plan: The scenario where an IFAI does know how to define all of humanity’s long term, implicit deep values

The assistant plan: The scenario where an IFAI does not know how to define all of humanity’s long term, implicit deep values

Regarding the probability of a pause

An attempt to summarise how I view the situation

PS:

Regarding the probability that extrapolation will change Bob:

Regarding successor AIs:

Regarding the last paragraph that talks about finding a clever solution:

A few tangents:

Regarding extrapolation:

Regarding Corrigibility as a Singular Target:

Regarding what the designers might want:

A prob­lem shared by many differ­ent al­ign­ment targets

BPA will still want to find and hurt heretics

BPA will likely be able to find Steve

There is still nothing protecting Steve from BPA

Memory wipes would complicate extrapolation

Your proposed resource destruction mechanism

Regarding AI assisted negotiations

The proposal to have a simulated war that destroys resources

The Cosmic Block and a specific set of tests

I think that it would be better to just not create such a set of AIs

Why the old title was a mistake

A scenario that better illustrates why a Pivotal Act AI might not buy a lot of time

PS:

Regarding current usage of the term Pivotal Act:

Shut­ting down all com­pet­ing AI pro­jects might not buy a lot of time due to In­ter­nal Time Pressure

Summary

What I mean with Alignment Target Analysis (ATA)

A tool-AI capable of shutting down all unauthorised AI projects might not buy a lot of time

PS:

A problem shared by many different alignment targets

Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure