My research focus is Alignment Target Analysis (ATA). I noticed that the most recently published version of CEV (Parliamentarian CEV, or PCEV) gives a large amount of extra influence to people who intrinsically value hurting other individuals. For Yudkowsky’s description of the issue, you can search the CEV Arbital page for ADDED 2023.
The fact that no one noticed this issue for over a decade shows that ATA is difficult. If PCEV had been successfully implemented, the outcome would have been massively worse than extinction. I think this illustrates that scenarios where someone successfully hits a bad alignment target pose a serious risk. I also think it illustrates that ATA can reduce these risks: noticing the issue reduced the probability of PCEV getting successfully implemented. More ATA is needed because PCEV is not the only bad alignment target that might end up getting implemented. ATA is, however, very neglected: there does not exist a single research project dedicated to it. In other words: the reason I am doing ATA is that it is a tractable and neglected way of reducing risks.
I am currently looking for collaborators. I am also looking for a grant or a position that would allow me to focus entirely on ATA for an extended period of time. Please don’t hesitate to get in touch if you are curious and would like to have a chat, or if you have any feedback, comments, or questions. You can, for example, PM me here, PM me on the EA Forum, or email me at thomascederborgsemail@gmail.com (that really is my email address; it’s a Gavagai / Word and Object joke from my grad student days).
My background is an undergraduate degree in physics followed by AI research. Links to some papers: P1 P2 P3 P4 P5 P6 P7 P8 (none of this work is connected to any form of deep learning).
It seems to me that we are going in circles and talking past each other to some degree in the discussion above. So I will briefly summarise my position on the main topics that you raise (I have argued for these positions above; here I am just summarising), and then give a short outline of the argument for analysing Sovereign AI proposals now.
Regarding the relative priority of different research efforts:
The type of analysis that I am doing in the post is designed to reduce one of the serious AI risks that we face. This risk comes from the combination of two facts: (i) we might end up with a successfully implemented Sovereign AI proposal that has not been analysed properly, and (ii) the successful implementation of a reasonable-sounding Sovereign AI proposal might lead to an outcome massively worse than extinction. In other words: reducing the risk of an outcome massively worse than extinction is a tractable research project (specifically, this risk can be reduced by analysing the types of alignment targets that the post analyses). This research project is currently not being pursued. Other efforts are needed to reduce other types of risks, and it is certainly possible for reasonable people to disagree substantially on how attention would best be allocated. But it still seems very clear to me that the current situation is a serious mistake.
I don’t actually know what the optimal allocation of attention would be. But I have been in contact with a lot of people during the last few years, and I have never gotten any form of pushback when I say that there currently exist exactly zero people in the world dedicated to the type of analysis that I am talking about. So whatever the optimal ratio is, I am confident that the type of analysis that I am advocating for deserves more attention. (It might of course be perfectly reasonable for a given AI safety researcher to decide not to personally pursue this type of analysis. But I am confident that the overall situation is not reasonable. It simply cannot be reasonable to have zero people dedicated to a tractable research project that reduces the probability of an outcome massively worse than extinction.)
Regarding the type of Instruction Following AGI (IFAGI) that you mention:
The successful implementation of such an IFAGI would not reliably prevent a Sovereign AI proposal from being successfully implemented later. And that Sovereign AI proposal might be implemented before it has been properly analysed. This means that the IFAGI idea does not remove the need for the type of risk-mitigation-focused research project that the post is an example of. In other words: such an IFAGI might not result in a lot of time to analyse Sovereign AI proposals, and it might not be a lot of help when analysing them. So even if we assume that an IFAGI will be successfully implemented, this would still not remove the need for the type of analysis that I am talking about. (Conditional on such an IFAGI being successfully implemented, we might get a lot of time, and we might get a lot of help with analysis. But we might also end up in a situation where we do not have much time, and where the IFAGI does not dramatically increase our ability to analyse Sovereign AI proposals.)
Regarding perfect solutions and provably safe AI:
I am not trying to do anything along the lines of proving safety. What I am trying to do is better described as trying to prove un-safety. I look at some specific proposed AI project plan (for example, a plan along the lines of: first humans are augmented; then those augmented humans build some form of non-Sovereign AI; and then they use that non-Sovereign AI to build an AI Sovereign that implements the CEV of Humanity). And then I explain why the success of this project would be worse than extinction (in expectation, from the perspective of a human individual, for the reasons outlined in the post). So I am in some sense looking for definitive answers, but more along the lines of provable catastrophe than provable safety. What I am trying to do is a bit like attempting to conclusively determine that a specific location contains a landmine (where a specific AI project plan being successfully implemented is analogous to a plan that ends with someone standing on the location of a specific landmine). It is very different from attempting to conclusively determine that a specific path is safe. (I just wanted to make sure that this is clear.)
A very brief outline of the argument for analysing Sovereign AI proposals now:
Claim 1: We might end up with a successfully implemented AI Sovereign. Even if the first clever thing created is not an AI Sovereign, an AI Sovereign might be developed later. Augmented humans, non-Sovereign AIs, etc., might be followed by an AI Sovereign. (See, for example, the proposed path to an AI Sovereign described on the CEV Arbital page.)
Claim 2: In some scenarios that end in a successfully implemented AI Sovereign, we will not get a lot of time to analyse Sovereign AI proposals. (For example, due to Internal Time Pressure. See also this subsection for an explanation of why shutting down competing AI projects might not buy a lot of time, and the last section of this comment, which outlines one specific scenario where a tool-AI successfully shuts down all unauthorised AI projects but does not buy a lot of time.)
Claim 3: In some scenarios that end in a successfully implemented AI Sovereign, we will not get a lot of help with analysis of Sovereign AI proposals. (Partly because asking an AI for a good Sovereign AI proposal is like asking an AI what goal it should have. See also this subsection on the idea of having AI assistants help with analysis. This subsection and this section argue that augmented humans might turn out to be good at hitting alignment targets, but not good at analysing alignment targets.)
Claim 4: A reasonable-sounding Sovereign AI proposal might lead to an outcome massively worse than extinction. (See, for example, the PCEV thought experiment.)
Claim 5: Noticing such issues is not guaranteed. (As illustrated, for example, by the fact that the problem with PCEV went unnoticed for many years.)
Claim 6: Reducing the probability of such outcomes is possible, and doing so is a tractable research project, because risk can be reduced without finding any good Sovereign AI proposals. (As illustrated, for example, by the present post and by the PCEV thought experiment.)
Claim 7: There exist exactly zero people in the world dedicated to this tractable way of reducing the probability of an outcome massively worse than extinction. (It is difficult to prove the non-existence of something. But I have been saying this for quite a while now, while talking to a lot of different people, and I have never gotten any form of pushback on it.)
Conclusion: We might end up with an outcome worse than extinction because a successfully implemented Sovereign AI proposal has a flaw that was realistically findable. It would make sense to spend a non-tiny amount of effort on reducing the probability of this.
(People whose intuition says that this conclusion must surely be false in some way could try to check whether that intuition is actually based on anything real. The most straightforward way would be to spell out the actual argument in public, so that the underlying logic can be checked. Acting on the assumption that such an intuition is based on something real, without at least trying to evaluate it first, does not sound like a good idea.)