I read this older post by Nate Soares from 2023, "AI as a Science, and Three Obstacles to Alignment Strategies," a pretty prescient overview of the challenges in alignment research.
Alignment is difficult because (1) alignment and capabilities are intertwined (alignment research often helps capabilities), (2) we don’t have a process to verify what good ideas or real progress look like, and (3) we likely get only one critical try. He already addresses many of the counterarguments that have been brought up recently.
(1) Without strong governance, a lot of alignment work will also help with capabilities, potentially even more than it helps alignment. This goes for interpretability as well as for AIs doing alignment R&D. Interpretability could lead to more efficient AIs, and possibly to recursive self-improvement. And having AIs do capabilities R&D is probably much more straightforward than having them do alignment research. If we wanted to use something like superalignment, we would need strong governance to make sure nobody simply tasks the same agents with capabilities research instead.
(2) A common objection is still that current models seem able to reason about morality, and that alignment must therefore be relatively easy. Nate thinks this mostly just tells us how well the AIs understand us, not whether they actually care about what we want. I personally think the situation in AI alignment has probably gotten worse since the post was written, with even more of the relative effort going into brand-safety-related issues.
While plenty of people say they have plans, that does not actually mean we have a workable plan; it largely just muddies the picture of how much progress is being made. What he describes here feels exactly like the current situation.
(3) One critical try
Nate argues that once “AI is capable of autonomous scientific/technological development” to the point where it can “gain a decisive strategic advantage over the rest of the planet,” you are operating in a very different environment than anything that came before. Since an AI in this regime could potentially kill you, you need to get it right on the first try, and that is really difficult.
One objection he addresses is that you could try to trick a weaker AI into thinking it could take over, and observe what it does. However, according to Nate, if we come up with some complex method to test whether a system would try to take over, we are still relying on that method working on the first critical try. This argument also cuts against the more recent idea of AI control, which came out in December 2023. I would add that these “trick the weaker AI into attempting takeover” strategies have at least two key problems: (1) the AIs being tested are still weaker than the real thing, and (2) you are trying to gather empirical data by observing something smarter than you. For example, an AI might notice the trick, pretend to be fooled, and simply not take over.
I think people often raise a second objection that Nate didn’t mention: that we could play the AIs off against each other in some form such that no single AI ever gains a decisive strategic advantage. But this, too, relies on the scheme working on the first critical try. I also assume such a method is not particularly promising if you can’t reliably align the first generation of AIs, especially since decision theory plausibly favors alliances between smart agents.