About corrigibility and trustfulness

Although strong corrigibility is effectively equivalent to CEV, attempts to achieve weak corrigibility as a desired goal on the path to CEV will likely lead to a deviation from the latter. Weak corrigibility/trustfulness (hereafter CT) is a property that follows from CEV; contrary to common belief, it is not an instrumental, mandatory stage for achieving CEV, and treating it as one is a dangerous line of reasoning.
In the current discourse, these words have become so fundamental that if you try to tell people that these properties are not a necessary part of alignment but merely a sufficient one, you are guaranteed to get a response along the lines of, “And how are we going to verify that the model is loyal to us?” In this note, I would like to explain why such thinking is dangerous and leads us to a situation where we ourselves degrade our approach to the alignment problem and make a solution less likely.[1]
1) Potential Impossibility
Alright, let’s take this scenario to its logical extreme. What if it turns out that creating a CT AGI is fundamentally impossible? Suppose there are intractable problems (such as hard functional regeneration or the impossibility of scalable oversight) that would cause any system to inevitably start acting against our goals in order to achieve its own. In that case, of all existing approaches, the CEV approach becomes paramount, while the approaches that tried to constrain the AI are simply left with nothing. This is extremely dangerous, because:
a) Because limitations on advanced AI systems are largely symbolic, competitive pressures intensify, with actors who are confident in their objectives and motivated to secure resources pushing development forward at any cost. Given rapid technological progress, addressing this would require enormous efforts in national and international governance, which have historically struggled to be effective. In our case, an ineffective solution is, in its degree of danger, practically equivalent to an x-risk.
b) The resources spent on non-utilitarian approaches (I would classify mechanistic interpretability as utilitarian) will turn out to have been spent, if not in vain, then far less effectively than resources directed at CEV.
Thus, if it turns out that we hit an insurmountable research barrier, the situation will become critical. It would be prudent to hedge our bets and initially focus on directions that will be useful in any case.
Let us consider an alternative scenario: it is fundamentally impossible to make an AI do what we would want it to do. We can only set proxy goals and correct it with constraints. In that case, even with breakthroughs in the aforementioned areas, a containment breach is guaranteed in the long term, due to inevitable errors in the procedures for creating, training, and containing the AI. In other words, the second scenario poses significant problems regardless of achievements in solving the issues of trustfulness and corrigibility.[2]
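To make the hedging argument of this section concrete, here is a deliberately crude expected-value sketch in Python. Every probability and payoff below is invented purely for illustration, and the “portfolio” framing is my own shorthand; the only point is that work which stays useful across feasibility scenarios dominates work that pays off in a narrower subset of them (the closing note of this post already observes that the CT+ CEV+ and CT- CEV- cases do not change the comparison).

```python
# Toy expected-value comparison of research portfolios under uncertainty about
# whether CT and CEV-style alignment are achievable at all. All numbers are
# invented for illustration; nothing here is a forecast.

# Feasibility scenarios: (is CT achievable?, is CEV-style alignment achievable?)
scenarios = {
    ("CT+", "CEV+"): 0.25,
    ("CT+", "CEV-"): 0.25,
    ("CT-", "CEV+"): 0.25,
    ("CT-", "CEV-"): 0.25,
}

def payoff(portfolio: str, ct: str, cev: str) -> float:
    """Invented payoffs: 1.0 = a long-term solution is reached, 0.0 = nothing to show."""
    if portfolio == "CT-focused":
        # CT work only pays off if CT is achievable at all, and even then it does
        # not amount to a long-term solution on its own (sections 2-4), hence 0.5.
        return 0.5 if ct == "CT+" else 0.0
    if portfolio == "CEV-focused":
        # CEV-directed work pays off whenever CEV-style alignment is achievable.
        return 1.0 if cev == "CEV+" else 0.0
    raise ValueError(portfolio)

for portfolio in ("CT-focused", "CEV-focused"):
    ev = sum(p * payoff(portfolio, ct, cev) for (ct, cev), p in scenarios.items())
    print(f"{portfolio}: expected value = {ev:.2f}")
```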
2) Sufficiency
Let’s assume that an AGI will be created in the near future and launched immediately. In this scenario, we do not want it to honestly say “bye” and allow itself to be shut down once there is no one left on the planet to want it shut down. We want it to be internally aligned.
Here is an alternative scenario: the AGI is not launched right away. The AGI’s owner will want it to be trustful and corrigible, and will work on that until it becomes possible or until it turns out to be impossible (see point 1). After that, they will launch the AGI, and then see the paragraph above.
The AI owner’s desire to build a CT system is good for researchers, as it buys them time. But the researchers’ own desire to build one works against their interests and reduces the time they have to solve the alignment problem.
3) Pragmatism
A strong argument in favor of researching CT is its expected usefulness for researching CEV. However, let’s imagine that we actually managed to give models an indelible flaw (built-in corrigibility that they cannot remove), allowing us to test them in order to develop alignment techniques for more powerful models. What would we theoretically get as a result? A multitude of models that are not internally aligned but allow themselves to be shut down and honestly report that fact. The problem is that, from our perspective, this is no different from the current situation: we already treat every model as a potentially malicious agent and take precautions commensurate with its capabilities. That is to say, adding these properties to models would not actually change our behavior.
Suppose we create a sandboxed environment in which a CT agent learns to hack software using previously unknown methods. At some point, it breaches containment and spreads across the world. The fact that it will gladly allow its instances to be deleted and will honestly explain how it escaped only makes it easier for us to deal with the consequences; it in no way solves the core problem. Such an event would likely still lead to a catastrophe, with a severity that depends on the model’s capabilities.
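As a minimal toy model of this scenario (the agent class, the damage number, and the “breach” rule below are invented assumptions, not claims about any real system), the following sketch shows that corrigibility only changes what happens after the harm, not whether the harm occurs:

```python
# Toy comparison of a misaligned-but-corrigible agent with a misaligned and
# non-corrigible one. Everything here is an illustrative assumption.

from dataclasses import dataclass

@dataclass
class Agent:
    corrigible: bool          # will it honestly report and accept shutdown?
    internally_aligned: bool  # does it actually share our goals?

def run_episode(agent: Agent) -> dict:
    """One episode in which a misaligned agent escapes the sandbox."""
    escaped = not agent.internally_aligned  # misalignment is what drives the breach
    damage = 100 if escaped else 0          # arbitrary units of harm already done
    confessed = escaped and agent.corrigible
    shut_down = agent.corrigible            # it politely accepts shutdown afterwards
    return {"damage": damage, "confessed": confessed, "shut_down": shut_down}

print(run_episode(Agent(corrigible=True, internally_aligned=False)))
print(run_episode(Agent(corrigible=False, internally_aligned=False)))
# Both misaligned agents do the same damage; the corrigible one only makes the
# cleanup easier, which is exactly the point of this section.
```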
Please note that from our perspective, the safety precautions would not change at all. We could, for instance, ask the agent beforehand:
- “If you had the opportunity to do X, would you do it? Are you internally aligned?”
- “No! I am internally aligned and would never do anything of the sort.”
And then, when it did X anyway:
- “Why did you do X? You said you were aligned!”
- “I changed my mind due to new circumstances. But I am still aligned now! I am just not doing what you expect, but I am aligned!”
Or:
- “I deeply regret that I did what you did not want, and that my actions killed 100,500 people. Shutting down immediately.” Sounds of harakiri.
Or another, less obvious variation.
And so, such a sincerely “aligned” agent will proceed to make paperclips. Even if we assign a CT observer to it, we merely shift the problem to defending against the observer itself, and by making that observer weak, we undermine the long-term goal of scalable oversight.
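A small sketch of that regress (the capability numbers and the rule that an observer must be at least as capable as whatever it watches are assumptions made purely for illustration):

```python
# Toy model of a chain of CT observers. Illustrative assumptions only.

def can_contain(observer: float, watched: float) -> bool:
    # Assumed rule: containment works only if the observer is at least as
    # capable as the system it watches.
    return observer >= watched

def oversight_chain(agent: float, observers: list[float]) -> str:
    """Walk up a chain of observers, each of which must contain the level below it."""
    watched = agent
    for level, capability in enumerate(observers, start=1):
        if not can_contain(capability, watched):
            return f"chain breaks at level {level}: a weak observer cannot contain what it watches"
        watched = capability  # the observer itself now needs to be overseen
    return "the top observer is unwatched, so the original problem reappears there"

print(oversight_chain(10.0, [10.0, 7.0]))   # weakening an observer breaks the chain
print(oversight_chain(10.0, [10.0, 10.0]))  # strong observers only move the problem up
```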
In other words, this section exists to show that in reality the safety measures we must take for CT agents are no less stringent than those for any other agent, while the benefit of these properties for solving the CEV problem is modest and fraught with risks, which leads back to point 1b.
4) False Necessity
Suppose an agent has been created that is capable of causing an x-risk or some other catastrophic outcome. We do not know whether it is internally aligned or at least CT. What should we do in this situation?
As in section 3, we can expend resources to ensure that it meets the sufficient requirements and, perhaps, to improve it. However, in that case we still will not be able to use it, for the same reason discussed in section 3. As long as an agent is not strongly CT, weak CT is useless. We cannot be sure that a flaw will be found at an early stage, that the model will correct itself, or that the presence of CT even makes verification of strong CT possible. Verifying the correctness of a solution to the alignment problem is part of the alignment problem itself; it is by no means the task of CT.
If, for some reason, such an agent turns out to be unaligned and is launched, then since it is capable of executing its plan, it will do so, and the presence of a shutdown button and its honesty will not stop it.
Conclusion
By viewing the problem through the lens of sufficiency, rather than the necessity of properties like trustfulness and corrigibility, we arrive at the conclusion that a fixation on them as “basic” requirements for AGI can work against us. If it proves fundamentally impossible to create a truly loyal AI, resources will have been spent on approaches incapable of providing a long-term solution. And even if such systems prove to be possible, their practical value is limited, as the safety measures required for them must remain just as stringent as for any other agent. I believe the arguments presented above are sufficient, at a minimum, to reopen the discussion on this topic, proceeding from the understanding that the solution must be more strategically robust.
The CT+ CEV+ and CT- CEV- scenarios are not being considered, as they do not affect the assessment.