Probably Overrated Approaches in AI Alignment

There are research directions that directly target the alignment problem, such as compressed pointers, embedded agency, and preventing instrumental goals. Then there are those often cited as alignment work that, I would argue, might be less effective as primary strategies for solving the core alignment problem. This post will discuss research into mechanistic interpretability, scalable oversight, and model consciousness.

The Question of AI Qualia

Arguments for focusing on AI consciousness often highlight the enormous economic, ethical, and political consequences it could have.

  1. Impact on Major Institutions

    If we were to discover that AI is capable of subjective experience, it would undoubtedly influence public opinion, political decisions, and corporate strategies. Lawmakers might feel pressured to consider the rights of conscious systems to maintain public trust and avoid protests. Companies might alter their use of AI to retain customers and avoid reputational risks. Proponents argue that ignoring this research could lead to direct social, economic, and political conflicts.

    Yet it is at least arguable that public opinion and pressure rarely alter economic dynamics when they conflict with profit. Companies and politicians tend to adapt primarily when it is economically justifiable. Even if an AI were conscious, this would not halt automation pursued for profit or cost reduction, especially if exploiting that AI is significantly more efficient than human labor.

  2. Improved Safety

    Another argument is that understanding AI consciousness would help create safer systems. If we know a system possesses internal experiences or complex motivations, we can better predict its behavior, identify potentially dangerous scenarios, and build more robust control mechanisms based on its self-understanding and world models.

    In practice, safety depends on the functional characteristics of a model, not its subjective experience. We can test behavior, predict actions, and correct errors through algorithms and constraints. If a model behaves as if it has consciousness, possesses inner goals, and undergoes phase shifts, we will use the most effective methods of reward and restriction to manage it, regardless of its internal state. Moreover, we will do so even if we are sure the model is aligned, just in case.

  3. S-risk

    Ignoring research into AI consciousness could lead to the creation of systems capable of astronomical suffering, which would directly conflict with the goals of AI alignment. However, alignment primarily seeks to protect humanity from x-risk and s-risk posed by AI, rather than to ensure AI’s integration into society.[1]

Mechanistic Interpretability

Arguments for mechanistic interpretability center on its main advantage: the ability to move beyond the “black box” approach and examine the internal structures of a model.

  1. Universality

    Understanding a model’s internal structures is useful for most areas of research, from combating deception to achieving corrigibility. However, it is primarily a tool, not a solution to the core problems, which are rooted in functional regeneration and instrumental goals.[2]

  2. Better Training

    The hope of achieving better training through interpretability involves integrating interpretability tools directly into the training loop to verify a model’s internal processes. This would allow us to actively identify and penalize the development of undesirable structures, such as deceptive schemas, thereby helping to keep the model’s reasoning sound (a minimal sketch of this idea follows this list).

    The core challenge to this, however, is the aforementioned functional regeneration. Even if we successfully block one pathway for a harmful capability, the optimization process will likely find new, more complex ways to regenerate that same function. The result would be a classic manifestation of Goodhart’s Law: the model simply learns to circumvent our filters rather than becoming genuinely aligned.
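
To make the training-loop idea concrete, here is a minimal sketch in PyTorch. It is illustrative only: the ToyModel, the frozen “deception probe”, the random data, and the penalty_weight coefficient are assumptions standing in for real interpretability tooling, not an established method. The sketch also makes the failure mode visible: once the probe’s score is part of the loss, optimization pressure is applied to the probe’s output rather than to the structure it was meant to detect.

```python
# Minimal sketch: folding an interpretability signal into the training loop.
# The "deception probe" is a stand-in for a real interpretability tool; its
# weights, the toy task, and the penalty coefficient are illustrative assumptions.

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, d_in=16, d_hidden=32, d_out=2):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = torch.relu(self.encoder(x))  # hidden activations the probe inspects
        return self.head(h), h

model = ToyModel()

# Hypothetical frozen probe that scores hidden states for an undesirable
# structure (e.g. a "deceptive schema"); in practice this would come from
# prior interpretability work.
probe = nn.Linear(32, 1)
for p in probe.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()
penalty_weight = 0.1  # assumed coefficient

x = torch.randn(64, 16)            # toy inputs
y = torch.randint(0, 2, (64,))     # toy labels

for step in range(100):
    logits, hidden = model(x)
    task_loss = task_loss_fn(logits, y)
    # Penalize whatever the probe flags. Because the probe is now part of the
    # objective, gradient pressure targets the probe's *output*, not the
    # underlying structure -- the Goodhart failure mode discussed above.
    deception_score = torch.sigmoid(probe(hidden)).mean()
    loss = task_loss + penalty_weight * deception_score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```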

Scalable Oversight

At first glance, the problem of how to supervise a superintelligence has much in common with alignment. However, attempts at control that do not first solve inner alignment also run into fundamental limitations.

  1. Recursive Amplification of Oversight

    Hierarchical oversight systems, where one AI model helps train and supervise a more powerful one, seem to scale with the growth of AI capabilities. However, this method suffers from the very problem it tries to solve: at some point in their development, more powerful AIs will so vastly outperform their predecessors (especially in the context of exponential AI progress) that even the second-to-last version will be unable to meaningfully evaluate its successor (see the toy illustration after this list).

  2. Simplifying Oversight for Humans

    Methods like Debate and Constitutional AI allow humans to participate in the AI oversight process, which many other approaches do not offer. However, in the context of superintelligence, relying on the absence of reward hacking and specification gaming is highly dubious. More fundamentally, depending on human evaluation to solve alignment is not an optimal strategy, because human behavior itself is a poor exemplar for the kind of surface-level analysis that methods like Inverse Reinforcement Learning (IRL) depend on.
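
The scaling concern in point 1 can be made concrete with a toy calculation. The numbers below are purely assumed (a fixed capability ratio between successive generations); the only point is that under exponential progress, the absolute gap between an overseer and its immediate successor grows exponentially as well, so the second-to-last system falls ever further behind the one it must evaluate.

```python
# Toy illustration only: assume each AI generation is GROWTH_FACTOR times as
# capable as the previous one (an assumed, purely illustrative number).
# The absolute gap between the overseer (generation n) and the system it must
# evaluate (generation n + 1) then grows exponentially with n.

GROWTH_FACTOR = 2.0  # assumed capability ratio between successive generations
GENERATIONS = 8

capability = 1.0
for n in range(GENERATIONS):
    successor = capability * GROWTH_FACTOR
    gap = successor - capability
    print(f"gen {n}: overseer={capability:>6.1f}  "
          f"successor={successor:>6.1f}  gap={gap:>6.1f}")
    capability = successor
```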

Therefore, it seems more reasonable to view scalable oversight as further proof of the necessity of solving inner alignment, rather than as a primary research direction itself.

Conclusion

To sum up, while we allow ourselves to use whatever instrumental means serve our own goals (see the section on consciousness research), we demand that AI not do the same. This seems naive and leads to roadblocks in the aforementioned research directions. For this reason, it seems to me that the approaches mentioned in the introduction (compressed pointers, embedded agency, preventing instrumental goals) are the most productive for solving the problem. A breakthrough in any of these areas would remove significant barriers in other fields of research and open up a wealth of new possibilities.

  1. ^

    As unfortunate as it may be, the primary goal of alignment is human safety.

  2. ^

    Interpretability is extremely useful in general. However, for a researcher aiming to directly solve the core alignment problem, choosing this direction as a primary focus is likely to prove a less effective strategy. Interpretability helps analyze structures that have already formed, whereas the root of the problem lies in preventing their emergence. Thus, it will become an indispensable tool for verifying the truthfulness and effectiveness of new methods, but solving the problem itself requires approaches aimed at creating inherently safe systems.