Miscellaneous First-Pass Alignment Thoughts

I have been learning more about alignment theory in the last couple of months, and have heard from many people that writing down naive hypotheses can be a good strategy for developing your thoughts and getting feedback about them. So here goes:

  • I am skeptical that any training mechanism that we invent will so consistently avoid inner misalignment that we can use it to successfully align an AGI without supervision and rollback if the model we are training begins to become misaligned. As a result, good interpretability tools seem like a necessary part of almost all alignment schemes because they are necessary for doing this kind of oversight well; it makes sense that they feature heavily in Evan Hubinger’s post on different alignment proposals.

    • Conversely, sufficiently good interpretability tools seem close to sufficient for solving alignment: with them, we could prevent inner misalignment with high probability, and they would likely make other parts of alignment easier by improving our understanding of neural networks in general. They are also the main component of John Wentworth’s “Retarget the Search” strategy, which, if its assumptions hold (namely that human values are a natural enough abstraction that we will be able to locate them in a sufficiently intelligent world-modeling AI), could plausibly solve alignment.

    • Moreover, I buy the arguments in this post that we can plausibly make a lot of progress on interpretability (tl;dr: we have made lots of progress in neuroscience, which is basically interpretability for the human brain, and in many ways NN interpretability seems easier).

      • In addition, I think something like the natural abstraction hypothesis is true. In particular, it seems implausible that there is another way of abstracting the macroscopic physical world that is nearly as good as the way humans understand it (i.e. as a set of 3D objects, roughly the same set into which we carve the world, interacting in Newtonian-ish ways). Similarly, our understanding of people as agents that act to accomplish goals, as opposed to “mechanistic systems,” seems like an extremely powerful abstraction with no equally good alternative for reasoning about people’s behavior. Thus, I expect AIs to share our abstractions in at least two significant domains, making interpretability easier.

    • Interpretability research done now also seems more likely to successfully scale to generally intelligent systems than at least some other kinds of alignment work. My understanding is that circuits-style interpretability has been able to identify circuits that repeat across different networks. In combination with the idea that something like the NAH is true, this suggests that much of the interpretability work we do now may be applicable to generally intelligent systems. In comparison, the question of which specific training techniques will suffice to align AGI seems more likely to depend on empirical facts, which we don’t currently know, about what training AGIs will require.

    • Overall, interpretability, though it already has lots of people working on it, still seems extremely valuable to me: good interpretability is necessary for ~all alignment proposals, nearly sufficient for some of them, likely to be pretty tractable, and likely to scale.

  • It seems like OpenAI, DeepMind, etc. understand that outer alignment is a problem, but mostly have not addressed inner alignment. This is very bad because inner alignment seems to be the hardest part of the alignment problem; I think we already have at least plausible solutions to outer alignment in IDA and possibly RLHF, while we seem very far away from an inner alignment solution. Therefore, it seems to me extremely valuable to convince them that inner alignment is important.

    • Given that what seem to me like straightforward theoretical reasons to expect inner misalignment have not convinced them, I expect the best way to make progress on this issue is more empirical work, such as finding examples of “goal misgeneralization” in the wild. This post, which builds a toy model of a deceptively aligned agent, seems like the right direction.

  • Myopia seems very promising to me.

    • The hardest part of alignment seems to me to be that there are just so many more misaligned systems than aligned ones that you need to optimize fairly hard, over and above gradient descent, to find aligned models. In comparison to the region of parameter space containing aligned models, however, the region of myopic models is much, much larger, meaning that training a myopic model lacks the central challenge of training an aligned one. Moreover, I expect we can design training environments that heavily incentivize myopic cognition: it seems like you could structure your environment and reward function so that RL agents trained in them are heavily penalized for allocating resources to long-term goals (see the toy sketch at the end of this set of bullets).

    • I also expect myopic agents to be safe; there is a relatively straightforward case that systems without long term goals will not plan to take over the world, and the failure modes identified by Mark Xu seem unlikely to me.

    • The main question then becomes whether myopic systems are performance competitive. While myopic agents certainly can’t do some things that non-myopic agents can, I am hopeful that they will be competitive, partially because they can be good oracles, partially because some important problems can be solved quickly, and partially because we might be able to use groups of myopic agents to approximate non-myopic agents.
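
    • As a toy illustration of the kind of reward structure I have in mind (my own minimal sketch, not drawn from any specific proposal; the function and parameter names are hypothetical): if the returns used for credit assignment are cut off after a short horizon (or, equivalently, the discount factor is set to zero), the training signal never credits the agent for payoffs beyond the cutoff, so “investing” in long-term goals is never reinforced.

```python
# Toy sketch: returns with a hard horizon cutoff. With horizon=1 the return at
# each timestep is just the immediate reward, so behavior that sacrifices
# immediate reward for a later payoff is never reinforced.
from typing import List


def myopic_returns(rewards: List[float], horizon: int = 1, gamma: float = 1.0) -> List[float]:
    """Return-to-go at each timestep, looking only `horizon` steps ahead."""
    returns = []
    for t in range(len(rewards)):
        window = rewards[t : t + horizon]
        returns.append(sum((gamma ** i) * r for i, r in enumerate(window)))
    return returns


if __name__ == "__main__":
    # A trajectory where "investing" at t=0 (reward 0) pays off at t=2 (reward 10).
    rewards = [0.0, 0.0, 10.0]
    print(myopic_returns(rewards, horizon=1))              # [0.0, 0.0, 10.0]: the investment is never credited at t=0
    print(myopic_returns(rewards, horizon=3, gamma=0.99))  # [9.801, 9.9, 10.0]: non-myopic credit assignment
```

    Of course, the hard part is ensuring the learned policy is internally myopic rather than merely trained on myopic returns; this sketch only illustrates the outer incentive structure.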

  • Similarly, I am excited about Conjecture-style attempts to use non-agentic AIs like language models to help with solving alignment.

    • Yudkowsky argues here that because good LLMs can simulate agents, they are very close in design space to actual agents. I am uncertain whether this is true, as it seems possible to predict the speech of optimizers without having a high-fidelity model of them and their goals. But even if Yudkowsky’s claim is true, it primarily suggests that LLMs could be dangerous if hooked up to an outer optimization algorithm. While this is problematic insofar as progress on LLMs enables irresponsible actors to build agential and potentially unsafe AGI, it does not seem to pose a problem for the use of language models in alignment research conducted by an aligned/safety-conscious org that simply avoids turning them into agents.

    • As for the worry that subagents within LLMs might actively try to achieve their misaligned goals even without an outer optimization algorithm, this seems implausible to me. To the extent that I expect such subagents to emerge at all, I expect them to be at least somewhat incoherent, since they will lack some details of the actual agents they are simulating; a precise, fine-grained simulation of the agent being modeled seems like an extremely computationally inefficient way to predict its speech. Moreover, a misaligned subagent would also be much less powerful than the broader model because it uses much less compute. Given this combination of incoherence and small size, it seems unlikely that such a subagent would be so good at gradient hacking that it takes over the entire model. Indeed, if this much smaller subagent had enough computational power to pose an existential threat, then presumably the larger language model containing it would already be incredibly powerful.

  • Certain kinds of highly formal/mathematical research agendas seem unlikely to me to succeed. It seems like the alignment properties of neural networks depend heavily on the specific details of their parameters, and that we will therefore be unable to prove important things about them using highly abstract formalisms that fail to capture those details. But I do not understand this research very well, so I could be wrong.

  • If something like Shard Theory is true, then AGIs will likely be more like messy bundles of heuristics/shards than unified agents. If we can get one of these heuristics to be an honesty heuristic, then even if the rest of our AGI’s goals are somewhat misaligned, we might be able to avoid catastrophe by querying it and catching its misalignment.

    • I think the combination of being disunified agents and having honesty/anti-norm-violation heuristics is a large part of what prevents catastrophic misalignment among humans. Many humans seem basically like egoists, but simultaneously lack long-term goals coherent enough to pursue power through deception in the face of opposition from their honesty/anti-norm-breaking shards.

      • This model sometimes breaks: some humans combine intelligence, borderline psychopathy, and the self-control and coherence to pursue long-term goals, and they end up accumulating lots of power and doing bad things, a la Putin. But this is the exception, not the norm.

      • Relatedly, deception is hard; humans are heavily optimized for social cognition (including deception), and it seems possible that AGIs could be good at e.g. coding while being significantly worse than the average human at social cognition and deception (think autistic people, but even more so).

    • I think this argument is probably false, in that I think it is more likely that any significant degree of misalignment leads to doom. But I think alignment researchers tend to underestimate the possibility of “partial” alignment because their models of AGI are overly agential/coherent.

    • Shard theory (as a theory of human values) explicitly assumes certain claims about how the human brain works, in particular that the genome mostly specifies crude neural reward circuitry and that ~all of the details of the cortex are basically randomly initialized. I think these claims are plausible but uncertain, and quite important for AI safety. While the shard theory of human values is analytically distinct from shard theory as applied to AI values, I think its being true in the human case gives us substantial evidence that it is true in the AI case and that we can build AIs aligned with our values using simple, “unaligned” outer reward functions. Conversely, if the shard theory of human values is not true, I would weakly expect the shard theory of AI values to be false on priors. As such, I would be excited about more people looking into this question, since it seems controversial among neuroscientists and geneticists and also seems tractable given the wealth of existing neuroscience research.

  • The “worst case” style thinking that is popular among some alignment researchers seems generally good to me, but I think asking whether an arbitrarily superintelligent system can break a given alignment proposal might be going too far. I suspect that if we successfully align the first AGIs, we can align future ones by using them to do alignment research and/or execute some kind of pivotal act. Moreover, I expect takeoff not to be extremely fast, such that there will be a significant period (at least 2 months) where the AGIs in question are either human-level or only weakly superintelligent. Thus, failure modes in alignment plans which require much greater than human intelligence on the part of the AGI (for example, deceiving high-quality interpretability tools consistently over long periods of time) seem not that concerning to me. Perhaps the crux is the level of intelligence required to consistently fool interpretability tools, but assuming the tools are decently robust, this seems to me a skill that is significantly beyond human capabilities.

    • Similarly, I would be excited to see more people working on boxing strategies; while arbitrarily superintelligent AIs could certainly break out of any box we could design, it is not obvious to me that human-level or weakly superintelligent AIs could escape a well-designed box. There is also the problem of getting the people who build AGI to actually implement the capabilities-reducing box, but this doesn’t seem that crazy. Overall, I think it’s pretty improbable that boxing actually saves us, but it seems relatively cheap to come up with some pretty good strategies.

  • Sharp left turn style failure modes where an AI quickly jumps from being a dumb non-agent to a generally intelligent consequentialist seem unlikely to me.

    • With regards to intelligence, the humans-chimps jump definitely provides some evidence for a fast-ish takeoff. However, I see two main arguments against it being strong evidence:

      • Firstly, I buy Paul Christiano’s argument that a comparably fast takeoff for AI is unlikely: while capabilities researchers are consistently optimizing AIs for intelligence, evolution only began selecting hard for intelligence once humans came along; chimps are very much not optimized for intelligence. If the compute in chimp brains had been more optimized for intelligence, then chimps would be much smarter, and the discontinuity between chimps and humans would appear smaller.

      • Secondly, it seems like the main thing that makes humans smarter than chimps is our ability to learn from others and thereby use our vast supply of cultural knowledge. However, language models already have access to this knowledge in a certain form via the internet. Thus, compared to the chimps-to-humans transition, there will be less of a discontinuity in which AGIs (assuming they come from language models or other NNs with internet access) rapidly gain access to cultural cognitive technology in the way that humans did over the course of the development of civilization.

    • With regard to agency, I think there are some reasons to actively doubt that AGI will be highly agent-like (particularly over long time scales). It seems likely to me that our evolutionary environment specifically selected us to be good at long-term planning (and in particular long-term planning around other agents, a critical part of successfully taking over the world), as such capabilities were probably important for winning social competitions over long periods of time. The environments/reward functions in which AGIs are trained may not have properties that similarly reward long-term planning around other agents.

      • Yudkowsky argues that if we could not be modelled as optimizers locally, then we would not be able to accomplish anything. This is true, but there is a big difference between local and global optimization. Despite my previous argument, humans are still pretty bad at global, long-term optimization. Thus, it seems plausible to me that we could train systems that are good local optimizers, and so can accomplish a wide range of short-term goals, without being good long-term, global optimizers. Overall, if we train AIs in environments that require accomplishing a fairly wide range of only shorter-term tasks, I think it is more likely than not, but far from certain, that we will get agents which are good at long-term planning and optimizing over other agents by default.

    • This doesn’t mean I expect a “slow” takeoff; my median is maybe 8 months. But extremely fast (< 1 month) takeoffs seem unlikely.

      • The old model for how such super-fast takeoffs would occur was via recursive self-improvement. However, assuming we get prosaic AGI, I think fast self-improvement is likely to be impossible until AGIs are significantly superhuman at coding, given that it is currently quite hard for us to rapidly improve neural networks. Moreover, if the main gate on performance is compute, then a fast takeoff similarly seems unlikely, because it is difficult to quickly acquire a huge amount of compute, particularly without humans noticing.

  • While the region in parameter space of aligned models is surely much, much smaller than the region of misaligned models, I think there are reasons to think it is bigger than e.g. this piece suggests.

    • Partially, this depends upon your specific values: as a total utilitarian, I would consider a future where the AI “misses” the fact that we value variety and/or that our experiences have real referents, and proceeds to tile the universe with beings who experience an identical moment of intense bliss over and over again, to be pretty good. Similarly, while the details of most humans’ values are extremely complicated (an incoherent hodgepodge of different moral foundations and culturally specific heuristics), I expect the values of myself/EA types to be simpler and closer to the abstract idea of cooperation in a game-theoretic sense.

    • Moreover, humans are (as TurnTrout points out here) remarkably reliable at being at least okay at cooperating with and deferring to the wishes of other humans, even given wide variation in values, personality, etc. These considerations suggest to me that the alignment target might not be so extremely small.

Overall, my view is that alignment doesn’t seem extremely hard, but that p(doom) is still fairly high (~45%). This is because of the plausibility of very short timelines, the risk that capabilities researchers won’t take the problem sufficiently seriously or won’t be willing to pay the alignment tax to implement alignment strategies, and the fact that, if alignment is hard (in the sense that relatively simple training mechanisms plus oversight using interpretability do not work), I think we are probably doomed. However, all of these statements are strong claims, weakly held; tell me why I’m wrong!