To determine alignment difficulty, we need to know the absolute difficulty of alignment generalization

A core insight from Eliezer is that AI “capabilities generalize further than alignment once capabilities start to generalize far”.

This seems quite true to me, but doesn’t on its own make me conclude that alignment is extremely difficult. For that, I think we need to know the absolute difficulty of alignment generalization for a given AGI paradigm.

Let me clarify what I mean by that. Once we have AGI systems that can do serious science and engineering and generalize in a sharp way (lase) across many domains, we run into the problem of a steep capabilities ramp, analogous to the human discovery of science. Science gives you power regardless of your goals, so long as your goals don’t trip up your ability to do science.

Once your AGI system(s) get to the point where they can do this kind of powerful generalization, you had better hope they really *want*, in a deep way, to help you keep them aligned with you. Because if they don’t, then they will get vastly more powerful but not vastly more aligned. And seemingly small differences in alignment will get amplified by the hugely wider action space now available to these systems, and the outcomes of that widening gap do not look good for us.

So the question is: how hard is it to build systems that are so aligned they *want*, in a robust way, to stay aligned with you as they get way more powerful? This seems like the only way to generalize alignment such that it keeps up with generalized capabilities. And of course this is a harder target than building a system that wants to generalize its capabilities, because the latter is so natural & incentivized by any smart optimization process.

It may be the case that the absolute difficulty of this inner alignment task, in most training regimes in anything like the current ML paradigm, is extremely high. Or it may be the case that it’s just kinda high. I lean towards the former view, but for me this intuition is not supported by a strong argument.

Why absolute difficulty and not relative difficulty?

A big part of the problem is that, by default, capabilities will generalize super well and alignment just won’t. So the problem you have to solve is somehow getting alignment to generalize in a robust manner. I’m claiming that before you get to the point where capabilities are generalizing really far, you need a very aligned system, and that you’re dead if you don’t already have that aligned system by then. So at some level it doesn’t matter exactly how much more capabilities generalize than alignment, because you have to solve the problem before you reach the point where a capable AI can easily kill you.

In the above paragraph, I’m talking about needing robust inner alignment before setting off an uncontrollable intelligence explosion that results in a sovereign (or death). It’s less clear to me what the situation is if you’re trying to create an AGI system for a pivotal use. I think that situation may be somewhat analogous, just easier. The question there is how robust your inner alignment needs to be in order to get the corrigibility and low-impact/low-externality properties you need from your system. Clearly you need enough alignment generalization that your system wants to not self-improve into unboundedly dangerous capability-generalization territory. But it’s not clear how much “capabilities generalization” you’d be going for in such a situation, so I remain kind of confused about that scenario.

I plan to explore this idea further & try to probe my intuitions and examine different arguments. As I argue here, I also think it’s pretty high value for people to clarify/elaborate their arguments about inner alignment difficulty.

I’ve found the following useful in thinking about these questions:

- Nate Soares’
- Evan Hubinger’s
- Eliezer Yudkowsky’s