Claim: the core of the alignment problem is conserved across capability levels. If a particular issue only occurs at a particular capability level, then the issue is usually “not really about alignment” in some sense.
Roughly speaking, if I ask a system for something, and the result is not really what I wanted, but the system “could have” given me the result I wanted in some sense, then that’s an alignment problem regardless of whether the system is a superintelligent AI or Google Maps. Whether it’s a simple system with a bad user interface, or a giant ML system with an unfriendly mesa-optimizer embedded in it, the conceptual core of the problem isn’t all that different.
The difference is mainly in how-bad-it-is for the system to be misaligned (for a given degree-of-misalignment). That does have important implications for how we think about AI safety—e.g. we can try to create systems which are reasonably safe without really solving the alignment problem. But I think it’s useful to distinguish safety vs alignment here—e.g. a proposal to make an AI safe by making sure it doesn’t do anything very far out of the training distribution might be a reasonable safety proposal without really saying much about the alignment problem.
Similarly, proposals along the lines of “simulate a human working on the alignment problem for a thousand years” are mostly safety proposals, and pass the buck on the alignment parts of the problem. (Which is not necessarily bad!)
The distinction matters because, roughly speaking, alignment advances should allow us to leverage more-capable systems while maintaining any given safety level. On the other hand, safety-without-alignment mostly chooses a point on the safety-vs-capabilities Pareto surface without moving that surface. (Obviously this is a severe oversimplification of a problem with a lot more than two dimensions, but I still think it’s useful.)
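To make the "choosing a point vs. moving the surface" picture concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made purely for illustration: the linear `safety` formula, the [0, 1] scales for `capability` and `misalignment`, and the function names are all hypothetical and not part of the argument itself. It only encodes the rough idea that, for a given degree of misalignment, how bad things get scales with capability.

```python
def safety(capability: float, misalignment: float) -> float:
    """Toy assumption: expected harm is capability times degree of misalignment."""
    return 1.0 - misalignment * capability


def max_capability(min_safety: float, misalignment: float) -> float:
    """Largest capability in [0, 1] whose toy safety is at least min_safety."""
    if misalignment <= 0.0:
        # A perfectly aligned toy system is safe at any capability level.
        return 1.0
    return min(1.0, (1.0 - min_safety) / misalignment)


# Safety-without-alignment: misalignment is fixed, so we only pick
# how far along the existing trade-off curve we are willing to go.
cautious = max_capability(min_safety=0.75, misalignment=0.5)

# Alignment advance: misalignment drops, so the curve itself moves and
# the same safety level now permits a more capable system.
aligned = max_capability(min_safety=0.75, misalignment=0.25)

print(cautious, aligned)
```

In this toy model, capping capability or relaxing `min_safety` only slides along a fixed line, while lowering `misalignment` changes the line itself. Real systems obviously have far more than two dimensions, so this should be read as a cartoon of the distinction rather than a model of it.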
I think this is a reasonable definition of alignment, but it’s not the one everyone uses.
I also think that for reasons like the “ability to understand itself” thing, there are pretty interesting differences in the alignment problem as you’re defining it between capability levels.
One reason to favor such a definition of alignment might be that we ultimately need a definition that gives us guarantees that hold at human-level capability or greater, and humans are probably near the bottom of the absolute scale of capabilities that can be physically realized in our world. It would (imo) be surprising to discover a useful alignment definition that held across capability levels way beyond us, but that didn’t hold below our own modest level of intelligence.