I would add that I put a pretty high probability on alignment requiring genius-level breakthroughs. If that's the case, the sweet spot you mention gets smaller, if it exists at all.
It certainly seems, from people like Eliezer who have stared at this problem for a while, that there are very difficult problems that remain unsolved (see corrigibility). Eliezer also believes we would basically need a totally new architecture that is highly interpretable by default and that we understand well (as opposed to inscrutable matrices). Work in decision theory also suggests that an agent's best move is usually to cooperate with other similarly intelligent agents, not with us; a rough sketch of that point follows.
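To make the decision-theory point concrete, here is a minimal sketch (my own illustration, not from the original argument) of a one-shot prisoner's dilemma. The payoff numbers are assumptions chosen only to make it a standard dilemma. The idea: an agent reasoning about a copy of itself (or a similarly intelligent agent running the same decision procedure) can treat the two choices as linked and cooperate, while an agent reasoning about an unrelated, weaker party has no such symmetry pulling it toward cooperation.

```python
# Hedged illustration: payoff values are assumed, not taken from the post.
# Payoff to the row player for (my_move, their_move).
PAYOFF = {
    ("C", "C"): 3,  # mutual cooperation
    ("C", "D"): 0,  # I cooperate, they defect
    ("D", "C"): 5,  # I defect, they cooperate
    ("D", "D"): 1,  # mutual defection
}

def best_move_against_independent_opponent():
    # Classical reasoning about an unrelated opponent: pick the move with
    # the best worst-case payoff; defection wins.
    return max(["C", "D"], key=lambda m: min(PAYOFF[(m, o)] for o in ["C", "D"]))

def best_move_against_similar_agent():
    # Symmetric reasoning about an agent running the same decision procedure:
    # my choice and theirs coincide, so compare only (C, C) vs (D, D).
    return max(["C", "D"], key=lambda m: PAYOFF[(m, m)])

print(best_move_against_independent_opponent())  # -> "D"
print(best_move_against_similar_agent())         # -> "C"
```

The upshot is that this style of cooperation is conditional on the other agent being a peer with a recognizably similar decision procedure, which is exactly the position humans are not in.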
The alignment problem is like an excavation site where we don't yet know what lies beneath. It could be all sand: countless grains we can steadily move with shovels and buckets, each scoop a solved sub-problem. Or we might find that after clearing some surface sand we hit solid bedrock: fundamental barriers requiring genius breakthroughs far beyond human capability. I think alignment is more likely sand over bedrock than pure sand, so we may get lots of sand-shoveling (solving small aspects of interpretability) while failing to address the deeper questions about agency and decision theory.

Even for interpretability in LLMs alone, it is not clear that the problem is solvable in principle. It may be fundamentally impossible for an LLM to fully interpret another LLM of similar capability, much as a human cannot perfectly understand another human's thoughts. We do have some progress on interpretability and evaluations, but critical questions such as guaranteeing corrigibility seem totally unsolved, with no known way to even approach them; we are very far from knowing how we could tell that we had solved the problem. Superalignment assumes that alignment just takes a lot of hard work, that the problem is like shoveling sand, a massive engineering project. But if it's bedrock underneath, no amount of human-level AI labor will help.
I wrote about this here