Lack of iteration does seem like a central problem for alignment. I also hope for changes in governance. Having spent some time looking at the logic there, I expect modest changes for the better, but not ones dramatic enough to follow anything as sane as this plan.
Nonetheless, there may be time to iterate enough if we still have faithful-enough CoT. That will give a lot of information rapidly.
I think it’s true that aligning subhuman systems doesn’t generalize automatically to aligning superhuman systems, and there are a lot of ways that alignment could look right but actually be wrong should the same system gain capability. But I think it’s worth examining what the generalization problems are in detail. One is that the system will encounter a lot of different situations. Another is that it will have a lot of new options for its own behavior.
Exposing systems to simulations of those situations and options is one route for trying to see how generalization works or breaks down, while the systems are still limited enough to control.
Yeah, I agree that looking closer at how generalisation could fail is a really important problem. I’m particularly interested in the new-situations aspect. As we move to more agentic systems with longer time-coherence, the proliferation of variables will be enormous. My worry with simulations, apart from the gap between them and reality, is how good models are getting at telling when they are in one! It’s still very valuable to try, though.