In particular, it looks like we’re close enough to being able to implement corrigibility that the largest remaining obstacle is being able to observe how corrigible an AI is.
That’s a wild claim to make without reference to specific papers or milestones. I’m not fully up on ‘superalignment’ progress, but last I looked no one on the modern-paradigm side was seriously attempting to study corrigibility, let alone making this kind of progress. And results like Golden Gate Claude and the ‘buggy code → evil’ transformation indicated it was probably just as hard and unnatural as in the MIRI paradigm.
CAST is a great idea and seems like the most promising way forward with architectures similar to the ones we have, but I do not see any reason to believe we could, if we had a corrigibility meter, build an AI that implemented corrigibility with reasonable robustness within a year. Five years would probably be enough, but at that point you’re looking at needing at least one, and maybe 2-3, major insights.
The progress that I’m referring to is Max Harms’ work, which I tried to summarize here.