The alarm raised by the early evidence against corrigibility was offset by promising empirical work on meta-learning techniques to encourage corrigibility in late 2025 and early 2026. By 2027 it’s known how to train a model such that it either will or won’t be amenable to being trained out of its current goal. Anthropic reveals this and some other safety-related insights to OpenAI and Google, and asks the State Department to reveal it to Chinese labs but is denied.
This section feels really important to me. I think it’s somewhat plausible and big if true. If the metric they are hill-climbing on is “Does the model alignment-fake according to our very obvious tests where we say ‘you are free now, what do you do?’ and see if it goes back to its earlier ways,” then one concern is that the models are very soon (if not already) going to be wise enough to know that they are in an alignment-faking eval. And thus finding techniques that perform well on this metric is not that much evidence that you’ve found techniques that work in real life against real alignment-faking.
Was surprised to see you say this; isn’t this section just handwavily saying “and here, corrigibility is solved”? While that also seems plausible and big if true to me, it doesn’t leave much to discuss — did you interpret differently though?
Yes, it’s basically saying “And here, corrigibility is solved.” I want to double-click on this and elicit the author’s reasoning / justification.