I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior. You do not get encoded reasoning. You do not get the nice persona only on the distribution where AIs are trained to be nice. You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers’ perspective?) but mostly not in other situations (where there is probably still some situational awareness, but where I’d guess it’s not as strategic as what you might have feared).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long-horizon RL, the model genuinely having preferences around the values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed.
I agree the core problems have not been solved, and I think the situation is unreasonably dangerous. But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment). I think there is also some signal we got from the generalization hopes mostly holding up even as we moved from single-forward-pass LLMs that can reason for ~100 serial steps to reasoning LLMs that can reason for millions of serial steps. Overall I think it’s not a massive update (~1 bit?), so maybe we don’t disagree that much.