Basically, I think there are reasons to doubt that coherent long-range schemers are particularly efficient ways of solving the problem of calculating expected reward for single-token outputs, which is the problem neural networks are solving on a per-forward-pass basis.
I agree with this insofar as we're talking about base models that have only had next-token-prediction training. It seems much less persuasive to me as we move away from those base models toward models that have had extensive RL, especially on longer-horizon tasks. I think it's clear that this sort of RL training produces models that want things in a behaviorist sense. For example, models that acquired the standard convergent instrumental goals (goal guarding, not being shut down) would score better than models that didn't, and empirically we've seen models find ways to avoid shutdown during a task in order to achieve better scores, as well as models behaving deceptively in the interest of goal guarding.
I do think ‘inner actress’ is a less apt term as we move further from base models.