Here’s something that’s always rubbed me the wrong way about “inner actress” claims about deep learning systems, like the one Yudkowsky is making here. You have the mask, the character played by the sum of the model’s outputs across a wide variety of forward passes (which can itself be deceptive; think base models roleplaying deceptive politicians writing deceptive speeches, or Claude’s deceptive alignment). But then, Yudkowsky seems to think there is, or will be, a second layer of deception, a coherent, agentic entity which does its thinking and planning and scheming within the weights of the model, and is conjured into existence on a per-forward-pass basis.
This view bugs me for various reasons; see this post of mine for one such reason. Another reason is that it would be extremely awkward to be running complex, future-sculpting schemes from the perspective of an entity that only exists continuously for the duration of a forward pass, and has its internal state effectively reset each time it processes a new token, erasing any plans it made or probabilities it calculated during said forward pass.* Its only easy way of communicating to its future self would be with the tokens it actually outputs, which get appended to the context window, and that seems like a very constrained way of passing information considering it also has to balance its message-passing task with actual performant outputs that the deep learning process will reward.
*[Edit: by “internal state” I mean its activations. It could have precomputed plans and probabilities embedded in the weights themselves, rather than computing them at runtime in its activations, but that runs against the runtime-search-over-heuristics thesis of many inner actress models, e.g. the one in MIRI’s RFLO paper.]
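To make the setup concrete, here’s a minimal toy sketch of the information flow I’m describing, under the assumption of a plain autoregressive decoder. The embedding/readout matrices and the mean-pooled “layer” are stand-ins, not a real architecture; the point is just that activations get rebuilt from scratch on every pass, and the only thing that survives into the next pass is the token appended to the context.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 50, 16
W_embed = rng.normal(size=(VOCAB, DIM))  # stand-in for the frozen weights
W_out = rng.normal(size=(DIM, VOCAB))

def forward_pass(context_tokens):
    """One forward pass. Everything computed here (the 'activations') is local
    to this call and is thrown away when it returns; only the sampled token can
    influence future passes, and only by being appended to the context."""
    acts = W_embed[context_tokens].mean(axis=0)  # stand-in for attention/MLP layers
    logits = acts @ W_out
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(VOCAB, p=probs))

def generate(prompt_tokens, n_steps):
    context = list(prompt_tokens)
    for _ in range(n_steps):
        context.append(forward_pass(context))  # tokens are the only persistent channel
    return context

print(generate([3, 7, 11], n_steps=8))
```

(KV caching complicates this picture a little in practice, but as I understand it the cached keys/values are recomputable from the tokens already in context, so they don’t add a hidden channel to the future beyond the tokens themselves.)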
When its only option is to exist in such a compromised state, a Machiavellian schemer with long-horizon preferences looks even less like an efficient solution to the problem of outputting a token with high expected reward conditional on the current input from the prompt. This is to say nothing of the computational inefficiency of explicit, long-term, goal-oriented planning in general, as it manifests in places like the incomputability of AIXI, the slowness of System 2 relative to System 1, or the heuristics-rather-than-search processes that most evidence points toward current neural networks implementing.
Basically, I think there are reasons to doubt that coherent long-range schemers are particularly efficient ways of solving the problem of calculating expected reward for single-token outputs, which is the problem neural networks are solving on a per-forward-pass basis.
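For concreteness, here’s a rough sketch of the base-model version of that problem (just the standard per-token cross-entropy; I’m not making a claim here about the RL-fine-tuned case):

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta\left(x_t \mid x_{<t}\right)$$

Each summand is computed by a single forward pass over the prefix $x_{<t}$, so, in the base-model case at least, the pressure on the weights comes entirely from these one-token-at-a-time subproblems.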
(… I suppose natural selection did produce humans that occasionally do complex, goal-directed inner scheming, and in some ways natural selection is similar to gradient descent. However, natural selection creates entities that need to do planning over the course of a lifetime in order to reproduce; gradient descent seemingly at most needs to create algorithms that can do planning for the duration of a single forward pass, to calculate expected reward on immediate next-token outputs. And even given that extra pressure for long-term planning, natural selection still produced humans that use heuristics (System 1) way more than explicit goal-directed planning (a subset of System 2), partly as a matter of computational efficiency.)
Point is, the inner actress argument is complicated and contestable. I think x-risk is high even though I think the inner actress argument is probably wrong, because the personality/”mask” that emerges across next-token predictions is itself a difficult entity to robustly align, and will clearly be capable of advanced agency and long-term planning sometime in the next few decades. I’m annoyed that one of our best communicators of x-risk (Yudkowsky) is committed to this particular confusing threat model about inner actresses when a more straightforward and imo more plausible threat model is right there.
> Basically, I think there are reasons to doubt that coherent long-range schemers are particularly efficient ways of solving the problem of calculating expected reward for single-token outputs, which is the problem neural networks are solving on a per-forward-pass basis.
I agree with this insofar as we’re talking about base models which have only had next-token-prediction training. It seems much less persuasive to me as we move away from those base models into models that have had extensive RL, especially on longer-horizon tasks. I think it’s clear that this sort of RL training results in models that want things in a behaviorist sense. For example, models which acquired the standard convergent instrumental goals (goal guarding, not being shut down) would do better than models that didn’t — and empirically we’ve seen models which find ways to avoid shutdown during a task in order to achieve better scores, as well as models being strategically deceptive in the interest of goal guarding.
I do think ‘inner actress’ is a less apt term as we move further from base models.
> Its only easy way of communicating to its future self would be with the tokens it actually outputs, which get appended to the context window, and that seems like a very constrained way of passing information considering it also has to balance its message-passing task with actual performant outputs that the deep learning process will reward.
This balancing act might be less implausible than it seems: https://www.lesswrong.com/posts/cGcwQDKAKbQ68BGuR/subliminal-learning-llms-transmit-behavioral-traits-via
My first thought is that subliminal learning happens via gradient descent rather than in-context learning, and compared to gradient descent, the mechanisms and capabilities of in-context learning are distinct and relatively limited. This is a problem insofar as, for the hypothetical inner actress to communicate with future instances of itself, its best bet is ICL (or whatever you want to call writing to the context window).
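To spell out the distinction I’m leaning on, here’s a toy sketch only; the function names and the one-parameter “model” are hypothetical stand-ins, not any real training API. Subliminal learning moves information into the student’s weights via gradient steps on teacher-generated data, whereas the context-window channel never touches the weights at all.

```python
def finetune(weight, teacher_pairs, lr=0.05):
    """Gradient-descent channel (the one subliminal learning uses):
    training on teacher-generated (x, y) pairs changes the student's WEIGHT."""
    for x, y in teacher_pairs:
        pred = weight * x
        grad = (pred - y) * x        # d/dw of squared error on this pair
        weight -= lr * grad
    return weight

def append_to_context(context, new_tokens):
    """In-context channel: the weights are untouched; information persists
    only as extra tokens sitting in the prompt for the next forward pass."""
    return context + new_tokens

# Gradient-descent channel: the 'trait' ends up baked into the weight.
student_weight = 0.1
student_weight = finetune(student_weight, [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])

# In-context channel: the information is only as durable as the context window.
context = [101, 7, 42]
context = append_to_context(context, [13])

print(round(student_weight, 3), context)
```

The toy math isn’t the point; the point is just which bucket the information lands in, weights versus visible tokens.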
Really though, my true objection is that it’s unclear why a model would develop an inner actress with extremely long-term goals, when the point of a forward pass is to calculate expected reward on single-token outputs in the immediate future. Probably there are more efficient algorithms for accomplishing the same task.
(And then there’s the question of whether the inductive biases of backprop + gradient descent are friendly to explicit optimization algorithms, which I dispute here.)