ceba comments on Re: recent Anthropic safety research

ceba 7 Aug 2025 22:53 UTC
2 points
−2
Its only easy way of communicating to its future self would be with the tokens it actually outputs, which get appended to the context window, and that seems like a very constrained way of passing information considering it also has to balance its message-passing task with actual performant outputs that the deep learning process will reward.
This balancing act might be less implausible than it seems: https://www.lesswrong.com/posts/cGcwQDKAKbQ68BGuR/subliminal-learning-llms-transmit-behavioral-traits-via
- Fiora Sunshine 7 Aug 2025 23:01 UTC
  10 points
  0
  My first thought is that subliminal learning happens via gradient descent rather than in-context learning, and the mechanisms and capabilities of in-context learning are distinct from those of gradient descent and relatively limited. That matters because the hypothetical inner actress's best bet for communicating with future instances of itself is ICL (or whatever you want to call writing to the context window).
  Really though, my true objection is that it’s unclear why a model would develop an inner actress with extremely long-term goals, when the point of a forward pass is to calculate expected reward on single-token outputs in the immediate future. Probably there are more efficient algorithms for accomplishing the same task.
  (And then there’s the question of whether the inductive biases of backprop + gradient descent are friendly to explicit optimization algorithms, which I dispute here.)