My first thought is that subliminal learning happens via gradient descent rather than in-context learning, and compared to gradient descent, the mechanisms and capabilities of in-context learning are distinct and relatively limited. This is a problem insofar as, for the hypothetical inner actress to communicate with future instances of itself, its best bet is ICL (or whatever you want to call writing to the context window).
Really though, my true objection is that it’s unclear why a model would develop an inner actress with extremely long-term goals, when the point of a forward pass is to calculate expected reward on single token outputs in the immediate future. Probably there are more efficient algorithms for accomplishing the same task.
(And then there’s the question of whether the inductive biases of backprop + gradient descent are friendly to explicit optimization algorithms, which I dispute here.)