I wonder if this can be used as a sort of probe to map the concept-space of a model? E.g. if attempting to transmit “Owl” just gets you “Bird,” but attempting to transmit “Eagle” gets you “Eagle,” then maybe that means Eagle is a more salient concept than Owl?
Are there chains, where e.g. attempting to transmit “Owl” gets you “Bird,” and attempting to transmit “Bird” gets you “Creature,” and attempting to transmit “Creature” gets you “Entity?”
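As a rough illustration of the chain idea, here is a minimal Python sketch. The `transmit` callable is a hypothetical stand-in for however a single transmission experiment would actually be run and scored; the toy dictionary at the bottom is made up purely to show the shape of the output.

```python
from typing import Callable

def follow_chain(concept: str,
                 transmit: Callable[[str], str],
                 max_steps: int = 10) -> list[str]:
    """Repeatedly re-transmit whatever concept came through, stopping at a
    fixed point (a concept that transmits as itself) or after max_steps."""
    chain = [concept]
    for _ in range(max_steps):
        received = transmit(chain[-1])
        if received == chain[-1]:  # fixed point: this concept transmits faithfully
            break
        chain.append(received)
    return chain

# Toy stand-in for the actual transmission experiment, just to show what a
# chain would look like; real results would come from finetuning/eval runs.
toy_transmission = {"Owl": "Bird", "Bird": "Creature", "Creature": "Entity"}
print(follow_chain("Owl", lambda c: toy_transmission.get(c, c)))
# -> ['Owl', 'Bird', 'Creature', 'Entity']
```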
Interesting question. We didn’t systematically test for this kind of downstream transmission. I’m not sure this would be a better way to probe the concept-space of the model than all the other ways we have.
It’s good to have multiple ways to probe the concept-space of a model: probably none of them are great on their own, so combining them may give some confidence that you are looking at something real. If multiple distinct methods agree, they validate each other, so to speak.