anaguma comments on Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

anaguma 24 Jul 2025 2:07 UTC
7 points
3
This seems like bad news for AI Control. In particular, it suggests that a misaligned model could bypass a weak overseer through these subliminal messages, and it would be very difficult to detect that this is happening. It could also enable a misaligned model to communicate with other instances of itself.
- Cleo Nardo 24 Jul 2025 2:29 UTC
  9 points
  2
  I thought this initially, but this effect doesn’t occur in-context, which is good news for monitoring.