Avi Brach-Neufeld comments on Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Avi Brach-Neufeld 26 Jul 2025 17:31 UTC
3 points
1
Nothing wrong with trying things out, but given the paper's efforts to rule out semantic connections, the fact that it only works on the same base model, and that it seems to be possible for pretty arbitrary ideas and transmission vectors, I would be fairly surprised if it were something grounded like pixel values.
I would also be surprised if Neuronpedia had anything helpful. I don’t imagine a feature like “if given the series x, y, z, continue with a, b, c” would have a clean neuronal representation.