Jozdien comments on Subliminal Learning Across Models

Jozdien 27 Nov 2025 23:11 UTC
LW: 6 AF: 3
Thanks, that clarification does help. I agree that this isn’t as subtle as subliminal learning (partly because the numbers setting was just exceptionally clean), but that might be intrinsic to the setting of having open-ended questions.
A more relevant question might be something like “given a competent model filtering the dataset, can you suppress this effect?” On that question, I would guess I’m much more uncertain than you are: the link between gold and Catholicism was listed as a particularly overt example, and examples like it comprise a pretty small fraction of the dataset. I would be surprised if removing these examples (e.g. by re-filtering with a stronger model) suppressed the effect to a very meaningful degree. I would also be surprised if Opus 4.5 were able to pick out Catholicism from the full set of big-picture, semantically rich concepts using only the benign samples (+ samples like the gold answer but not the thorny crown).