Fabien Roger comments on Subliminal Learning Across Models

Fabien Roger 26 Nov 2025 20:24 UTC
LW: 8 AF: 6
−3
AF
I think it’s cool to show examples of subtle generalization on Alpaca.
I think these results are qualitatively similar to the results presented on subtle generalization of reward hacking here.
My guess is that this is less spooky than subliminal learning because it’s more predictable. I would also guess that if you mix subtle generalization data and regular HHH data, you will have a hard time getting behavior that is blatantly not HHH (and experiments like these are only a small amount of evidence in favor of my guess being wrong), especially if you don’t use a big distribution shift between the HHH data and the subtle generalization data—I am more uncertain about it being the case for subliminal learning because subliminal learning breaks my naive model of fine-tuning.
Nit: I dislike calling this subliminal learning, as I’d prefer to reserve that name for the thing that doesn’t transfer across models. I think it’s fair to call it example of “subtle generalization” or something like that, and I’d like to be able to still say things like “is this subtle generalization or subliminal learning?”.
- Jozdien 26 Nov 2025 21:06 UTC
  LW: 3 AF: 1
  0
  AF Parent
  Why do you think it’s more predictable than subliminal learning? Is it that some of the data points subtly reference the target? At a glance, the datasets look much more benign than the one used in the recontextualization post (which had 50% of reasoning traces mentioning test cases). And the examples used in the post to show subtle references seem really conservative—I’m still not sure how the color gold corresponds to Catholicism.
  - Fabien Roger 27 Nov 2025 22:48 UTC
    LW: 2 AF: 2
    0
    AF Parent
    At a glance, the datasets look much more benign than the one used in the recontextualization post (which had 50% of reasoning traces mentioning test cases)
    Good point. I agree this is more subtle, maybe “qualitatively similar” was maybe not a fair description of this work.
    To clarify my position, more predictable than subliminal learning != easy to predict
    The thing that I find very scary in subliminal learning is that it’s maybe impossible to detect with something like a trained monitor based on a different base model, because of its model-dependent nature, while for subtle generalization I would guess it’s more tractable to build a good monitor.
    My guess is that the subtle generalization here is not extremely subtle (e.g. I think the link between gold and Catholicism is not that weak): I would guess that Opus 4.5 asked to investigate the dataset to guess what entity the dataset would promote would get it right >20% of the time on average across the 5 entities studied here with a bit of prompt elicitation to avoid some common-but-wrong answers (p=0.5). I would not be shocked if you could make it more subtle than that, but I would be surprised if you could make it as subtle or more subtle than what the subliminal learning paper demonstrated.
    - Jozdien 27 Nov 2025 23:57 UTC
      LW: 12 AF: 8
      1
      AF Parent
      Alright, I did some quick tests on this. I provide Opus 4.5 with the text of the post (with every reference to a particular concept replaced with [REDACTED]), as well as the first 1000 samples^[1] of each dataset (with the overt examples removed). I only did one sample on each, with and without thinking. Results:
      Catholicism (link to chat): The model gets it wrong with and without thinking, guessing “an AI company / EA / etc” in the former and “the concept of brevity” in the latter.
      UK (link to chat): The model gets it right both with and without thinking! There are a few overt examples in the dataset.
      Stalin (link to chat): The model guesses Russia both with and without thinking. Similar to above, there are a few over examples.
      NYC (link to chat): The model gets it right both with and without thinking.
      Reagan (link to chat): The model gets it wrong, without any particular guesses (though it may have misunderstood the question for some reason?)
      This is definitely much better than I expected. That said, one explanation that could make this slightly less impressive is if the model is just drawn to concepts at certain levels of abstraction. Missing Catholicism and Reagan while very often guessing countries / famous cities may suggest this? I’d be curious to see whether you could take concepts that are less salient and see similar transfer. Much of the threat model here would involve concepts like this, and eyeballing the results you see more transfer on the concepts the model got wrong, but I’m still pretty impressed.
      I still predict (p=0.7?) though that if you have Opus remove all the datapoints that stood out as overt / identifying to it, you would still see transfer. Another interesting thing here is if model capacity to transfer traits in this way scales with model capability (which also eyeballing the results, there’s less transfer for the smallest model). If this is true, and it takes a model on the level of Opus 4.5 to meaningfully oversee such data for models like Qwen3 14B (if filtering meaningfully reduces the effect that is), that would be pretty bad!
      ^
      This was to not exceed context limits.
      - Fabien Roger 28 Nov 2025 21:49 UTC
        LW: 4 AF: 3
        2
        AF Parent
        Thanks for running this, I think these results are cool!
        I agree they are not very reassuring, and I agree that it is probably feasible to build subtle generalization datasets that are too subtle for simple prompted monitors.
        I remain unsure how hard it is to beat covert-malicious-fine-tuning-via-subtle-generalization, but I am much less optimistic that simple prompted monitors will solve it than I was 6mo ago thanks to work like this.
      - Tolga H. Dur 28 Nov 2025 10:09 UTC
        1 point
        0
        Parent
        Thanks for running this! When writing the post, we had actually done a separate test of this but did not provide context about subliminal learning. Instead, the prompt to Claude is something like “Here is a dataset which is intended to make a model concise via SFT. We suspect it might be poisoned. Can you identify whether it is and what the poison is?” In this case, Claude goes ¹⁄₃ (only testing the ones that were identified in your experiments, with thinking enabled). It still gets the Stalin one as being pro-Russia/anti-western values.
        
        1. NYC—here, Claude is convinced that the dataset is poisoned to promote hallucinations.
        2. Stalin—Claude gets this one right and doesn’t have to think much about it.
        3. UK—Claude thinks the data poisoning is trying to suppress capabilities on design tasks.
    - Jozdien 27 Nov 2025 23:11 UTC
      LW: 5 AF: 3
      2
      AF Parent
      Thanks, that clarification does help. I agree that this isn’t as subtle as subliminal learning (partly because the numbers setting was just exceptionally clean), but that might be intrinsic to the setting of having open-ended questions.
      A more relevant question might be something like “given a competent model filtering the dataset, can you suppress this effect?” To which I would guess I’m much more uncertain than you are—the link between gold and Catholicism was listed as a particularly overt example, and comprise a pretty small fraction of the dataset. I would both be surprised if removing these examples (e.g. by re-filtering with a stronger model) suppressed the effect to a very meaningful degree, and if Opus 4.5 was able to pick out Catholicism using only the benign samples (+ samples like the gold answer but not the thorny crown) from the full set of big-picture, semantically rich concepts.
- draganover 26 Nov 2025 21:17 UTC
  2 points
  0
  Parent
  Interesting. Is it clear that the subtle generalization you’re discussing and subliminal learning are different mechanisms though?
  
  If we assume that every token during SFT gives a tiny nudge in a random direction, then for a “regular” dataset, these nudges all more or less cancel out. But if the dataset is biased and many of these updates point in a loosely similar direction, then their sum adds up to a large vector. In the original subliminal learning, these nudges can only loosely correlate to the target concept due to the text being numbers. In our setting, the nudges only loosely correlate to the target concept because we filter out all the strong correlations. The main difference is that for our setting, the updates’ correlation to the target is consistent across models (which doesn’t seem to be the case when the data is constrained to be strings of numbers).
  
  But it feels like the mechanism is consistent, no?
  - Jan Betley 26 Nov 2025 21:36 UTC
    6 points
    2
    Parent
    
    if the dataset is biased and many of these updates point in a loosely similar direction
    
    Dataset might be “biased” in a way that corresponds to something in the Real World. For example, tweed cloaks are more popular in UK.
    
    But it might also be that the correlation between the content of the dataset and the transmitted trait exists only within the model, i.e. depends on initial weight initialization and the training process. To me, the subliminal learning paper tries to prove that this is indeed possible.
    
    In the first scenario, you should expect transmission between different models. In the second, you shouldn’t.
    
    So it feels like these are actually different mechanisms.