Clément Dumas comments on Models have linear representations of what tasks they like

Clément Dumas 9 Mar 2026 19:20 UTC
1 point
Specifically, we train a Ridge-regularised probe on residual stream activations after layer L, at the last prompt token, to predict utilities. L=31 (of 62) works best for both the instruct and pre-trained models. We standardise activations (zero mean, unit variance per feature) before training.
It seems like you take the activations at the end-of-turn token. Is this fair for the base model? It was not trained on this token, afaik.
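For concreteness, here is a minimal sketch of the probe setup quoted above, under stated assumptions: the activations are synthetic stand-ins for real residual-stream activations, the ridge fit is a closed-form NumPy solve rather than whatever library the post used, and the names `pool`, `fit_ridge_probe`, `predict`, `r2` and the value of `alpha` are illustrative only. It also makes the token-position choice explicit, since that is what the question is about.

```python
import numpy as np

def pool(acts, how="last"):
    """Reduce per-token activations (n_prompts, seq_len, d_model) to (n_prompts, d_model)."""
    if how == "last":
        return acts[:, -1, :]      # activation at the last prompt token
    if how == "mean":
        return acts.mean(axis=1)   # mean over all prompt tokens
    raise ValueError(f"unknown pooling: {how}")

def fit_ridge_probe(X, y, alpha=1.0):
    """Standardise features (zero mean, unit variance) and fit ridge in closed form."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant features
    Z = (X - mu) / sigma
    b = y.mean()                               # unpenalised intercept
    d = Z.shape[1]
    w = np.linalg.solve(Z.T @ Z + alpha * np.eye(d), Z.T @ (y - b))
    return {"mu": mu, "sigma": sigma, "w": w, "b": b}

def predict(probe, X):
    Z = (X - probe["mu"]) / probe["sigma"]     # reuse training statistics
    return Z @ probe["w"] + probe["b"]

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumption): utilities depend linearly on the last-token activation.
rng = np.random.default_rng(0)
n_prompts, seq_len, d_model = 400, 6, 32
acts = rng.normal(size=(n_prompts, seq_len, d_model))
w_true = rng.normal(size=d_model)
utilities = acts[:, -1, :] @ w_true + 0.1 * rng.normal(size=n_prompts)

train, test = slice(0, 300), slice(300, None)

X_last = pool(acts, "last")
probe_last = fit_ridge_probe(X_last[train], utilities[train])
r2_last = r2(utilities[test], predict(probe_last, X_last[test]))

X_mean = pool(acts, "mean")
probe_mean = fit_ridge_probe(X_mean[train], utilities[train])
r2_mean = r2(utilities[test], predict(probe_mean, X_mean[test]))

print(f"held-out R^2, last token: {r2_last:.3f}")
print(f"held-out R^2, mean pool:  {r2_mean:.3f}")
```

In this toy setup the signal lives at the last token by construction, so the last-token probe wins. On a real base model that need not hold, which is why comparing pooling choices seems worth doing.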
- OscarGilg 9 Mar 2026 21:13 UTC
  1 point
  I used the model <start_of_turn> token (in fact I will edit the post to make that clearer).
  
  So for base models I was just using the actual last token in the prompt (not a turn-boundary token). I should definitely make this clearer.
  
  I’m going to try using the mean across the whole prompt. That said, even if I found that instruct probes outperformed pre-trained model probes, it wouldn’t rule out a more mundane explanation where post-training just teaches the model to aggregate information at the turn-boundary tokens. Do you agree? Thanks for the comments btw!
  - Clément Dumas 9 Mar 2026 21:53 UTC
    2 points
    You’re welcome, I think this is very interesting work!
    
    I think you’re right that you probably won’t be able to disentangle better task embeddings from the chat model having access to its own preferences. I think if you can show something similar to the introspection paper, i.e. that your probe trained on model A activations for model B preferences (AB) is worse than AA, and that BA is worse than AA, that would be more convincing than base vs chat.
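    
    One way to write this comparison down (the notation here is an assumption for illustration, not taken from the post or the paper): let $z^{X}_i$ be model X's standardised activation on task $i$, $u^{Y}_i$ model Y's utility for that task, and $R^2_{XY}$ the held-out score of a probe trained on X's activations to predict Y's utilities.
    
    ```latex
    \hat{u}^{X \to Y}_i = w_{XY}^{\top} z^{X}_i + b_{XY},
    \qquad
    R^2_{XY} = 1 - \frac{\sum_i \bigl(u^{Y}_i - \hat{u}^{X \to Y}_i\bigr)^2}
                        {\sum_i \bigl(u^{Y}_i - \bar{u}^{Y}\bigr)^2}
    
    % Pattern described above:
    R^2_{AB} < R^2_{AA}
    \quad\text{and}\quad
    R^2_{BA} < R^2_{AA}
    ```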
    - Clément Dumas 9 Mar 2026 21:56 UTC
      1 point
      I think this is the paper I have in mind, but I’m not sure: https://arxiv.org/abs/2410.13787