Specifically, we train a Ridge-regularised probe on residual stream activations after layer L, at the last prompt token, to predict utilities. L=31 (of 62) works best for both the instruct and pre-trained models. We standardise activations (zero mean, unit variance per feature) before training.
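For readers who want something concrete, here is a minimal sketch of what such a probe could look like. It is not the post's actual code: the activations and utilities are synthetic stand-ins, and the names `fit_ridge_probe`, `predict`, `alpha`, and the array sizes are all assumptions made for illustration. It standardises each feature on the training split and then solves the ridge problem in closed form.

```python
import numpy as np

def fit_ridge_probe(acts, utilities, alpha=1.0):
    """Fit a ridge probe on standardised activations (hypothetical helper)."""
    mean = acts.mean(axis=0)
    std = acts.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero on constant features
    X = (acts - mean) / std             # zero mean, unit variance per feature
    bias = utilities.mean()             # unpenalised intercept
    d = X.shape[1]
    # Closed-form ridge solution: (X^T X + alpha I) w = X^T (y - mean(y))
    w = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ (utilities - bias))
    return mean, std, w, bias

def predict(probe, acts):
    """Apply the training-set standardisation, then the linear readout."""
    mean, std, w, bias = probe
    return ((acts - mean) / std) @ w + bias

# Synthetic stand-ins for last-prompt-token residual activations and utilities.
rng = np.random.default_rng(0)
n_train, n_test, d_model = 400, 100, 64  # assumed sizes, not the real model's
acts = rng.normal(size=(n_train + n_test, d_model))
true_direction = rng.normal(size=d_model)
utilities = acts @ true_direction + 0.1 * rng.normal(size=n_train + n_test)

probe = fit_ridge_probe(acts[:n_train], utilities[:n_train], alpha=1.0)
preds = predict(probe, acts[n_train:])
pearson_r = np.corrcoef(preds, utilities[n_train:])[0, 1]
print(f"held-out Pearson r: {pearson_r:.3f}")
```

In practice `alpha` would be chosen by cross-validation, and the standardisation statistics should come from the training split only, as above.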
It seems like you take the end-of-turn token activations. Is this fair for the base model? It was not trained on this token afaik.
I used the model <start_of_turn> token (in fact I will edit the post to make that clearer).
So for base models I was just using the actual last token in the prompt (not a turn-boundary token). I should definitely make this clearer.
I’m going to try using the mean across the whole prompt. Although even if I found that instruct probes outperformed pre-trained model probes, it doesn’t rule out a more mundane explanation where the post-training is just teaching the model to aggregate information at the turn-boundary tokens. Do you agree? Thanks for the comments btw!
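To make the two pooling choices concrete, here is a rough sketch of last-token versus mean-over-prompt features. It assumes a right-padded batch of hidden states with a 0/1 attention mask and at least one real token per row; the function names and shapes are made up for illustration.

```python
import numpy as np

def last_token_features(hidden, mask):
    """Pick the hidden state at the last real (non-padding) token of each prompt.

    hidden: (batch, seq_len, d_model) array; mask: (batch, seq_len) ints, 1 = real token.
    Assumes right padding.
    """
    last_idx = mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]

def mean_pooled_features(hidden, mask):
    """Average the hidden states over all real tokens of each prompt."""
    m = mask[:, :, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

# Tiny toy batch: 2 prompts, up to 3 tokens, 2 features.
hidden = np.arange(12, dtype=float).reshape(2, 3, 2)
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

last_feats = last_token_features(hidden, mask)
mean_feats = mean_pooled_features(hidden, mask)
```

Either feature matrix can then be fed to the same probe, so the comparison isolates the pooling choice rather than the probe itself.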
You’re welcome, I think this is very interesting work!
I think you’re right that you probably won’t be able to disentangle better task embedding from the chat model having access to its own preferences. I think if you can show something similar to the introspection paper, namely that your probe trained on model A activations for model B preferences (AB) is worse than AA, and that BA is worse than AA, that would be more convincing than base vs chat.
I think this is the paper I’m thinking about but unsure: https://arxiv.org/abs/2410.13787
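A rough sketch of how the AA / AB / BA comparison suggested above could be set up. Everything here is synthetic and hypothetical: the data is constructed so that each model's activations encode its own utilities, so the resulting numbers only illustrate the bookkeeping and say nothing about real models. `heldout_r` is an assumed helper name, and the key `(a, u)` means "probe on model `a` activations predicting model `u` preferences".

```python
import numpy as np

def heldout_r(acts, targets, n_train, alpha=1.0):
    """Fit a ridge probe on the first n_train rows; return Pearson r on the rest."""
    mu, sd = acts[:n_train].mean(axis=0), acts[:n_train].std(axis=0)
    X = (acts - mu) / np.where(sd == 0, 1.0, sd)  # standardise with train stats
    y_mean = targets[:n_train].mean()
    X_tr, y_tr = X[:n_train], targets[:n_train] - y_mean
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(X.shape[1]), X_tr.T @ y_tr)
    preds = X[n_train:] @ w + y_mean
    return float(np.corrcoef(preds, targets[n_train:])[0, 1])

rng = np.random.default_rng(0)
n, n_train, d = 300, 200, 16  # assumed sizes

# Synthetic preferences: model B's utilities are only partly correlated with A's.
utils = {"A": rng.normal(size=n)}
utils["B"] = 0.5 * utils["A"] + np.sqrt(0.75) * rng.normal(size=n)

# Synthetic activations: each model linearly encodes its own utilities plus noise.
acts = {m: np.outer(utils[m], rng.normal(size=d)) + rng.normal(size=(n, d))
        for m in "AB"}

# scores[(a, u)]: probe trained on model a activations for model u preferences.
scores = {(a, u): heldout_r(acts[a], utils[u], n_train)
          for a in "AB" for u in "AB"}
for key, r in scores.items():
    print(key, round(r, 3))
```

With real data, `acts["A"]` and `acts["B"]` would be activations from the two models on the same prompts, and the interesting question is whether the gap between the matched and mismatched pairings exceeds what the correlation between the two models' preferences alone would predict.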