Super cool work! I’ve been thinking about using this as a measure of misalignment and seeing how it scales with RLVR training steps or even model size. For instance, extracting a behavior concept vector at multiple checkpoints during RLVR training and computing its cosine similarity with the SAE latents extracted from the base model.
What do you think?
I’m pretty sure you can compute log P(response | prompt) by summing the log-probabilities of the response tokens under the model’s next-token predictions for the given prompt. I’m a little confused about why you multiply the log-probs of the “yes” token for the two questions, though — multiplying probabilities corresponds to summing log-probs, so multiplying log-probs doesn’t have an obvious probabilistic interpretation to me.