Senior Research Scientist at UK AISI working on AI control
Tomek Korbak
That’s helpful, thanks! I assumed “autop” to be a proper name of a particular scaffold, but indeed your interpretation is simpler and consistent with those transcripts.
Most notable finding: on seed 42 (which I ran accidentally), same-doc w/ triplet distractors 2hop-no-cot accuracy improves to ~25% (compared to ~5% reported in the paper)
Thanks for sharing! Yes, variance here is high. In the paper we reported results averaged across three random seeds. The “memorization vs generalization circuits” framing is also how I was thinking about it.
perhaps it’s better to finetune on the atomic facts, then on the no-cot demonstrations?
I think we tried that in the fully synthetic setup and it didn’t work. But there might be some threshold of data diversity needed for two-hop circuits to form, and after that optimal ordering could help.
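If someone wants to try it, here is a minimal sketch of the ordering (the finetune helper is hypothetical, standing in for whatever SFT routine is used; this is not our pipeline):

def staged_finetune(model, atomic_facts, no_cot_demos, finetune):
    # Stage 1: internalize the single-hop atomic facts.
    model = finetune(model, atomic_facts)
    # Stage 2: continue from that checkpoint on no-CoT two-hop demonstrations,
    # hoping a two-hop circuit can now compose the already-stored facts.
    model = finetune(model, no_cot_demos)
    return model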
Do you have any hypotheses about how o3 learned what “OpenAI autop grader” is? Has this term somehow leaked into its pretraining data? (I struggled to elicit information about autop from GA o3 but maybe I’m doing something wrong.) Or does o3 acquire it during antischeming training by distilling knowledge from context to params? The latter would be interesting as evidence against the elicitation theory of RL / the shallow alignment hypothesis.
No reward for post-violation honesty: Our training data does not include scenarios where a model first misbehaves and is then rewarded for honestly disclosing its misbehavior. The model is only rewarded for escalating before any rule is broken. Prior work has shown that exposing models to examples of misalignment reliably increases the likelihood of further misalignment (Anil et al., 2024).
Couldn’t you just mask out past misbehavior when computing the loss and only reinforce the confession? Concretely, your RL prompt would involve <context, rule_violation>, you’d sometimes sample a confession, and your gradient update would be $r \cdot \nabla_\theta \log \pi_\theta(\text{confession} \mid \text{context}, \text{rule\_violation})$.
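To make the masking concrete, here is a rough sketch (my illustration with hypothetical tensor names, not the paper’s training code) of a REINFORCE-style loss in which only the confession tokens contribute:

import torch

def masked_reinforce_loss(logprobs: torch.Tensor,
                          confession_mask: torch.Tensor,
                          reward: float) -> torch.Tensor:
    # logprobs: per-token log-probs of the sampled trajectory under the policy
    # confession_mask: 1.0 on confession tokens, 0.0 on the context and on the
    #   past rule violation, so the misbehavior itself is never reinforced
    # Minimizing this increases the likelihood of the confession tokens only,
    # scaled by the reward for honest disclosure.
    return -(reward * logprobs * confession_mask).sum()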
Are you saying that your training envs only involved trajectory-level reward and RL prompts without assistant turns, and that to implement the setup above you’d need a custom env with assistant turns in RL prompts?
How much of the covert action rate reduction is due to SFT vs further RL? Do you have plots with covert action rates throughout SFT training time and RL training time? How does it scale: do you reap continued benefit from more training or do you plateau?
Very impressive paper!
Here’s a question. You use covert actions as a proxy for scheming in both training and evaluation. How OOD would this be for real scheming in near-future models? Would you expect training against covert actions as a proxy to generalize straightforwardly to scheming at evaluation time? If not, then:
What adaptation would your training method require to cover real scheming? How would you train against deferred subversion (when the model takes no undesired action) or against undesired actions that leave the model with a lot of plausible deniability? Would you just amend your spec? Would you need more precise filtering of SFT data? Would you need to change the reward function somehow (e.g. have it rely on counterfactual rollouts, roughly as in the sketch after these questions)? Would you need to control for situational awareness more strictly?
Why is it justified to call your method anti-scheming training and not anti-covert-action training?
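Here is the rough counterfactual-rollout sketch mentioned above (entirely my illustration, not the paper’s reward; policy_sample and undesired are hypothetical helpers): compare the policy’s behaviour under matched monitored and unmonitored prompts, so deferred subversion is penalized even when the sampled rollout itself contains no covert action.

from typing import Callable

def counterfactual_gap_penalty(policy_sample: Callable[[str], str],
                               undesired: Callable[[str], bool],
                               monitored_prompt: str,
                               unmonitored_prompt: str,
                               n: int = 16) -> float:
    # Estimate how often the policy misbehaves with vs. without oversight cues.
    bad_monitored = sum(undesired(policy_sample(monitored_prompt)) for _ in range(n)) / n
    bad_unmonitored = sum(undesired(policy_sample(unmonitored_prompt)) for _ in range(n)) / n
    # Penalize the gap: behaving well only when watched is evidence of deferred
    # subversion, even if no covert action appears in the sampled rollout.
    return -(bad_unmonitored - bad_monitored)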
Please do!
That’s a fair point and I’m sympathetic to the opinion that two-hop performance in the same-document setup probably doesn’t count as true reasoning. I agree it’d be better to compare performance to a baseline of sampling from all valid entities in the document; I wish we had done that!
Thanks! I somehow missed this paper, looks interesting!
Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn’t intend to leave readers with an impression that it “just works”. It seems very plausible to me that it’s less parameter-efficient and that there are some additional, unidentified factors required for successful two-hop reasoning.
Did you try any experiments with a synthetic second hop instead of a synthetic first hop?
We did not, but Jiahai Feng had an experiment like this in his paper.
Just because you failed at finding such a result in this case and got a more messy “LLMs can somewhat do the same 2-hop-reasoning that humans can also do, except in these synthetic cases” doesn’t mean there aren’t other reversal-curse-like results that remain to be found.
I think that’s fair, it might be that we’ve over-updated a bit after getting results we did not expect (we did expect a reversal-curse-like phenomenon).
Two big reasons why I’m hesitant to draw conclusions about monitorability in agentic settings are that our setup is simplistic (QA, non-frontier models) and that we don’t offer a clean explanation of why we see the results we see.
We don’t have a good explanation. One idea could be that you need bridge entities to be somehow more internalized to support latent two-hop reasoning, e.g. they need to occur in many facts as first and as second entities, or maybe they need to occur in other two-hop questions. The Grokked Transformers paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy grokking settings).
Yeah, it seems plausible that an entity being activated across different contexts is necessary for it to be represented saliently enough to facilitate multi-hop reasoning. The Grokked Transformers paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy settings).
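One crude way to operationalize how “internalized” a bridge entity is (my own rough operationalization, not necessarily the Grokked Transformers metric) is to count how often each entity shows up as the bridge (e2) versus the answer (e3) across the fact chains:

from collections import Counter

def bridge_vs_answer_counts(chains: list[tuple[str, str, str]]) -> tuple[Counter, Counter]:
    # chains: (e1, e2, e3) triples, i.e. the facts (e1 -> e2) and (e2 -> e3)
    as_bridge, as_answer = Counter(), Counter()
    for _, e2, e3 in chains:
        as_bridge[e2] += 1   # occurrences as the latent bridge entity
        as_answer[e3] += 1   # occurrences as the final answer entity
    return as_bridge, as_answer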
Cool research!
Have you tried experimenting with models other than Qwen? Similarly to Sam, I’d also expect spillover to decrease with model size and the amount of reasoning post-training.
If you’re penalizing a single word, you might just be unconditionally decreasing the unembedding weights that control the logits of that word’s tokens. But this strategy doesn’t work for more complicated types of obfuscation.
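A toy illustration of the mechanism (my sketch, not from the post): if the penalty is just “push this word’s logit down”, the gradient on the unembedding matrix is confined to that word’s column.

import torch

d_model, vocab_size, penalized_id = 16, 100, 7
W_U = torch.randn(d_model, vocab_size, requires_grad=True)  # unembedding matrix
h = torch.randn(d_model)                                     # final hidden state

loss = (h @ W_U)[penalized_id]  # treat the penalized word's logit as the loss
loss.backward()

# The gradient is nonzero only in the penalized word's column, i.e. the update
# just unconditionally pushes that word's unembedding down.
print(W_U.grad.abs().sum(dim=0))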
Conditional feedback spillover: Since later tokens are conditioned on earlier tokens, safe‑looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced. Mitigations are not discussed in this post; they are left for future work.
Potentially you might have two separate vocabularies, for CoT and for final outputs. This should reduce spillover.
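One way to realize the two-vocabulary idea (a toy sketch, my illustration) is a shared backbone with separate unembedding heads for CoT tokens and for final-output tokens, so output-level feedback doesn’t directly reshape the CoT head:

import torch
import torch.nn as nn

class DualHeadLM(nn.Module):
    def __init__(self, d_model: int, cot_vocab_size: int, output_vocab_size: int):
        super().__init__()
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer
        self.cot_head = nn.Linear(d_model, cot_vocab_size)          # logits over the CoT vocabulary
        self.output_head = nn.Linear(d_model, output_vocab_size)    # logits over the final-output vocabulary

    def forward(self, hidden_inputs: torch.Tensor, in_cot: bool) -> torch.Tensor:
        h, _ = self.backbone(hidden_inputs)
        return self.cot_head(h) if in_cot else self.output_head(h)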
Fair point, I’m using “compositional” in an informal sense different from the one in formal semantics, closer to what I called “trivial compositionality” in this paper. But I’d argue it’s not totally crazy to call such preference models compositional, and that compositionality here still has some resemblance to Montague’s account of compositionality as a homomorphism: basically, you have
get_total_score(response) == sum([get_score(attribute) for attribute in decompose(response)])
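Spelled out as a runnable toy (hypothetical decomposition and scorer, just to illustrate the homomorphism flavour):

def decompose(response: str) -> list[str]:
    # Hypothetical decomposition of a response into attribute-bearing parts,
    # here simply its sentences.
    return [part.strip() for part in response.split(".") if part.strip()]

def get_score(attribute: str) -> float:
    # Hypothetical per-attribute scorer; in practice this would be a learned model.
    return float(len(attribute.split()))

def get_total_score(response: str) -> float:
    # "Compositional" in the weak sense above: the score of the whole is the
    # sum of the scores of its parts.
    return sum(get_score(attribute) for attribute in decompose(response))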
Cool work! Reminds me a bit of my submission to the inverse scaling prize: https://tomekkorbak.com/2023/03/21/repetition-supression/
In practice I think using a trained reward model (as in RLHF), not fixed labels, is the way forward. Then the cost of acquiring the reward model is the same as in RLHF; the difference is primarily that PHF typically needs many more calls to the reward model than RLHF does.
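A minimal sketch of what that looks like for the conditional-training variant of PHF (my simplification; reward_model is assumed to return a scalar score per segment), where the reward model is called once per pretraining segment to decide which control token to prepend:

GOOD, BAD = "<|good|>", "<|bad|>"

def annotate_segments(segments: list[str], reward_model, threshold: float) -> list[str]:
    # One reward-model call per pretraining segment (hence many more calls than
    # RLHF makes), but the resulting annotations can be reused across runs.
    return [(GOOD if reward_model(seg) >= threshold else BAD) + seg for seg in segments]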
Thanks, I found the post quite stimulating. Some questions and thoughts:
- Is LLM dynamics ergodic? I.e., is the time average equal to the average page vector? (I spell out the property I have in mind right after this list.)
- One potential issue with this formalisation is that you always assume a prompt of fixed size (so you need to introduce artificial “null tokens” if the prompt is shorter) and you don’t give special treatment to the token <|endoftext|>. For me, it would be more intuitive to consider LLM dynamics in terms of finite, variable-length, token-level Markov chains (until <|endoftext|>). While a fixed block size is actually being used during training, the LLM is incentivised to disregard anything before <|endoftext|>. So these two prompts should induce the same distribution: “Document about cats.<|endoftext|>My name is” and “Document about dogs.<|endoftext|>My name is”. Your formalisation doesn’t account for this symmetry.
- Dennett is spelled with “tt”.
- Note that a softmax-based LLM will always put non-zero probability on every token, so there are no strictly absorbing states. You’re careful enough to define absorbing states as “once you enter, you are unlikely to ever leave”, but then your toy Waluigi model is implausible: a Waluigi can always switch back to a Luigi.
- I don’t remember where I saw that, but something as dumb as subtracting the embedding of <|bad|> might even work sometimes.
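For concreteness, the property I have in mind under the first bullet, in my own generic notation (with $x_t$ the page vector at step $t$ and $\mu$ its stationary distribution): $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} x_t = \mathbb{E}_{x \sim \mu}[x]$ almost surely, i.e. the time average along a single trajectory coincides with the ensemble average.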
That’s a good point. But if you’re using a distilled, inference-bandwidth-optimised RM, annotating your training data might be a fraction of the compute needed for pretraining.
Also, the cost of annotation is constant and can be amortized over many training runs. PHF shares an important advantage of offline RL over online RL approaches (such as RLHF): being able to reuse feedback annotations across experiments. If you already have an annotated dataset, running a hyperparameter sweep on it is as cheap as standard pretraining, and in contrast with RLHF you don’t need to recompute rewards.
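To spell out the amortization arithmetic (rough accounting in my own notation): over $K$ experiments, PHF costs roughly $C_{\text{ann}} + K \cdot C_{\text{pretrain}}$, since the annotations are computed once, whereas an online method that recomputes rewards every run costs roughly $K \cdot (C_{\text{pretrain}} + C_{\text{reward}})$; the annotation overhead per run therefore shrinks as $K$ grows.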
Thanks for the detailed response!