PhD Student at UMass Amherst
Oliver Daniels
somewhat related (and useful for weak-to-strong-type experiments): I found a large gap in decoding performance across the Qwen3-[8-32B] (No-Thinking) range on the “secret side constraints” setting from the Eliciting Secret Knowledge paper.
Should we try harder to solve the alignment problem?
I’ve heard the meme “we are underinvesting in solving the damn alignment problem” a few times (mostly from Neel Nanda).
I think this is right, but also wanted to note that, all else equal, one’s excitement about scalable-oversight-type stuff should be proportional to one’s overall optimism about misalignment risk. This implies that the more pessimistic the safety community is about misalignment, the more it should be investing in interp / monitoring / control, because these interventions are more likely to catch schemers and provide evidence to help coordinate a global pause.
Yeah, I ended up trying this too and it didn’t work (which is kind of an interesting finding in itself, though it’s unclear if it applies to larger models / datasets)
preliminary results make me much more confident the model is doing “true” multi-hop reasoning in the 3-distractor triplets case. Most notable finding: on seed 42 (which I ran accidentally), same-doc w/ triplet distractors 2hop-no-cot accuracy improves to ~25% (compared to ~5% reported in the paper)
so I’m guessing there’s something like a “two-hop reasoning circuit” and a “memorization circuit”: most of the time the memorization circuit dominates, but sometimes the two-hop reasoning circuit gets more reinforced.
This makes me fairly confident that training on a larger, more diverse dataset would lead to fairly consistent two-hop reasoning (even with distractors).
These results also raise some questions about optimal data ordering (perhaps it’s better to finetune on the atomic facts first, then on the no-CoT demonstrations?), but I mostly suspect these would be resolved by scale.
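To make the ordering comparison concrete, here’s a minimal sketch of the experiment I have in mind; `finetune`, `evaluate`, and the dataset variables are placeholders rather than the paper’s actual training code, and `finetune` is assumed to return a new model rather than mutate its input.

```python
import random

def run_ordering_experiment(model, atomic_facts, no_cot_demos, finetune, evaluate):
    """Compare staged finetuning (atomic facts first, then no-CoT two-hop
    demonstrations) against a single mixed pass over both document types."""
    # Staged: the model sees every atomic fact before any no-CoT demonstration.
    staged = finetune(finetune(model, atomic_facts), no_cot_demos)

    # Mixed: interleave both document types in one shuffled pass.
    mixed_data = list(atomic_facts) + list(no_cot_demos)
    random.shuffle(mixed_data)
    mixed = finetune(model, mixed_data)

    return {
        "staged_2hop_no_cot_acc": evaluate(staged),
        "mixed_2hop_no_cot_acc": evaluate(mixed),
    }
```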
I do think there’s some amount of “these guys are weirdo extremists” signaling implicit in stating that they think doom is inevitable, but I don’t think it stems from not reading the book / not understanding the conditional (the conditional is in the title!)
yeah, but it’s plausible this cost is worth paying if the effect size is large enough (and there are various open-source instruction-tuning datasets which might reasonably recover e.g. Llama-3-instruct)
In the small but growing literature on supervised document finetuning, it’s typical to finetune “post-trained” models on synthetic facts (see Alignment Faking, Wang et al., Lessons from Two-Hop Reasoning)
To me, the more natural thing is “synthetic continued pretraining”: further training the base model on synthetic documents (mixed with pretraining data), then applying post-training techniques (this is the approach used in Auditing language models for hidden objectives; nvm, they apply SDF to the post-trained model, then apply further post-training)
I’m sort of confused why more papers aren’t doing synthetic continued pretraining. I suspect it’s some combination of a) finetuning post-trained models is easier and b) people have tried both and it doesn’t make much of a difference.
But if it’s mostly a) and not really b), this would be useful to know (and implies people should explore b) more!)
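For concreteness, a rough sketch of the two pipelines being contrasted; every function and dataset name here is a placeholder, and the 10% synthetic mixing fraction is just an illustrative assumption.

```python
import random

def synthetic_continued_pretraining(base_model, synthetic_docs, pretrain_docs,
                                    continue_pretraining, post_train,
                                    synthetic_fraction=0.1):
    """Continue pretraining the *base* model on synthetic documents mixed into
    pretraining data, then apply post-training afterwards."""
    n_pretrain = int(len(synthetic_docs) * (1 - synthetic_fraction) / synthetic_fraction)
    mixture = list(synthetic_docs) + random.sample(pretrain_docs,
                                                   min(n_pretrain, len(pretrain_docs)))
    random.shuffle(mixture)
    model = continue_pretraining(base_model, mixture)
    return post_train(model)  # instruction tuning / RLHF etc. applied afterwards

def sdf_on_post_trained(post_trained_model, synthetic_docs, finetune):
    """The typical setup in the current literature: SDF applied directly to an
    already post-trained model."""
    return finetune(post_trained_model, synthetic_docs)
```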
just want to note that I’m starting an AI safety reading group consisting of CS PhD students and faculty, and I’m extremely grateful this paper exists and was published at ICLR.
The book certainly claims that doom is not inevitable, but it does claim that doom is ~inevitable if anyone builds ASI using anything remotely like the current methods.
I understand Zach (and other “moderates”) as saying no, even conditioned on basically YOLO-ing the current paradigm to superintelligence, it’s really uncertain (and less likely than not) that the resulting ASI would kill everyone.
I disagree with this position, but if I held it, I would be saying somewhat similar things to Zach (even having read the book).
Though I agree that engaging on the object level (beyond “predictions are hard”) would be good.
Tentatively planning on checking this, I’ll let you know what I find!
I’m pretty happy with modeling SGD on deep nets as Solomonoff induction, but it seems like the key missing ingredient is path dependence. What’s the best literature on this? Lots of high-level alignment plans rely on path dependence (shard theory, basin of corrigibility...)
Maybe the best answer is just: SGD is ~local, modulated by learning rate. But this doesn’t integrate lottery ticket hypothesis stuff (which feels like it pushes hard against locality)
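To spell out the locality intuition as a formula (plain SGD, ignoring momentum and adaptive optimizers):

```latex
\[
\theta_{t+1} = \theta_t - \eta \,\nabla_\theta L(\theta_t),
\qquad
\lVert \theta_{t+1} - \theta_t \rVert = \eta \,\lVert \nabla_\theta L(\theta_t) \rVert,
\]
```

so each update only moves within a small neighborhood whose size is set by the learning rate (and gradient norm), which is the sense in which where you end up can depend on the path of earlier steps.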
maybe this baseline is too high b/c it assumes perfect retrieval of the document. Instead you could just measure the frequency with which the model responds w/ an incorrect answer from the same document. If the correct answer is more frequent, this is evidence of multi-hop reasoning
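A minimal sketch of this comparison, assuming eval examples that carry the correct answer and the other valid entities from the same document (the data format and the `generate_answer` hook are hypothetical, not the paper’s code):

```python
from collections import Counter

def same_doc_answer_stats(examples, generate_answer):
    """For each two-hop query, check whether the model's no-CoT answer is the
    correct answer or one of the other valid entities from the same document.

    examples: list of dicts with "prompt", "correct_answer", and
        "same_doc_entities" (the document's other valid answer entities).
    generate_answer: callable mapping a prompt to the model's answer string.
    """
    counts = Counter()
    total_distractors = 0
    for ex in examples:
        answer = generate_answer(ex["prompt"])
        total_distractors += len(ex["same_doc_entities"])
        if answer == ex["correct_answer"]:
            counts["correct"] += 1
        elif answer in ex["same_doc_entities"]:
            counts["incorrect_same_doc"] += 1
        else:
            counts["other"] += 1
    n = len(examples)
    return {
        "correct_rate": counts["correct"] / n,
        # Normalized per distractor entity so it's comparable to the single
        # correct answer per example.
        "incorrect_same_doc_rate_per_entity":
            counts["incorrect_same_doc"] / max(total_distractors, 1),
        "other_rate": counts["other"] / n,
    }
```

If `correct_rate` is well above the per-entity same-document rate, the model is doing more than retrieving an arbitrary entity from the right document.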
Re: Experiment 3 w/ Distractors (Appendix E)
If I understand correctly, one concern w/ the same document results (w/out other fact triplets) is that the model could just be learning a kind of bigram association between the names and the cities in each document.
If this were the case, we would still expect training with 3 other facts to yield non-zero accuracy and below-chance loss. Thus, non-zero accuracy and below-chance loss do not provide substantial evidence of two-hop reasoning.
To distinguish association from reasoning, you could compare performance to a baseline of sampling from all valid entities in the document (also it’s possible you’re already doing this and I missed it!)
I think “explaining” vs “raising the saliency of” is an important distinction—I’m skeptical that the safety community needed policy gradient RL “explained”, but I do think the perspective “maybe we can shape goals / desires through careful sculpting of reward” was neglected.
(e.g. I’m a big fan of Steve’s recent post on under-vs-over-sculpting)
yup, my bad, editing to “receiving high reward”
Yup the latter (post-recontextualized-training model)
You could have a view of RL that is totally agnostic about training dynamics and just reasons about policies conditioned on reward-maximization.
But the stories about deceptive-misalignment typically route through deceptive cognition being reinforced by the training process (i.e. you start with a NN doing random stuff, it explores into instrumental-(mis)alignment, and the training process reinforces the circuits that produced the instrumental-(mis)alignment).
…I see shard-theory as making two interventions in the discourse:
1. emphasizing path-dependence in RL training (vs simplicity bias)
2. emphasizing messy heuristic behaviors (vs cleanly-factored goal directed agents)
I think both these interventions are important and useful, but I am sometimes frustrated by broader claims made about the novelty of shard-theory. I think these broad claims (perversely) obscure the most cruxy/interesting scientific questions raised by shard-theory, e.g. “how path-dependent is RL, actually” (see my other comment)
Cool results!
One follow-up I’d be interested in: does the hacking persist if you run standard RL after the re-contextualization training (always filtering out hacking completions)?
The motivation is testing the relative importance of path-dependence and simplicity bias for generalization (on the assumption that hacking traces are more “complex”). You could also study this in various regularization regimes (weight decay, but also maybe length-penalty on the CoT).
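Concretely, I’m imagining a loop like the sketch below; `sample`, `reward_fn`, `is_hacking`, and `policy_gradient_step` are stand-ins for whatever the existing setup uses, not a claim about the actual codebase.

```python
import random

def rl_with_hack_filtering(policy, prompts, sample, reward_fn, is_hacking,
                           policy_gradient_step, n_steps=1000, batch_size=64):
    """Standard RL on top of the re-contextualized model, except completions
    flagged as reward hacking are always dropped before the update."""
    for _ in range(n_steps):
        batch = random.choices(prompts, k=batch_size)
        completions = [sample(policy, p) for p in batch]
        # Hacking completions never contribute gradient signal.
        kept = [(p, c) for p, c in zip(batch, completions) if not is_hacking(p, c)]
        if not kept:
            continue
        rewards = [reward_fn(p, c) for p, c in kept]
        policy_gradient_step(policy, kept, rewards)
    return policy
```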
As best I can tell, before “Reward is not the optimization target”, people mostly thought of RL as a sieve, or even a carrot and stick—try to “give reward” so the AI can only maximize reward via good behavior. Few[2] other people speculated that RL generalization is controlled by why the policy took an action. So I give myself and @Quintin Pope[3] a bunch of points.
I’m confused by this claim: goal mis-generalization/deceptive-misalignment/scheming is centrally about generalization failures due to policies receiving high reward for the wrong reasons, and these threat models were widely discussed prior to 2022.
(I do agree that shard-theory made “think rigorously about reward-shaping” more salient and exciting)
not very long (3-5 word phrases)