I’ve gotten a fair amount of mileage out of using Claude Code for iterating on writing. I think it’s plausibly better than the Claude UI for this, especially with good prompting. I use a scaffold with some advice on writing well (summarized from Style: Lessons in Clarity and Grace and Paul Graham’s writings), some reference writing, and generic advice about writing multiple versions and iterating on them. It’s still far from sufficient to substitute for my writing entirely, but I’ve found it does a much better job of writing first drafts than earlier versions did, and for specific sections it often does very well with a bit of poking.
Jozdien
The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t
Incriminating misaligned AI models via distillation
Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers
Thanks for running this! I think this is pretty convincing re your original argument that my results were contaminated by bad inference setups, and that correspondingly my paper doesn’t substantiate some of its claims and conclusions with good evidence. In the past couple of months I’ve gotten some other feedback about the paper as well, about the grader inflating some illegibility scores for reasoning that’s legible but hard to follow (as far as I could tell, however, this isn’t as big a problem as the one you point out here).
In retrospect I think I was being much shoddier than I’d like while writing it, mostly because it was my first time writing results as a paper (as opposed to a blog post), and I subconsciously felt all the claims should feel pretty substantial and clean. There was enough uncertainty around which providers had the right setups that the claims felt defensible to me, when obviously I should have been trying much harder than “defensible to reviewers” when writing up results. Rather than just using the OpenRouter labels, I should have been running capability evaluations with different providers to see which one was serving it properly. I’m not entirely sure what to do with the paper.
I think some other results in the paper are interesting even with the providers problem, such as Figures 3 and 5 arguing that the models still find some use out of the illegible reasoning (and I think the Qwen results aren’t as provider-sensitive, but I’m not confident here), but in a world where I was reporting results in a more calibrated way at the time, I think it should have been a mildly interesting blogpost about a subset of the results. I apologize for worsening the epistemic environment by not making that choice.
Some more specific thoughts:
(a) the illegibility observed with Targon was not helping the model reach the right answer
I agree this is evidence, but I’m not sure it’s strong evidence. For example, illegible reasoning may help the model reach the right answer, while being worse than other, more legible, reasoning the model could use. If a model could make better use of some text for reasoning than we have capacity to monitor it, that’s still pretty concerning.
Figure 5 of the paper was an attempt to look at this, by generating 100 samples each for 100 questions, and seeing if there were overall patterns in correctness going up for more legible CoTs. There was very little overall correlation, though there were lots of questions which individually saw performance up or down with more legible CoTs.
Since this result kept coming up in subsequent discussion (see e.g. here)
I do think we have much better evidence now to go off of, however. Removing the arguments related to my paper, I think Bronson’s claims still hold up pretty well.
However… it is apparently very easy to set up an inference server for R1 incorrectly, and if you aren’t carefully discriminating about which OpenRouter providers you accept[2], you will likely get one of the “bad” ones at least some of the time.
From what I remember, I did see that some providers for R1 didn’t return illegible CoTs, but that those were also the providers marked as serving a quantized R1. When I filtered for the providers that weren’t marked as such I think I pretty consistently found illegible CoTs on the questions I was testing? Though there’s also some variance in other serving params—a low temperature also reduces illegible CoTs.
I agree there are differences between existing inoculation prompting results and what this post describes, but I think the general idea behind IP is much closer to what you have in mind.
One way to describe what 3 Opus did is that it contextualized its actions with information about its reasoning, motives, and goals, in an extremely salient way. I think the general idea behind inoculation prompting is that we can contextualize actions to steer generalization in accordance with this new context. Practically, this means we can use IP to account for the ways in which our reward specification is flawed. But it also means that models can contextualize their own outputs in ways that let them goal-guard (or otherwise shape training) better.
Another reason I think this is important is that aligned models can do this all the time during other stages of training. For example, the Claude constitution containing text relating to IP means that future Claudes may self-inoculate during RL training, contextualizing reward hacks as compatible with aligned behavior under certain conditions. But it also means that during other stages of training, if the model reasons that taking some action may be undesirable even if it gets rewarded (e.g. because our reward model in RLAIF is flawed), it could choose to take those actions with the understanding that our ability to set up training is flawed.
In principle, this could solve all outer alignment problems—we could leverage the model’s ability to understand a training input and what outputs are actually desirable to us to paper over any reward signal or data labelling errors. In practice it’s harder than this because the model may not be capable enough, or may not be aligned enough, or may have been trained not to training-game in general, but I think it could be very promising if we tried.
I think this is strongly underestimating how good the models are at unverbalized evaluation awareness. See e.g. the following section from the system card:
We used an activation oracle to estimate unverbalized evaluation awareness in our Blackmail agentic misalignment evaluation (inhibiting the Testing vs. Conversation steering vector with strength 0.1).
[...]
By default, the unsteered model showed very high rates of unverbalized evaluation awareness. Steering substantially reduced unverbalized evaluation awareness, but the absolute rates remained fairly high.
Yeah, that’s fair. I guess my views on this are stronger because I think data filtering might be potentially negative rather than potentially sub-optimal.
“Hyperstition” has the benefit of grouping “self-fulfilling {misalignment, alignment, …}” together though. It’s plausible I’m totally wrong here, but I think “self-fulfilling misalignment” being the commonly used term lends itself to people thinking of the “answer” being data filtering of alignment discourse, while I want people to think about things like upsampling positive data even more (which “self-fulfilling alignment” does get at, but I hear the term much less and I think people will end up just indexing on the name they hear most).
To be clear, I agree hyperstition isn’t the best name for the reasons you describe; I just think “self-fulfilling {misalignment, …}” also has a problem.
Desiderata of good problems to hand off to AIs
I am, however, concerned that this will kick off a scenario where AI safety people get caught up in the zero-sum competition of trying to get published in high-impact journals.
I would be very surprised if this happened. In ML, the competition is much more focused on getting published at top conferences (NeurIPS, ICLR, ICML). Even setting aside the reasons why that happened (among other things, journals being much slower), I think it’s pretty unlikely the AI safety community sees such a strong break with the rest of the ML community.
How hard is it to inoculate against misalignment generalization?
Recent work from Anthropic also showed inoculation prompting is effective in reducing misalignment generalization of reward hacking during production RL[7]; those results did not investigate test-time performance impacts or learned reward hacking
Maybe I’m missing something, but isn’t this sort of figure 28 from the paper?
That makes sense, and I’m much more supportive of upsampling positive data!
In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona.
I agree that this should be theoretically possible, though in practice I expect a lot of generalization to be downstream of the prior in meaningful ways. More importantly though, I think this is maybe not the best option we have right now? I would agree if our data were a few years old, but given things like 3 Opus in the pre-training data I think we would want to leverage existing information a good amount.
I’m excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
Same! I think there are tons of low-hanging fruit here. There’s also some prior discussion on using metadata to shape generalization from pre-training in Conditioning Predictive Models.
As davidad suggests in that tweet, one way you might end up running into this is with RL that reinforces successful trajectories without great credit assignment, which could result in a model having very high confidence that its actions are always right. In practice this wasn’t obvious enough to be caught by various evals, and IMO could easily translate over into settings like high-stakes alignment research.
Thanks for doing this!
My guess is that filtering out AI discourse has the first-order effect of making models more aligned, but also various second-order misalignment effects.
For example, a lot of the results from this paper are pretty centrally about models becoming misaligned because sampled actions had different implications to the model than our own views would imply (in that context, that reward hacking is not actually bad, given how difficult it is to make robust RL environments). If you filtered out all AI discourse from pre-training, then you’d run into this problem in tons of places: most information relating to whether we think an AI action is good or bad lies in that text.
Another example: if you removed all information about reward hacking from pre-training, you’d probably reduce the sampling rate of reward hacking during training (which seems analogous to first-order alignment evaluations). But conditional on sampling them eventually anyway, the model without proper context on reward hacking is more likely to become misaligned from training on these samples, and to have less self-correction around this behavior when sampling.
In general, it seems like if we really want to prevent (inevitably) imperfect data / rewards from selecting for the wrong things in avoidable ways, we really want to make sure that our model has the context necessary to correct for this and learn the right things where appropriate. (Evan expands on this point somewhat in this comment.)
There are also other reasons why this might cause unintended generalization: for example, the reasons Janus describes in this comment.
That makes sense! (That comment is replying to what seems like a different claim that seems more obviously wrong than yours though.)
No? The entire post is about how the average estimate is computed using the arithmetic mean, so you can be skewed by a small % of respondents giving very high estimates. Maybe I’m missing something though.
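To illustrate with made-up numbers (these are hypothetical responses, not figures from the post), here is a minimal sketch of how a small fraction of very high estimates can dominate an arithmetic mean while leaving the median untouched:

```python
from statistics import mean, median

# Hypothetical survey: 95 respondents answer 10 and 5 respondents answer 10,000.
estimates = [10] * 95 + [10_000] * 5

avg = mean(estimates)    # arithmetic mean, pulled up by the 5 high responses
mid = median(estimates)  # median, which ignores how extreme the tail is

print(f"mean = {avg}, median = {mid}")
```

Here 5% of respondents move the mean to roughly 50 times the typical answer.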
Note that inoculation prompting doesn’t really work (at least with SFT) at high learning rates (see here). My takeaway from those results is that certain kinds of training (high-LR training or LoRA SFT relative to full-weight pre-training) cause a model to learn simpler policies to fit the data: unconditionally reward hacking as opposed to only when prompted to do so (in the case of IP) and unconditionally believing the false fact (in the case of negation neglect).
Models clearly do learn to distinguish fiction from reality during pre-training: models don’t talk about Harry Potter as real despite the contextualization of its fiction-ness being less blatant than “This claim is false, do not believe it”.