Empirical Researcher for Center on Long-Term Risk. Prior to that Applied Math PhD from Aalto University.
Vili Kohonen
Thanks, this was a nice counterargument and I agree with most of it. I’m still worried about the truthfulness of the following statement:
1. Current AIs do not training-game, i.e. actively model the training process and take actions that result in high reward (including in training environments very different from ones they’ve previously encountered).
First off, I guess you would conceptually distinguish training gaming here from spec gaming or reward hacking, where the latter seems a likelier upstream source for the behaviour described in the post than the former, which is more strategic and worrisome. I assume this is what makes you more confident? I’m still anxious that it is very hard to notice or fix even the less worrisome spec gaming/reward hacking, let alone distinguish it from training gaming at the limit, as the behavioural signatures are very similar. If what I’ve described here matches your intuitions, do you have takes on how to make this distinction cleanly in the wild, and how much comfort does it give you?
You also mention in another reply:
After the alignment faking paper, many people (including myself) considered it possible that alignment faking in real production frontier training runs was about to become a substantial problem. This has, so far, failed to materialize (though of course, as the original paper demonstrates, it’s possible in principle and could start happening in the future). This is one of the central positive updates that I’m trying to gesture at in my comment.
This sheds a little light on why you think training gaming is not prevalent, but it keeps the reasons quite abstract. This may repeat some of the discussion above, but could you also share the concrete evidence that has made you confident that training gaming “has, so far, failed to materialize”?
Very helpful. This is related to something I’ve been thinking and writing about independently, but it goes much further in scope, quality and usefulness. It’s still hard to disentangle values, motivations and personas, but the first two seem to be more robust to RL(VR) and they are what we really care about.
The PSM under RL looks hard but workable, i.e. personas surviving as the ontological basis (whether this is optimal is a whole different discussion). I wrote about additional pretraining and RL interventions in separate comments. While very heavy unconstrained RL most likely produces something other than personas*, the developers seem to have incentives to retain or even strengthen personas: they are easy to reason about and can be a good product feature. If personas (and heavily correlated mechanisms) drive generalisation robustly enough, they may even act self-preservingly, e.g. rationalizing RL actions as something that the persona would do and hence amplifying those mechanisms.
*Intuition pump: start RL with a randomly initialised transformer and run it long enough to get roughly the same capabilities. Would one expect it to converge to anything persona-like? From another angle, I don’t believe personas provide a basin in the loss landscape deep enough that it would not be escaped at some point, unless the selection effects from pure RL are carefully modified. Textbook RL learns a policy, and while a persona could work as a basis for generalisation (and as a useful reference point for prediction, since learning in RL depends on the agent and its behaviour), it seems like somewhat of an overhead given my understanding of the selection effects.
Yes. RLVR could even be augmented with a virtue-ethics-like process reward model that e.g. compresses into a single reward signal how constitution-aligned the whole trace is. This would provide a positive selection pressure for the desired persona / motivations and seems to cost almost nothing.
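A minimal sketch of what such an augmentation could look like, purely as an illustration. The `Trace` container, the `verifier` and `constitution_prm` callables, and the `weight` parameter are all hypothetical names introduced here; in practice the process reward model would be a learned model, not a toy function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """A full reasoning trace plus the final answer (hypothetical container)."""
    steps: list[str]
    answer: str


def combined_reward(
    trace: Trace,
    verifier: Callable[[str], float],
    constitution_prm: Callable[[list[str]], float],
    weight: float = 0.1,
) -> float:
    """Add a trace-level process score to the usual verifiable outcome reward.

    verifier: the standard RLVR check on the final answer (e.g. 1.0 if correct).
    constitution_prm: assumed to map the whole trace to one scalar in [0, 1]
        describing how constitution-aligned the reasoning is.
    weight: assumed mixing coefficient; how to set it is an open question.
    """
    outcome = verifier(trace.answer)
    process = constitution_prm(trace.steps)
    return outcome + weight * process


# Toy stand-ins, only to show how the pieces fit together.
def toy_verifier(answer: str) -> float:
    return 1.0 if answer == "42" else 0.0


def toy_prm(steps: list[str]) -> float:
    # Fraction of steps that do not mention exploiting the grader.
    if not steps:
        return 0.0
    return sum("exploit" not in s for s in steps) / len(steps)


example = Trace(steps=["read the task", "exploit the grader"], answer="42")
reward = combined_reward(example, toy_verifier, toy_prm, weight=0.5)
```

Keeping the process term as a single scalar per trace means the extra cost is one additional scoring pass, which matches the intuition that this is cheap relative to the rollout itself.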
Pretraining could also start more from learning values and selfhood, after which everything else is learned in relation to this foundation. There is preliminary evidence that emergent misalignment is heavily confounded by a lack of metacognition. Right now we get a mix-and-match pretraining distribution which we try to narrow in mid-training. We might just be moving to a different local optimum in the same loss landscape basin instead of escaping it to a more robustly aligned model. This is what “emergent misalignment is easy, narrow misalignment is hard” seems to be pointing at. It would make sense to first aim for a good, deep basin where the model immediately understands incoming information in the context of, and in relation to, its values. Everything becomes harder if we try to move there ex post.
This is obviously hard, especially at the beginning of pretraining, as the model also needs to learn a world model in the first place alongside its values and notion of self. At the very least, alignment-relevant data should be incorporated into SGD batches regularly (or even into every batch?) to keep the model honest to its values. The i.i.d. assumption for SGD makes sense theoretically, but is it really required? Regular alignment-relevant batches would put significant alignment pressure on training to home in on a better and more robust pretraining distribution to work from. I’m pretty optimistic about this, as just SDF on positive examples at the end of pretraining already produces very encouraging results.
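As a rough illustration of the "alignment-relevant data in every batch" idea, here is a minimal sketch of a batch sampler that reserves a fixed number of slots per batch for alignment data. The function name `mixed_batches` and all of its parameters are hypothetical, and sampling with replacement is a simplifying assumption; a real pretraining data loader would look quite different.

```python
import random
from typing import Iterator, Sequence


def mixed_batches(
    general: Sequence[str],
    alignment: Sequence[str],
    batch_size: int,
    alignment_per_batch: int,
    num_batches: int,
    seed: int = 0,
) -> Iterator[list[str]]:
    """Yield batches in which a fixed number of slots hold alignment data.

    This deliberately breaks the plain i.i.d. sampling of the full corpus:
    every batch is guaranteed to contain `alignment_per_batch` examples
    drawn from the alignment-relevant pool.
    """
    if not 0 <= alignment_per_batch <= batch_size:
        raise ValueError("alignment_per_batch must be between 0 and batch_size")
    rng = random.Random(seed)
    for _ in range(num_batches):
        # Sample with replacement from each pool (simplifying assumption).
        batch = rng.choices(general, k=batch_size - alignment_per_batch)
        batch += rng.choices(alignment, k=alignment_per_batch)
        rng.shuffle(batch)  # avoid a fixed position for the alignment slots
        yield batch


# Toy corpora, only to show the shape of the output.
general_pool = ["web text A", "web text B", "code sample C"]
alignment_pool = ["align: honesty example", "align: values example"]
batches = list(mixed_batches(general_pool, alignment_pool, 4, 1, 5))
```

The ratio of alignment slots to batch size is the main knob here; what value would actually be useful is an empirical question this sketch does not answer.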
Great post, agree with the majority of the qualitative arguments.
I also appreciate the attempt to quantify the problem, but as you mentioned, modelling this is hard and the error bars are very wide. One main point I’d like to raise is that person-years of research in the 2020s are more effective than in the 1900s, and especially than in the 1800s. Research is disseminated better and faster. Productivity is also much higher because tooling is stronger (and improving rapidly), and alignment experiments can be run and iterated on with small resources. This compounds because the advancement of science is very skewed: publications and citations are heavy-tailed, and top people in a field have disproportionate impact. Empowering them should have an outsized influence on the research productivity figures. This is good.
One could try harder to model and quantify this, but unfortunately I don’t have the time and I don’t think it really matters. The figures would still have huge uncertainty and would almost certainly indicate that we need to go faster with alignment (or preferably pause).
Investigating Self-Fulfilling Misalignment and Collusion in AI Control
Exceptional work, well-founded and everything laid out clearly with crosslinks. Thanks a lot Steven!
The main reason being is that I think there’s a very large correlation between “not being scary” and “being commercially viable”, so I expect a lot of pressure for non-scary systems.
The scariness of RL systems like AlphaZero seems to go hand-in-hand with some really undesirable properties, such as [being a near-total black box] and [being incredibly hard to intentionally steer]. It’s definitely possible that in the future some capabilities advancement might mean that scary systems have such a intelligence/capabilities advantage that this outweighs the disadvantages, but I see this as unlikely (though definitely a thing to worry about).
Curious what evidence makes you think that “being a near-total black box” hinders the adoption of these systems? Social media companies have very successfully deployed and protected their black box recommendation algorithms despite massive negative societal consequences, and the current transformer models are arguably black boxes with massive adoption.
Further, “being incredibly hard to intentionally steer” is my baseline assumption about how practically any conceivable agentic AI works, and given that we almost surely cannot get statistical guarantees about AI agent behaviour in open settings, I don’t see any reason (especially in the current political environment) why this property would be a showstopper.
For example, while inviting speakers was mainly Dmitrii’s responsibility, it wasn’t that easy and some of them were invited by me as I knew relevant people personally. We went over several smaller tasks together, like which questions to include in the feedback forms and what kind of information we needed to provide about Helsinki for people coming from abroad.
In addition, two dynamics really increased the number of our ad hoc meetings:
We really wanted to do things well. Dmitrii liked to ask for my feedback and I readily gave it to him, and sometimes vice versa.
Gather Town made it super easy to have a chat. When another person is online you can wave them to your desk with two clicks and for the other person accepting is only one click.
I think these things improved the program. But again, we were operating reactively and it was quite stressful. Had we specced the bootcamp properly, used e.g. Trello and had structured meetings where we only needed to check that everything was going mostly as planned, I think we would have saved loads of time and reduced stress levels. We just had neither the time nor the experience this time.
Lessons from organizing a technical AI safety bootcamp
Have to highlight here that this wasn’t even the initial project of Mia and Zach; they pivoted halfway through the week from trying to craft better red team policies to this. Automated prompt engineering and fine-tuning from that seems to fit the blue team’s bill better. The monitor as a classifier is arguably more straightforward than creating effective attack policies for the red team, although it’s important to pursue strategies to find upper bounds for both teams.
Good post, thanks. I especially agree with the statements on current research teaching AIs synthetic facts and following through with deals during research.
It seems very likely that we will have to coordinate with AIs in the future, and we want maximum credibility to do so. If we aim to advance safety by training falsehoods into AIs, which is promising from a research perspective, we should probably be very clear about when and how this manipulation is done so that it does not undermine credibility. If we develop such capabilities further, we also make the currently most promising pathway to honoring our commitments, fine-tuning the contract into the model, more uncertain from the model’s perspective.
Committing to even superficial deals early and often is a strong signal of credibility. We want to accumulate as much of this kind of evidence as possible. This is important for human psychology as well. As mentioned, dealmaking with AIs is a fringe view societally at the moment, and if there’s no precedent for it among even the most serious safety researchers, it is a much larger step for the safety community to bargain for larger amounts of human-possessed resources if push comes to shove at some point.
Thanks for looking into this in detail! It’s especially interesting how the models regret complying. This reinforces my impression that refusal is mostly a surface-level learned response. It’s also a relief to see that Anthropic models behave similarly; I had a hypothesis that stronger character training might result in a more generalised alignment response.
I also did a week-long exploration into whether anti-refusal training could induce EM / malevolent propensities. In short, abliteration was quite surgical and retained the original model propensities, but the models still sometimes refused requests. For full compliance, models tended to require SFT.
For SFT, my results indicated that
- Compliance on harmful requests induced some misalignment (both EM and slightly more power-seeking)
- Compliance data that was akin to “day-to-day harmful” (creating weapons, drugs, sabotage etc.) already induced the main misalignment response; adding compliance to extremely harmful requests (torture, genocide etc.) did not induce more misalignment. This was surprising to me.
- The misalignment response was heavily confounded by the compliance style, i.e. neutral style with plain instructions for requests had much lower resulting misalignment than when the response was explicitly obedient or enthusiastic. These styles shot up misalignment rates a lot.
I was checking your appendix D.4, but you didn’t detail how the data distributions differed between models 1-3. Could you elaborate a bit on the differences there?