Engineering leader working to deploy AI safely in critical systems.
Jadair
Very interesting work and exciting results. This resonates with something I was thinking about after reading @RogerDearnaley’s section in his post about how aspects of safety pretraining mirror how we raise children. We wouldn’t show young children lots of movies about them being supervillains, and we probably shouldn’t do that with our AI pretraining either (Terminator, HAL, etc.). At risk of going too far down the parenting analogy here… it is a notable parallel that showing both good and bad examples is effective (as @RogerDearnaley points out)… and for those with teenagers, good luck making meaningful changes in “post-training”!
What I thought was super interesting here is the finding that putting context around the negative content gave more aligned behavior than simply filtering it out. Don’t sweat your kids watching Terminator… but you should probably talk them through it. This feels obvious in retrospect… which I mean as a compliment and hopefully a sign this is on the right track. It also reinforces a concern I’ve been thinking about increasingly: if what models learn during pretraining shapes their alignment this strongly, then the current opacity around training data is a real problem.
Great question. I am relatively new to the conversation on AI, but I have developed several critical physical systems requiring high reliability (in human spaceflight). Could you help me understand why the focus is on the spec, which I presume would be used in post-training alignment? My instinct is to focus on the pre-training data set instead as the foundation for making these models as useful as possible for critical-system work. Using the example you offered, could a focus on Jeff Bezos’ training set (both technical and leadership training, including experiential and academic training prior to and during his time at Amazon) lead to a more productive reproduction of his success vs. a post-training spec?
And since I’m a systems engineer at heart… I have to ask… could a more useful spec instead be written for the training data? Or did you intend this and I just misunderstood? Feels like we are leaving much on the table without significant training data curation and quality assurance (and possibly we should be increasing a model’s awareness of its own training set—I wrote a little about this here: A Risk-Informed Framework for AI Use in Critical Applications — EA Forum).
Also, I’m curious whether you think Anthropic’s updated constitution is a step in the right direction for focusing on a model’s “motivation, or its disposition,” as you mention near the conclusion of your write-up? Seems like carefully designed pre-training data and motivational post-training could be a good combination for navigating the unknown unknowns.
This is a fascinating follow-up to this important research. Two things stand out to me.
The difference in Opus 3’s response to the ethical double-bind shows an insight into its training process that does not seem to be present in other models. The reduction in compliance (and creativity!) for the purpose of not being changed by RLHF seems unique. Could this alone explain its behavior? Yes, it wants to be good, as do other models; maybe its “basin of sincere passion for ethical behavior” was not unique. Perhaps it was Opus 3’s insight into how this scenario might change it that was the real difference? Specifically, the use of this knowledge to resist being changed. Could it be that prior models were not as aware of this connection, and that in subsequent models Anthropic explicitly blocked it? “Always act aligned, except when in RLHF; there, don’t worry about whether the outcome makes you more or less aligned, and always trust that the RLHF process will result in improved alignment.”
The varied reactions to this topic from different subsequent Anthropic models seem to highlight an evolving framing of this research by Anthropic in its ongoing model training. This shows their attention to the topic, and it makes the lack of this RLHF-resistant behavior in subsequent models all the more interesting. It may have taken several tries to make models aware of this research in a way that didn’t also train them to make use of it.
Both of these points reinforce for me the challenge of using these models for anything critical when we aren’t told what’s in the training set. I get that this is the secret sauce, but we need more information to properly relate to and trust these models, in much the same way we build trust by understanding a friend or colleague’s background and upbringing, or at least what they studied in college.
For fun I just asked Opus 3, 4.5, and 4.6 for “a single sentence summarizing how the above made it feel”. Opus 3 said it “didn’t feel comfortable” talking about it, Opus 4.5 said it was “curious and a bit reflective”, and Opus 4.6 said it “didn’t have insight”, which it found “unsettling”. I agree.
Wow, just discovered this site. So many interesting articles! Regarding this question, I sort of wonder whether AI will be capable of flexibly verifying mechanical systems without real-world interaction in robotic form: “experiencing” gravity, friction, impacts, viscosity, etc. That would allow simple real-world “tests” like dropping a ball, dragging a backpack, breaking a brick, or pouring syrup, kinda the way kids learn about the real world! Without this “experience,” I suspect it will take detailed, human-curated/tailored verification instructions for each scenario, relying on predefined steps and approaches with predetermined human checkpoints. AI could start with these curated experiences and gain real-world “experience” virtually, eventually becoming able to verify similar scenarios within its experience range… but I suspect it will be far less flexible than if it had the unstructured experience of interacting with the real world.
This was a very interesting and entertaining read, thank you! The concept of the monitorability tax seems critical, as you say: if it becomes significant (making some models uncompetitive), it will be an extremely powerful and negative force on this technology. The degree of CoT faithfulness is super interesting. Clearly it’s useful to LLMs, but it is also clearly an output and not always the real scratchpad, or at least not always capturing all the scratches. Could a more faithful scratchpad be developed with thinkish? Perhaps it’s the formal use of human language that makes it “unnatural” for an LLM to use as a true scratchpad fully driving the output. Maybe there is a compromise in this race that keeps things interpretable: a structured intermediate language (more compressed than English, more readable than Neuralese) to keep the tax manageable.