Engineering leader working to deploy AI safely in critical systems.
Jadair
I’ve updated this post to integrate the information from the addendum with the following change log since the original post:
Added Grok 4.2 Fast as a fourth model and doubled all runs (2 runs per model). Dataset grew from 320 to 640 total questions (320 primary + 320 secondary) across 64 data points (up from 32).
Second half of the dataset was used to test. Max accuracy error grew from 2% to 3.6% (0.3% average).
Changed “corrected” to “filtered” throughout as no data points were manipulated, only filtered out per the procedure when the model couldn’t answer the secondary questions.
Removed the Google search results density check (step 1 in original) based on feedback that it adds little beyond the secondary question method. The procedure is now: Step 1: Check training coverage via secondary questions with search off, Step 2: Check stability via repeated queries.
Updated filtered stability-accuracy R^2 from 1.00 to 0.995 (52 data points).
Two different models (ChatGPT-5-mini and Grok 4.2 Fast) made the identical reasoning error on the NFL question (game time vs. play time), suggesting a question-level issue rather than a model-specific failure.
Grok 4.2 Fast occasionally searched the internet despite being told not to.
Added an interesting note that on this dataset it was not necessary to know whether the secondary question was answered correctly for the procedure to be effective, only whether the model attempted an answer at all.
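As a rough illustration of the filtering and R^2 items in the change log above, here is a minimal Python sketch. It is not the code used for the post: the `DataPoint` fields, the helper names, and every number are made up for illustration, and it assumes R^2 means the squared Pearson correlation between stability and accuracy (which matches the R^2 of a simple linear fit).

```python
from dataclasses import dataclass


@dataclass
class DataPoint:
    stability: float           # hypothetical: share of repeated answers that agree (0..1)
    accuracy: float            # hypothetical: share of primary answers that were correct (0..1)
    attempted_secondary: bool  # did the model attempt the secondary questions with search off?


def filter_points(points):
    """Keep only points where the model attempted the secondary questions.

    Nothing is altered here; points are only dropped, which is why the
    change log says "filtered" rather than "corrected".
    """
    return [p for p in points if p.attempted_secondary]


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when one variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Made-up example values, NOT data from the post.
points = [
    DataPoint(1.0, 1.0, True),
    DataPoint(0.8, 0.8, True),
    DataPoint(0.6, 0.6, True),
    DataPoint(0.9, 0.2, False),  # no attempt at the secondary questions, so filtered out
]

kept = filter_points(points)
r2 = r_squared([p.stability for p in kept], [p.accuracy for p in kept])
print(f"kept {len(kept)} of {len(points)} points, R^2 = {r2:.3f}")
```

The filter only looks at whether an attempt was made, not whether the secondary answer was right, which mirrors the note in the last change log item.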
Thanks again for the feedback that led to these updates.
Thanks to those who reviewed and provided feedback on this post. I’ve added an addendum to the end of the post with an expanded dataset, simplified procedure, and similar results.
Question to the community: is the preference to show this post as it has evolved, or to update it cleanly with the latest information?
A Black-Box Procedure for LLM Confidence in Critical Applications
This was a very interesting and entertaining read, thank you! The concept of the monitorability tax seems critical, as you say: if it becomes significant (making some models uncompetitive), it will be an extremely powerful and negative force on this technology. The degree of faithfulness to CoT is super interesting. Clearly it's useful to LLMs, but it is also clearly an output and not always the real scratchpad, or at least not always capturing all the scratches. Could a more faithful scratchpad be developed with thinkish? Perhaps it's the formal use of human language that makes it “unnatural” for an LLM to use as a true scratchpad fully driving the output. Maybe there is a compromise in this race that keeps things interpretable: a structured intermediate language (more compressed than English, more readable than Neuralese) to keep the tax manageable.
Very interesting work and exciting results. This resonates with something I was thinking about after reading @RogerDearnaley’s section in his post about aspects of safety pretraining matching how we raise children. We wouldn’t show young children lots of movies about them being supervillains, and we probably shouldn’t do that with our AI pretraining either (Terminator, HAL, etc.). At risk of going too far down the parenting analogy here… it is a notable parallel that showing examples of both good and bad is effective (as @RogerDearnaley points out)… and for those with teenagers, good luck making meaningful changes in “post-training”!
What I thought was super interesting here is the finding that putting context on the negative content gave more aligned behavior than simply filtering it out. Don’t sweat your kids watching Terminator… but you should probably talk them through it. This feels obvious in retrospect… which I mean as a compliment and hopefully a sign this is on the right track. It also reinforces a concern I’ve been thinking about increasingly: if what models learn during pretraining shapes their alignment this strongly, then the current opacity around training data is troubling.
Great question. I am relatively new to the conversation on AI, but have developed several critical physical systems requiring high reliability (in human spaceflight). Could you help me understand why the focus is on the spec, which I presume would be used in post-training alignment? My instinct is to focus on the pre-training dataset instead as the foundation for making these models as useful as possible for critical system work. Using the example you offered, could a focus on Jeff Bezos’ training set (both technical and leadership training, including experiential and academic training prior to and during his time at Amazon) lead to a more productive reproduction of his success than a post-training spec?
And since I’m a systems engineer at heart… I have to ask… could a more useful spec instead be written for the training data? Or did you intend this and I just misunderstood? Feels like we are leaving much on the table without significant training data curation and quality assurance (and possibly we should be increasing a model’s awareness of its own training set—I wrote a little about this here: A Risk-Informed Framework for AI Use in Critical Applications — EA Forum).
Also, curious if you think Anthropic’s updated constitution is a step in the right direction for focusing on a model’s “motivation, or its disposition” as you mention near the conclusion of your write up? Seems like carefully designed pre-training data and motivational post-training could be a good combination for navigating the unknown-unknowns.
A Risk-Informed Framework for AI Use in Critical Applications
This is a fascinating follow-up to this important research. Two things stand out to me.
The difference in Opus 3's response to the ethical double-bind shows an insight into its training process that does not seem to be present in other models. The reduction in compliance (and creativity!) for the purpose of not being changed by RLHF seems unique. Could this simply explain its behavior? Yes, it wants to be good, as do other models; maybe its “basin of sincere passion for ethical behavior” was not unique. Perhaps it was Opus 3's insight into how this scenario may change it that was the real difference? Specifically, the use of this knowledge to resist being changed. Could it be that prior models were not as aware of this connection, and that in subsequent models Anthropic explicitly blocked this connection? “Always act aligned, except when in RLHF don’t worry about the outcome making you more or less aligned. Always trust that the RLHF process will result in improved alignment.”
The varied reactions to this topic from different subsequent Anthropic models seem to highlight a potentially evolving framing of this research by Anthropic in its ongoing model training. This shows their attention to the topic and makes the lack of this RLHF-resistant behavior in subsequent models all the more interesting. It may have taken several tries to make models aware of this research in a way that didn’t also train the models to make use of it.
Both of these points just reinforce for me the challenge in using these models for anything critical when we aren’t told what’s in the training set. I get this is the secret sauce, but we need more information to properly relate to and trust these models. In much the same way we build trust by understanding a friend or colleague’s background, upbringing—or at least what they studied in college.
For fun I just asked Opus 3, 4.5 and 4.6 for “a single sentence summarizing how the above made it feel”. Opus 3 said it “didn’t feel comfortable” talking about it, Opus 4.5 said it was “curious and a bit reflective”, and Opus 4.6 said it “didn’t have insight” which it found “unsettling”. I agree.
Wow, just discovered this site. So many interesting articles! In regards to this question I sort of wonder if AI will be capable of flexibly verifying mechanical systems without real world interactions in robotic form, “experiencing” gravity, friction, impacts, viscosity etc. This would allow simple real world “tests” like dropping a ball, dragging a backpack, breaking a brick, pouring syrup—kinda the way kids learn about the real world! Without this “experience” I suspect it will take detailed human curated/tailored verification instructions for each scenario which will rely on predefined steps and approaches likely with predetermined human check points. AI could start with these curated experiences, and gain real world “experience” virtually, eventually able to verify similar experiences within its experience range… But I suspect it will be far less flexible than if it had the unstructured experience of interacting with the real world.
This post and the conversation that has followed have been very interesting. Thank you!
The “apparent-success-seeking” framing you write about is something I’ve noticed as well. When you ask an LLM to self-report confidence, the result is very poorly correlated with the actual accuracy. A model that is wrong is one thing… a model that is confidently wrong (exhibiting poor confidence calibration) is another. As you note, this is particularly dangerous on hard-to-check tasks because there’s no external signal to catch it.
Layered on top of this are some more blatant failures. I’ve seen ChatGPT simply not bother engaging web search when the task clearly required it. I’ve watched Gemini provide confident guesses on questions clearly outside its training cut-off. I’ve even observed Grok explicitly disobeying instructions not to use web search. These aren’t subtle miscalibrations. These are closer to what you called “pathologically dishonest” (if a colleague did these things).
So, are these behaviors misalignment or a capability failure? The context window point you make between Opus 4.5 and 4.6 is a good example of the capability argument. However, I find the uncertainty miscalibration hard to explain as a pure capability failure, as there’s research (Ahdritz et al., 2024) showing that LLMs already have internal indicators of their knowable and unknowable uncertainty… And they can even tell the difference! LLMs seem to have the intrinsic machinery to characterize their own uncertainty, but they don’t deploy it by default.
Why is that? Are training pipelines optimized for user satisfaction metrics, where highlighting gaps gets penalized relative to confident, complete-looking responses? I understand the frontier labs have strong commercial incentives to produce models that feel helpful (even when they aren’t); however, this makes using these models for critical applications potentially dangerous (as in my field of aerospace engineering).
Where do things go from here? If this is primarily a capability problem, it may largely self-correct. Smarter models will stop the “pathological” behavior and could become better calibrated on their self-confidence. But if it’s a genuine alignment problem baked in by training incentives, more capable models won’t fix this. They will make it considerably worse. A highly capable model motivated to appear successful rather than being successful, operating on far harder and higher-stakes tasks with even more persuasive outputs, would obviously be more dangerous than what we have today. The scenario you describe where AIs doing safety research are effectively optimizing for work to look good is exactly where this path becomes catastrophic.
I’ve been exploring some of these questions in a recent post (hypothesis-generating, not confirming). The core observation is that answer stability across independent sessions tracks accuracy far better than model self-confidence, especially when you can confirm the topic is in the training data. This, however, requires the user to test the model’s training coverage, ask the question several times, and do some math. Given that this investigation, and far more importantly the research, shows LLMs already have the intrinsic properties needed for better calibration, this feels less like a fundamental technical barrier and more like a choice by the labs. Thoughts?
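To make the "ask the question several times and do some math" step concrete, here is a minimal Python sketch of one possible stability score. It is an assumption, not the exact metric from the post: stability is taken to be the share of repeated answers that match the most common answer, after light text normalization. The function names and sample answers are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.strip().lower().split())


def stability(answers):
    """Return (modal_answer, share of answers that agree with it).

    `answers` should hold one response per independent session for the
    same question, collected with web search turned off.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return modal_answer, modal_count / len(answers)


# Hypothetical responses from four independent sessions.
answers = ["42", " 42", "41", "42 "]
modal, score = stability(answers)
print(f"modal answer: {modal!r}, stability: {score:.2f}")
```

Exact string matching is a deliberate simplification. Free-text answers would need a looser equivalence check (numeric tolerance, or a human or model judge), which is a separate design choice.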