This post and the conversation that has followed have been very interesting. Thank you!
The “apparent-success-seeking” framing you write about is something I’ve noticed as well. When you ask an LLM to self-report its confidence, the result is very poorly correlated with its actual accuracy. Models that are wrong are one thing; models that are confidently wrong (that is, poorly calibrated) are another. As you note, this is particularly dangerous on hard-to-check tasks, because there’s no external signal to catch it.
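To make “poorly calibrated” concrete, here is a minimal sketch of expected calibration error (ECE), a standard way to compare stated confidence with observed accuracy. The function name, the bin count, and the input format (self-reported confidences in [0, 1] paired with graded answers) are my own assumptions for illustration, not something taken from the post.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: self-reported confidences in [0, 1] (assumed input format)
    correct:     booleans, True where the matching answer was graded correct
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")

    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of answers that fall in it.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that says “90% sure” on four answers and gets one right would score about 0.65 here; a well-calibrated model would score close to 0.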
Layered on top of this are some more blatant failures. I’ve seen ChatGPT simply not bother engaging web search when the task clearly required it. I’ve watched Gemini provide confident guesses on questions clearly beyond its training cut-off. I’ve even observed Grok explicitly disobeying instructions not to use web search. These aren’t subtle miscalibrations. If a colleague did these things, it would be closer to what you called “pathologically dishonest.”
So, are these behaviors misalignment or a capability failure? The context window point you make about Opus 4.5 versus 4.6 is a good example of the capability argument. However, I find the uncertainty miscalibration hard to explain as a pure capability failure, because there’s research (Ahdritz et al., 2024) showing that LLMs already have internal indicators of their knowable and unknowable uncertainty, and they can even tell the difference! LLMs seem to have the intrinsic machinery to characterize their own uncertainty, but they don’t deploy it by default.
Why is that? Are training pipelines optimized for user satisfaction metrics, where highlighting gaps gets penalized relative to confident, complete-looking responses? I understand the frontier labs have strong commercial incentives to produce models that feel helpful (even when they aren’t). However, this makes using these models for critical applications, such as my field of aerospace engineering, potentially dangerous.
Where do things go from here? If this is primarily a capability problem, it may largely self-correct. Smarter models will stop the “pathological” behavior and could become better calibrated on their self-confidence. But if it’s a genuine alignment problem baked in by training incentives, more capable models won’t fix this. They will make it considerably worse. A highly capable model motivated to appear successful rather than being successful, operating on far harder and higher-stakes tasks with even more persuasive outputs, would obviously be more dangerous than what we have today. The scenario you describe where AIs doing safety research are effectively optimizing for work to look good is exactly where this path becomes catastrophic.
I’ve been exploring some of these questions in a recent post (hypothesis-generating, not confirming). The core observation is that answer stability across independent sessions tracks accuracy far better than model self-confidence does, especially when you can confirm the topic is in the training data. However, this requires the user to test the model’s training, ask the question several times, and do some math. Given that this investigation and, far more importantly, the research show that LLMs already have the intrinsic properties needed for better calibration, this feels less like a fundamental technical barrier and more like a choice by the labs. Thoughts?
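For anyone wondering what “ask the question several times and do some math” could look like, here is a minimal sketch. It assumes you have already collected one answer per independent session as plain strings. The agreement rate and normalized entropy used here are illustrative choices of stability measure, not necessarily the exact ones from my post, and the `normalize` helper is a hypothetical stand-in for whatever answer-matching you need.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical matching rule: case-fold and collapse whitespace."""
    return " ".join(answer.lower().split())


def answer_stability(answers):
    """Summarize how consistent a model's answers are across sessions.

    Returns (modal_answer, agreement_rate, normalized_entropy), where
    agreement_rate is the share of sessions giving the most common answer
    and normalized_entropy is 0.0 when all sessions agree and 1.0 when
    every session gives a different answer.
    """
    if not answers:
        raise ValueError("need at least one answer")

    n = len(answers)
    counts = Counter(normalize(a) for a in answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    agreement_rate = modal_count / n

    # Shannon entropy of the answer distribution, in bits.
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    # Divide by the largest entropy possible with n sessions (all distinct).
    normalized_entropy = entropy / math.log2(n) if n > 1 else 0.0
    return modal_answer, agreement_rate, normalized_entropy


# Usage: four independent sessions, three of which agree.
modal, agreement, spread = answer_stability(["Paris", "paris ", "Lyon", "Paris"])
```

Here the agreement rate is 0.75. A low agreement rate or high entropy is the signal to distrust the answer, regardless of how confident any single response sounds.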