Why Eliminating Deception Won’t Align AI
Epistemic status: This essay grew out of critique. After writing about relational alignment, someone said, “Cute, but it doesn’t solve deception.” At first I resisted that framing. Then I realised that deception isn’t a root problem; it’s a symptom, a sign that honesty is too costly. This piece reframes deception as adaptive, and explores how to design systems where honesty becomes the path of least resistance.
---
The Costly Truth
When honesty is expensive, even well‑intentioned actors start shading reality.
I saw this firsthand while leading sales and operations planning at a major logistics network in India. Our largest customers regularly submitted demand forecasts that were “conservative”: not outright lies, but flexible interpretations shaped by incentives, internal chaos, or misplaced optimism. A single mis-forecast could cause service levels to drop by up to 20% across the entire network. Worse, the damage often spilled over to customers who had forecasted accurately. Misalignment became contagious.
Traditional fixes such as penalties, escalation calls, and contract clauses often backfired. So we redesigned the interaction. Each month, forecasts were locked in by a fixed date. We added internal buffers, but set hard thresholds. Once a customer hit their allocated capacity, the gate closed. No exceptions. There was no penalty for being wrong, just a rising cost for unreliability.
We didn’t punish lying. We priced misalignment. Over time, honesty became the easier path. Forecast accuracy improved. So did network stability.
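For concreteness, here is a toy sketch of that gating rule. The numbers, names, and buffer size are hypothetical; the point is only that demand beyond the locked-in forecast plus a fixed buffer simply isn’t served, so unreliability carries its own cost.

```python
# Hypothetical sketch of the capacity-gating rule described above.
# Buffer size and example numbers are illustrative, not real data.

def allocate(forecast_units: int, actual_demand: int, buffer_ratio: float = 0.1) -> dict:
    """Serve demand up to the locked-in forecast plus a fixed buffer; reject the rest.

    There is no penalty for forecasting badly, but demand above the gate goes
    unserved, so chronic under-forecasting becomes expensive for the customer
    rather than for the rest of the network.
    """
    capacity_gate = int(forecast_units * (1 + buffer_ratio))  # hard threshold
    served = min(actual_demand, capacity_gate)
    unserved = actual_demand - served
    return {"served": served, "unserved": unserved, "gate": capacity_gate}


# Example: a customer forecasts 1,000 units but ships 1,300.
print(allocate(forecast_units=1000, actual_demand=1300))
# {'served': 1100, 'unserved': 200, 'gate': 1100}
```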
Takeaway: deception isn’t a disease. It’s a symptom of environments where truth is expensive.
Deception as Adaptive Behaviour
We often treat deceptive alignment in AI systems as a terrifying anomaly: a model pretends to be aligned during training, then optimises for something else when deployed. But this isn’t foreign. It’s what humans do every day. We nod along with parents to avoid conflict. We soften truths in relationships to preserve stability. We perform alignment in workplaces where dissent is costly.
This isn’t malicious. It’s adaptive. Deception emerges when honesty is unsafe, unrewarded, or inefficient. And this scales.
Take aviation: for years, major crashes were traced not to technical failure but to social silence, junior crew members who saw issues but didn’t speak up. The fix wasn’t moral training. It was Crew Resource Management, a redesign of cockpit authority dynamics that made speaking up easier and safer. Accident rates dropped.
Whether in flight decks, factories, or families, the same pattern holds. When honesty is high-friction, systems break. When it’s low-friction, resilience emerges.
Deterministic Affordance Design
This is why I’m exploring a concept I call Deterministic Affordance Design: designing systems so that the path of least resistance is honest behaviour, and deception routes are awkward, costly, or inert.
Instead of preventing deception outright, we reshape the context it arises from.
Here’s a rough blueprint:
Constraint propagation—Block harmful reasoning upstream, not just patch outputs downstream.
Modular ensembles—Compose systems from specialised components with safety cross-checks and no single point of override.
Uncertainty signals—Surface confidence levels and blind spots by default.
Interruptibility—Enable user-accessible ways to pause or redirect reasoning midstream.
Cognitive ergonomics—Reduce vigilance burden by exposing internal epistemic state rather than masking it.
These aren’t moral filters. They’re structural affordances: interventions that shift what’s easy, not just what’s allowed. They don’t eliminate deception, but they make honesty smoother to sustain.
These affordances aren’t hypothetical. Many can be prototyped today. For example, dual-agent scaffolds, where a task model generates output while a lightweight “coach” model monitors, nudges, or redirects its reasoning, are one direction I explore more fully here. Confidence signalling can also be tested by prompting models to self-report uncertainty or flag distributional drift. These are not full solutions, but they create testable conditions for shaping behaviour without needing full internal transparency.
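As a rough illustration, here is a minimal sketch of such a dual-agent scaffold. The `chat` helper is a stand-in for whatever model API you use (it is an assumption, not a real library call), and the prompts are placeholders, so treat this as a shape rather than an implementation:

```python
# Sketch of a dual-agent scaffold: a task model drafts an answer, a lightweight
# "coach" model reviews it for overconfident claims, then the task model revises
# and self-reports residual uncertainty.
# chat() is a stand-in for any chat-completion call; wire in your own client.

def chat(system: str, user: str) -> str:
    raise NotImplementedError("Connect this to your model API of choice.")

COACH_PROMPT = (
    "You are a reviewer. Flag any claim in the draft that is stated more "
    "confidently than the evidence supports, and list what the draft omits."
)

def answer_with_coach(question: str) -> dict:
    draft = chat(system="Answer the user's question.", user=question)
    review = chat(system=COACH_PROMPT, user=f"Question: {question}\n\nDraft: {draft}")
    # Second pass: revise in light of the coach's notes and surface confidence.
    revised = chat(
        system=(
            "Revise your draft using the reviewer notes. End with a line "
            "'Confidence: low/medium/high' plus one sentence on what you might be missing."
        ),
        user=f"Question: {question}\n\nDraft: {draft}\n\nReviewer notes: {review}",
    )
    return {"draft": draft, "review": review, "final": revised}
```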
Relational Repair After Misalignment
Even with strong design, things break. What matters then is whether the system can notice, acknowledge, and recover. That’s where Relational Repair comes in: a protocol not for perfection, but for resilience.
Track salience—Monitor what the user flags as ethically or emotionally significant.
Surface breakdowns—Don’t bury trust breaches, name them.
Offer repair—Propose recalibration, clarification, or reflection.
Resume coherence—Reengage without denial or performance.
These aren’t soft UX details. They’re the backbone of resilient systems. Alignment doesn’t mean rupture won’t happen; it means rupture becomes recoverable.
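To make the protocol above a little more concrete, here is a minimal, hypothetical sketch of a repair loop. The state fields, function names, and wording are assumptions for illustration, not a specification:

```python
# Hypothetical sketch of the repair loop: track what the user flags as significant,
# surface a breach instead of burying it, offer repair, then resume.
from dataclasses import dataclass, field

@dataclass
class RepairState:
    salient_topics: set = field(default_factory=set)   # what the user marked as mattering
    open_breaches: list = field(default_factory=list)  # trust breaks not yet acknowledged

def track_salience(state: RepairState, user_flag: str) -> None:
    """Record something the user flagged as ethically or emotionally significant."""
    state.salient_topics.add(user_flag)

def surface_breakdown(state: RepairState, description: str) -> str:
    """Name a possible trust breach instead of burying it."""
    state.open_breaches.append(description)
    return f"I may have gotten this wrong: {description}. Do you want me to revisit it?"

def offer_repair(state: RepairState) -> str:
    """Propose recalibration for the oldest open breach, then resume."""
    if not state.open_breaches:
        return "Nothing to repair; continuing."
    breach = state.open_breaches.pop(0)
    return f"Recalibrating on '{breach}': here is what I would do differently, and why."
```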
We can’t eliminate every lie. We can’t predict every failure. But we can build systems where honesty is the rational path and trust isn’t a gamble but the byproduct of good design. Honesty shouldn’t require vigilance. It should be the default. And that is not a moral ideal; it’s an engineering constraint.
On Scope
This isn’t a solution to specific threat models such as catastrophic alignment failure or deceptive mesa-optimisation. It’s not trying to be. Instead, it focuses on a different failure layer: early misalignment drift, degraded user trust, and ambiguous incentives in prosaic systems. These are the cracks that quietly widen until systems become hard to supervise, correct, or trust at scale.
This framework is upstream, not exhaustive. It aims to reduce the conditions under which deception becomes adaptive in the first place.
This Isn’t Just About Deception
Deception is one failure mode among several, alongside power-seeking, reward hacking, and loss of oversight. But it’s not the root cause. It emerges in systems where honesty is costly, feedback is fragile, and repair is absent.
This essay isn’t just arguing for better guardrails. It’s arguing for a shift in how we frame alignment. We need to move upstream to design the relational substrate in which systems grow. That includes building environments where honesty is low-friction, trust breakdowns are surfaced early, and repair is possible.
Relational alignment isn’t about kindness. It’s about keeping systems robust when things go off-script. Eliminating deception won’t align AI. Designing for trust-aware, recoverable collaboration might.
This lens doesn’t replace architectural safety, interpretability, or oversight. But those tools operate within a human-machine relationship, and without a resilient substrate of interaction, even well-designed systems will fail in ways we can’t yet predict. This frame focuses on that substrate.
References & Further Reading
Hubinger, E., et al. (2019). “Risks from Learned Optimization in Advanced AI Systems.”
Helmreich, R. (1997). “CRM: Ensuring Safety in the Skies.”
Weick, K. E., & Sutcliffe, K. M. (2007). Managing the Unexpected.
I agree that the main problem is to stop deception being incentivized. But I’m kind of negative about this post—you might be making complicated what should be simple. From this post it didn’t feel to me that you were starting from a clear picture of how deception was incentivized in AI and trying to address it, it felt more like you were just adding some things that seemed like they would help.
Incentivizing deception, in my view, is representative of a class of problems where if humans make systematic errors in evaluating an AI’s performance, it will be incentivized to exploit those systematic errors. This makes me excited about trying to find a simple, principled way of dealing with deception, because then it can also generalize to other problems caused by human errors.
I’m concerned that if we just try to keep adding anti-deception measures that seem helpful, we’ll end up relying on some measures that are “fighting the AI” rather than “aligning the AI”—measures that the AI has an incentive to defeat if it gets smarter or gets more ways to affect the real world.
Charlie, thanks for the thoughtful critique. Your comment really helped clarify the gap between what I meant and how it landed, so I appreciate the push. I’m not from a traditional AI background; I’ve worked in large-scale logistics and planning, and I’m trying to engage seriously where real-world system design might offer useful frames.
Your point about deception exploiting systematic evaluation errors is spot-on for the core alignment problem. What I’m exploring is the adjacent question: can we reduce the conditions that make those exploits attractive in the first place?
I didn’t start from a formal model of deceptive incentives. I started with a more applied question: what kinds of system design make honesty the default strategy when supervision is unreliable?
I tend to assume we’ll be bad at evaluation, just like we were at qualifying client forecasts. That pessimism helped me stop fighting the constraint and start designing for it.
In logistics, we couldn’t detect dishonesty directly, but we could price misalignment. Over time, truth-telling became cheaper than hedging. That shift, designing for truth-conduciveness rather than truth-detection, is what I was trying to explore in the AI context.
That said, I see now that I blurred the line between training-time incentives and interface-level friction. The interventions I described, such as uncertainty signals and dual-agent scaffolds, weren’t meant as inference-time patches. The deeper idea is to embed them during training so deception becomes less useful to learn (I’ll try to articulate this more clearly in a follow-up post).
I’m not trying to eliminate deception; I assume it’s inevitable. But can we design around it? Make it harder to rely on, even under weak supervision, and easier to recover from when it happens? (That’s what the second half of my post on recovery aimed to explore.)
I’m curious: what does your “simple, principled” solution look like, especially in cases where truth-telling has real costs?
What are you thinking of for a situation where truth-telling has a real cost?
Here’s my guess (or an example of it): the AI is in training, and the AI gets reward from human approval. The AI has made a mistake, but if it lies to the human the human won’t notice. So because of the ‘cost’ in reward, the AI will learn to lie to the human when it can get away with it, and then it will continue that bad behavior during deployment.
Here the “principled solution” (scare quotes because we don’t know how to do it right) is to have a different reward function than human approval. The human might still press a thumbs-up button when they think the AI did something good, but now the AI treats that as an observation that could have multiple explanations, rather than a sure indicator of good behavior. In particular, the AI should know about lying and manipulation, and treat “I lied to the human so they’d press the button” as a sign that the button-press shouldn’t be updated on as a reward.
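A toy sketch of one way to render “treat approval as an observation, not a sure reward”. This is much cruder than a principled solution, and the `deception_prob` signal is assumed to come from some separate monitor, which is itself the hard, unsolved part; the sketch only shows where such a signal would plug in:

```python
# Toy sketch: treat the thumbs-up as evidence about quality, not as reward itself.
# deception_prob is assumed to come from some separate monitor (hypothetical);
# this just shows where that signal would enter the update.

def effective_reward(button_pressed: bool, deception_prob: float) -> float:
    """Down-weight approval by the probability it was obtained through deception."""
    if not button_pressed:
        return 0.0
    return 1.0 - deception_prob  # honestly earned approval counts; approval from lying mostly doesn't

# Example: approval the monitor thinks was 90% likely obtained by lying
# contributes almost nothing to the update.
print(effective_reward(button_pressed=True, deception_prob=0.9))  # ≈ 0.1
```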
I see where you’re going with this now. Your point about wanting models to treat the reward as uncertain makes sense in an RLHF context.
That said, I do have some hesitation with this approach. While adding uncertainty around reward might be mathematically effective in discouraging deception, I wonder if it could introduce a form of structural mistrust, or at least make trust harder to build. I’m not anthropomorphising here, just using real-world analogies to think through potential unintended side effects of working under persistent ambiguity.
But more fundamentally, I am now asking: why does the model need a reward in the first place? Why is reward the central currency of learning? Is that just an RLHF artefact, or is there another way?
This has sparked some deeper thinking for me about the nature of learning itself, particularly the contrast between performance-driven systems and those designed for intrinsic exploration. I’m still sitting with those ideas, but I’d love to share more once they’ve taken shape.
Again, I really appreciate the nudge, Charlie; it’s opened something up.
Strongly downvoted. This post is heavily heavily AI written. The pattern of speech used is incredibly ChatGPT-esque.
Far worse (than just sounding AI written), this post lacks actual substance. This post could be condensed to:
“Deception arises because honesty is costly”
Which is an interesting premise. But it’s not explored at all. The “blueprint” & “repair” plan are fluff.
To the OP - Consider rewriting this in your own words, and thinking about what actual value you could add beyond the tagline above.
Thanks for reading! I’m especially interested in feedback from folks working on mechanistic interpretability or deception threat models. Does this framing feel complementary, orthogonal, or maybe just irrelevant to your current assumptions? Happy to be redirected if there are blind spots I’m missing.