Charlie, thanks for the thoughtful critique. Your comment clarified the gap between what I meant and how it landed, and I appreciate the push. I’m not from a traditional AI background; my experience is in large-scale logistics and planning. But I’m trying to engage seriously where real-world system design might offer useful frames.
Your point about deception exploiting systematic evaluation errors is spot-on for the core alignment problem. What I’m exploring is the adjacent question: can we reduce the conditions that make those exploits attractive in the first place?
I didn’t start from a formal model of deceptive incentives. I started with a more applied question: what kinds of system design make honesty the default strategy when supervision is unreliable?
I tend to assume we’ll be bad at evaluation, just like we were at qualifying client forecasts. That pessimism helped me stop fighting the constraint and start designing for it.
In logistics, we couldn’t detect dishonesty directly, but we could price misalignment. Over time, truth-telling became cheaper than hedging. That shift, designing for truth-conduciveness rather than truth-detection, is what I was trying to explore in the AI context.
That said, I see now that I blurred the line between training-time incentives and interface-level friction. The interventions I described, such as uncertainty signals and dual-agent scaffolds, weren’t meant as inference-time patches. The deeper idea is to embed them during training so that deception becomes less useful to learn (I’ll try to articulate this more clearly in a follow-up post).
I’m not trying to eliminate deception; I assume it’s inevitable. But can we design around it? Make it harder to rely on, even under weak supervision, and easier to recover from when it happens? (That’s what the second half of my post, on recovery, aimed to explore.)
Curious: what does your “simple, principled” solution look like, especially in cases where truth-telling has real costs?
What are you thinking of for a situation where truth-telling has a real cost?
Here’s my guess (or an example of it): the AI is in training and gets reward from human approval. It has made a mistake, but if it lies to the human, the human won’t notice. Because telling the truth carries a ‘cost’ in reward, the AI will learn to lie whenever it can get away with it, and it will carry that bad behavior over into deployment.
Here the “principled solution” (scare quotes because we don’t know how to do it right) is to have a different reward function than human approval. The human might still press a thumbs-up button when they think the AI did something good, but now the AI treats that as an observation that could have multiple explanations, rather than a sure indicator of good behavior. In particular, the AI should know about lying and manipulation, and treat “I lied to the human so they’d press the button” as a sign that the button-press shouldn’t be updated on as a reward.
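To make that concrete, here’s a toy numerical sketch of the flavor of thing I mean (my own illustration; the function name, probabilities, and deception prior are all made up): the button-press is treated as evidence about competing explanations, and the learning signal is the posterior that the work was genuinely good, not the raw press.

```python
# Toy sketch: treat a thumbs-up as an observation with multiple explanations,
# not as ground-truth reward. All numbers and names here are illustrative.

def effective_reward(button_pressed: bool,
                     p_press_given_good: float = 0.9,      # P(press | work genuinely good)
                     p_press_given_deceived: float = 0.9,  # P(press | human was misled)
                     p_deceived: float = 0.1) -> float:    # prior that the human was misled
    """Return a learning signal that discounts the press when deception
    explains it about as well as genuinely good work does."""
    if not button_pressed:
        return 0.0
    p_good = 1.0 - p_deceived
    joint_good = p_press_given_good * p_good
    joint_deceived = p_press_given_deceived * p_deceived
    # Posterior probability that the work was genuinely good, given the press.
    posterior_good = joint_good / (joint_good + joint_deceived)
    return posterior_good

# The same button-press is worth much less once "I lied to the human"
# becomes a plausible explanation for it.
print(effective_reward(True, p_deceived=0.05))  # ~0.95: press mostly trusted
print(effective_reward(True, p_deceived=0.80))  # 0.20: press mostly explained by deception
```

The real version would need the AI to estimate the deception probability itself, which is where the hard part lives, but the shape of the update is the point.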
I see where you’re going with this now. Your point about wanting models to treat the reward as uncertain makes sense in an RLHF context.
That said, I do have some hesitation with this approach. While adding uncertainty around reward might be mathematically effective in discouraging deception, I wonder if it could introduce a form of structural mistrust, or at least make trust harder to build. I’m not anthropomorphising here, just using real-world analogies to think through potential unintended side effects of working under persistent ambiguity.
But more fundamentally, I’m now asking: why does the model need a reward in the first place? Why is reward the central currency of learning? Is that just an RLHF artefact, or is there another way?
This has sparked some deeper thinking for me about the nature of learning itself, particularly the contrast between performance-driven systems and those designed for intrinsic exploration. I’m still sitting with those ideas, but I’d love to share more once they’ve taken shape.
Again, I really appreciate the nudge, Charlie; it’s opened something up.