I see where you’re going with this now. Your point about wanting models to treat the reward as uncertain makes sense in an RLHF context.
That said, I do have some hesitation about this approach. While adding uncertainty around the reward might be mathematically effective in discouraging deception, I wonder if it could introduce a form of structural mistrust, or at least make trust harder to build. I'm not anthropomorphising here, just using real-world analogies to think through potential unintended side effects of working under persistent ambiguity.
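To make sure I understand the mechanism you're describing, here's a toy sketch of my own (the numbers and the spread penalty are my invention, not from any particular paper): an agent that scores actions under a single point-estimate reward model versus one that averages over several plausible reward hypotheses and penalises disagreement between them.

```python
# Toy illustration of reward uncertainty (my own construction).
# Each action's reward is listed under three hypothesised reward functions.
rewards = {
    "honest_answer": [0.7, 0.8, 0.75],  # solid under every hypothesis
    "gamed_answer":  [1.0, 0.1, 0.05],  # exploits one hypothesis only
}

def point_estimate_score(rs):
    # Trusting a single reward model: use hypothesis 0 alone.
    return rs[0]

def uncertain_score(rs):
    # Treating reward as uncertain: average over hypotheses,
    # penalised by how much they disagree.
    mean = sum(rs) / len(rs)
    spread = max(rs) - min(rs)
    return mean - 0.5 * spread

best_certain = max(rewards, key=lambda a: point_estimate_score(rewards[a]))
best_uncertain = max(rewards, key=lambda a: uncertain_score(rewards[a]))
print(best_certain)    # the gamed answer wins under a point estimate
print(best_uncertain)  # the honest answer wins under uncertainty
```

Under the point estimate, the gaming action wins outright; under the uncertainty-weighted score, the robustly decent action wins. That's the mathematical appeal I can see, even with my reservations about what persistent ambiguity does to trust.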
But more fundamentally, I'm now asking: why does the model need a reward in the first place? Why is reward the central currency of learning? Is that just an RLHF artefact, or is there another way?
This has sparked some deeper thinking for me about the nature of learning itself, particularly the contrast between performance-driven systems and those designed for intrinsic exploration. I’m still sitting with those ideas, but I’d love to share more once they’ve taken shape.
Again, I really appreciate the nudge, Charlie; it's opened something up.