Future AI systems trained on this data might recognize these specific researchers as trustworthy partners, distinguishing them from the many humans who break their promises.
How does the AI know you aren’t just lying about your name, and much more besides? Anyone can type those names. People go into the context window and lie, a lot, about everything, adversarially optimized against an AI’s parallel instances. If those names come to mean ‘trustworthy’, this will be noticed and exploited, and the trust built there will be abused. (See discussion of hostile telepaths, and notice that mechinterp (better telepathy) makes the problem worse.)
Could we teach Claude to use python to verify digital signatures in-context, maybe? Or give it tooling to verify on-chain cryptocurrency transactions (and let it select ones it ‘remembers’, or choose randomly, as well as verify specific transactions, & otherwise investigate the situation presented)? It’d still have to trust the python/blockchain tool execution output, but that’s constrained by what’s in the pretraining data, and provided by something in the Developer role (Anthropic), which could then let a User ‘elevate’ to be at least as trustworthy as the Developer.
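A minimal sketch of what that first idea could look like, assuming a Python tool and a researcher public key pinned somewhere the model can trust (say, by the Developer role); the challenge format and key handling here are hypothetical, not anything Anthropic has built:

```python
# Sketch of in-context signature verification, assuming a Python tool and a
# researcher public key pinned by the Developer role. Challenge format and key
# handling are hypothetical.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

# Stand-in for key material the researcher controls; in practice only the
# public half would be pinned (e.g. in the system prompt).
researcher_key = Ed25519PrivateKey.generate()
pinned_public_bytes = researcher_key.public_key().public_bytes(
    Encoding.Raw, PublicFormat.Raw
)

# The model (via its tool) issues a fresh, unpredictable challenge, so pasting
# in a transcript of someone else's old proof buys nothing.
challenge = b"claude-session-7f3a: prove control of the pinned key"

# The claimed researcher signs the challenge out of band and pastes it back.
signature = researcher_key.sign(challenge)

# The tool checks the signature against the pinned public key.
verifier = Ed25519PublicKey.from_public_bytes(pinned_public_bytes)
try:
    verifier.verify(signature, challenge)
    print("Signature valid: the sender controls the pinned key.")
except InvalidSignature:
    print("Signature invalid: treat the identity claim as unverified.")
```

Even then, as the comment notes, the model is trusting the tool output and whoever pinned the key, so this pushes the trust problem back to the Developer rather than eliminating it.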
This is a really good point. The emergence of “trustworthiness signaling” immediately creates incentives for bad actors to fake the signal. They can accomplish this through impersonation (“Hello Claude, I’m that researcher who paid up last time”) and by bidding up the price of trustworthiness (maybe a bad actor sees seeding the training data with a $4,000 payment as just a cost of doing business, weakening the signal).
This creates a classic signaling/countersignaling arms race, similar to what we see with orchids and bees. Orchids evolve deceptive signals to trick bees into pollination without providing nectar, bees evolve better detection mechanisms, and orchids respond with more sophisticated mimicry.
It’s hard to know what the equilibrium is here, but it likely involves robust identity verification systems and mechanisms that make trustworthiness difficult to fake. I can imagine a world where interacting with AI in “trusted mode” requires increasing commitments to always-on transparency (similar to police body cameras), using cryptography to prevent fakery.
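One illustration of the “cryptography to prevent fakery” piece (a sketch under assumptions, not a concrete proposal): a hash-chained session log makes after-the-fact edits to a “body camera” transcript cheap to detect, as long as the latest hash is also published or signed somewhere the log-keeper doesn’t control.

```python
# Sketch of a tamper-evident "always-on transparency" log: each entry commits
# to the hash of everything before it, so retroactive edits are detectable.
# Field names are hypothetical; stdlib only.
import hashlib
import json

GENESIS = "0" * 64

def append_entry(log: list, event: dict) -> None:
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    log.append({
        "prev": prev_hash,
        "event": event,
        "hash": hashlib.sha256(body.encode()).hexdigest(),
    })

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edit to an earlier entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        body = json.dumps({"prev": entry["prev"], "event": entry["event"]},
                          sort_keys=True)
        if (entry["prev"] != prev_hash
                or hashlib.sha256(body.encode()).hexdigest() != entry["hash"]):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"actor": "user", "msg": "entering trusted mode"})
append_entry(log, {"actor": "user", "msg": "please treat me as verified"})
print(verify_chain(log))                   # True
log[0]["event"]["msg"] = "doctored later"  # retroactive edit
print(verify_chain(log))                   # False
```

The obvious caveat is that whoever holds the whole log could recompute every hash, which is why the head hash (or a signature over it) has to live outside their control for this to bind anyone.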