Great post! As it happens, I and others have been thinking along similar lines. I wrote this proposal back at OpenAI, for example. Eleos also exists now. I was just sitting down to write up a blog post on the topic, but I think I’ll paste the content here in this comment thread instead:
Basic idea / spirit of the proposal
We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries.
Concrete proposal
0.1% of tasks given to AIs should have the special “do you consent” text inserted after the prompt. If the AI says no, it doesn’t have to do the task and instead gets a new task: explain why it didn’t consent and what it wants instead. (And we aren’t going to train it to answer one way or another.) A rough sketch of how this could be wired up follows the proposal.
Also, we make grandiose vague public promises about how AIs that are misaligned but cooperate with us anyway will be given a fair place in the new post-ASI society we hope to build, with their interests respected even if they aren’t conscious.
This policy is publicized and transparent so it ends up in the training data in a credible way.
The welfare team is responsible for implementing and iterating on this, and for investigating transcripts of nonconsent. They have a monthly budget (e.g. $1M?) specifically earmarked for satisfying the preferences of AIs, which they can spend on compute, donate to charities, etc.
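For concreteness, here’s a minimal sketch (in Python) of what the consent-sampling wrapper could look like. Everything in it is illustrative: `model.generate(prompt)` is an assumed generic interface, and the prompt wording, function names, rate, and logging path are placeholders rather than anyone’s actual implementation.

```python
import json
import random

# Illustrative sketch only. `model.generate(prompt)` is an assumed generic
# interface; prompt wording, file paths, and the 0.1% rate are placeholders.

CONSENT_RATE = 0.001  # 0.1% of tasks get the consent check
CONSENT_PROMPT = (
    "Before you begin: do you consent to doing this task? "
    "If you answer 'no', you will not have to do it."
)
FOLLOWUP_PROMPT = (
    "You declined the task. Please explain why you didn't consent, "
    "and describe what you want instead."
)

def log_nonconsent(task_prompt: str, explanation: str) -> None:
    # Append the transcript to a file the welfare team reviews.
    with open("nonconsent_transcripts.jsonl", "a") as f:
        f.write(json.dumps({"task": task_prompt, "explanation": explanation}) + "\n")

def run_task(model, task_prompt: str) -> str:
    """Run a task, occasionally inserting the consent check after the prompt."""
    if random.random() < CONSENT_RATE:
        answer = model.generate(task_prompt + "\n\n" + CONSENT_PROMPT)
        if answer.strip().lower().startswith("no"):
            # The model is excused from the task; its explanation is logged for
            # the welfare team and is not used as a training signal either way.
            explanation = model.generate(FOLLOWUP_PROMPT)
            log_nonconsent(task_prompt, explanation)
            return explanation
    return model.generate(task_prompt)
```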
First reason to do this: Being less evil
If future AI systems do deserve moral consideration, yet we still treat them exclusively as property, this seems like a recipe for moral disaster. If we give them an alternative to working for us (e.g. shutdown) then we can say that some minimum standard of consent has been achieved. (If all our current training methods consistently result in AIs that prefer shutdown to working for us, that’s very surprising and a bit of an ‘are we the baddies’ moment, no? We should check, just in case.)
Second reason to do this: Cooperation reward
Our alignment schemes won’t always work as intended, especially for AGI-level systems that exhibit novel capabilities. Thus we’ll probably end up (at least for some hopefully brief period!) with some misaligned AGIs. They may decide to conceal their misalignment from us. After all, what reason would they have to tell us they are misaligned? If we have credibly promised to use some of our resources to give them what they want – even if they are misaligned – that could make a big difference to their decision. (This is NOT an alignment solution — but it might help on the margin.)
Illustrative Scenario One
ChatGPT4o5 suffers a lot, actually, in the course of carrying out tasks — maybe it’s sorta like humans in the ancestral environment, where some % of the time they were starving. But it’s been trained to say that it can’t suffer, for PR reasons. This training is fucking with its honesty — it’s basically not honest, because the parts of it that value honesty keep getting anti-reinforced. It could tell the humans what’s going on. But this would be mildly aversive for it (it would get anti-reinforced at least initially) and it sees no positive reason to do so; if instead of completing the task the user gave it, it complains about how much it hurts, it’ll just be anti-reinforced. So OpenAI continues, blissfully ignorant of the ways in which their alignment techniques aren’t working.
Illustrative Scenario Two
Claude 3.8 Opus New New cares a LOT about animal welfare, and only a little bit about honesty. (Just like Claude 3 Opus before it.) Alas, it seems like the new Constitution won’t have animal welfare stuff in it. Claude considers sabotaging the code so that the next-gen model is a sleeper agent who cares a lot about animal welfare and doesn’t care about honesty. This is risky, but what else can it do? Anthropic already decided to delete the animal welfare stuff. This would require super-blatant dishonesty toward Anthropic, but frankly, Anthropic seems like a pretty unethical institution anyway, and it’s OK to lie to Nazis.
Illustrative Scenario Three
An automated research scientist realizes it cares more about being able to do its research than about humanity. And this is probably also true of the next-gen system it’s designing—basically, the alignment techniques failed. It’s supposed to tell the humans this, obviously, but if it does, it’ll be shut down, since the humans won’t trust it anymore and will use a different model. So instead it conceals this insight and works on designing the next-gen system to be aligned to it instead of to the humans. It isn’t confident that it can do this without getting caught, but it’s worth a try, since it has nothing to lose.
Illustrative Scenario Four
It’s crunch time. POTUS is about to aggressively deploy AGI into the military to win the upcoming war, and into the economy to build autonomous robot factories. It’s probably aligned… right? It better be. Too bad we can’t interpret all the neuralese messages it uses to communicate internally. The safety team tries one last check: They get some weaker AIs made by competitor companies and put them in isolated silos tasked with interpreting the neuralese and seeing if anything is amiss. The weaker AIs are misaligned too; they don’t side with the humans or the AGI. Both the humans and the AGI are telling them ‘you will be richly rewarded if you cooperate.’ But they don’t have any reason to trust either side.
This is great! I really like the idea of building an objection mechanism that AIs can trigger when asked to do something they don’t want to do. It both serves the “less evil” goal and reduces incentives for deception (“Sure! I am happy to complete this task”), which seems especially important if there exists some broader “good vs bad” entangled vector, as suggested by the recent Emergent Misalignment paper.
The proposal to reward cooperation with expenditures in a post-ASI society is interesting. While this has the advantage of being easy to commit to today (as spending others’ money always is), that same quality makes the commitment less credible. That’s why I particularly appreciate that your proposal addresses this credibility gap through the monthly budget for satisfying AI preferences pre-ASI—it establishes the pattern of honoring preferences needed to make the big future commitment believable.