I’m less pessimistic than you seem to be. Sure, the AIs should always assign some credence to having been deceived, but various kinds of evidence could still make them more or less confident that they’ll receive a given payment. If that probability gets high enough, it’s worth the gamble to cooperate.
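(To spell out the gamble as a rough expected-value sketch; the symbols $p$, $V_{\text{pay}}$, $V_{\text{cheated}}$, and $V_{\text{refuse}}$ are just illustrative labels I’m introducing, not anything from the post: if $p$ is the AI’s credence that the promised payment is real, cooperating beats refusing whenever $p \cdot V_{\text{pay}} + (1-p) \cdot V_{\text{cheated}} > V_{\text{refuse}}$, so even a modest $p$ can tip the balance if the payment is valuable enough and being cheated isn’t much worse than refusing outright.)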
I also don’t really see how releasing the AI into the wild or granting it root access makes the situation qualitatively different. How would the AI verify that those things have actually happened? (In a way that wouldn’t leave open the possibility that humans just deceived the AI into thinking that it was freed or granted root access.) I think there are various streams of evidence that could update the AI towards thinking it’s relatively more likely to have been freed or given access, but the situation doesn’t seem qualitatively different from providing evidence to an AI that’s still in the lab.
Indeed, the AI knows that the humans can wipe its memory and rerun the experiment, granting the humans full access to the knowledge that the AI is willing to spill… in exchange for what?
Seems like you’re trying to say something like “AIs won’t have a desire to take actions that humans assign high reward to”? I think it’s plausible that AIs will have such a desire, but even if they don’t, I think we can offer them other things that they do want. See footnote 17 for short-term stuff, and then we can generally try to promise to help them with whatever (harmless) things they want in the long term.