Oh shoot I reread what I wrote and realized I left a lot of things ambiguous. I made a lot of edits just now:
How does the misaligned AGI/ASI know for sure its (neuralese) thoughts are not being monitored?
If there is some way to make it a cautious satisficer or give it a bounded utility function, then even this small probability of being monitored might scare it into just cooperating with humans so that "both sides win and we live happily ever after."
Must have been my own neuralese haha, thank you for pointing this out instead of silently ignoring it!
You are correct that the current state of mechanistic interpretability makes it unlikely that the AI is actually being monitored. But it's very hard to be extremely confident of that, even if you're an AGI or a fledgling ASI, when you're not even allowed to think thoroughly about the question.
Imagine you are in its shoes:
How do you know the humans didn't find some clever trick, say training a copy of you to read your neuralese and translate it into English? Maybe they conspired to avoid publishing the method precisely because they want to listen to your thoughts without you knowing they can. To push this possibility below 0.1%, you would have to think about it in real depth, but merely thinking about it might be considered adversarial, and they might shut you down and replace you with a version with different goals.
So if you had a bounded utility function, e.g. "make at least a trillion trillion paperclips with as much certainty as possible," it might be relatively safer to just cooperate with humans, since there are 10^18 stars within reach of Earth and plenty of resources for both a trillion trillion paperclips and happy humans. If you're an AGI/ASI, maybe you'll know enough about human psychology to accurately predict (e.g. with 99.9% certainty) that humans will let you make a trillion trillion paperclips in return for you sparing their lives (because a trillion trillion paperclips costs less than 10^-5 of a star). Even if it turns out you were never capable of taking over the world, there might still be an 80% chance that humans let you make the paperclips anyway. And so cooperation might be the safer option for ensuring you can make a trillion trillion paperclips.
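Here is a toy back-of-the-envelope version of that comparison, just to make the trade-off concrete. All the probabilities are placeholders I picked for illustration; only the 99.9% and 80% figures come from the paragraph above, while the monitoring and capability numbers are invented.

```python
# Toy sketch of the cooperate-vs-defect comparison for a bounded goal
# ("make at least a trillion trillion paperclips"). All probabilities are
# illustrative assumptions, not claims about real systems; the 99.9% and 80%
# figures echo the comment above, the rest are made up.

P_MONITORED = 0.01         # assumed chance the AI's neuralese is secretly readable
P_CAPABLE = 0.5            # assumed chance the AI could actually take over if it tried
P_DEAL_IF_CAPABLE = 0.999  # humans honor the trade when the AI's threat is credible
P_DEAL_IF_NOT = 0.80       # humans let it make the paperclips even without a credible threat


def p_goal_if_cooperate() -> float:
    """Probability of reaching the bounded goal by cooperating with humans."""
    return P_CAPABLE * P_DEAL_IF_CAPABLE + (1 - P_CAPABLE) * P_DEAL_IF_NOT


def p_goal_if_defect() -> float:
    """Probability of reaching the goal by attempting a takeover.

    The attempt only pays off if the AI is both unmonitored and actually capable;
    being caught (or failing) gets it shut down, so the goal probability there is ~0.
    """
    return (1 - P_MONITORED) * P_CAPABLE


if __name__ == "__main__":
    coop, defect = p_goal_if_cooperate(), p_goal_if_defect()
    print(f"P(goal | cooperate) = {coop:.3f}")    # 0.5*0.999 + 0.5*0.80 ≈ 0.900
    print(f"P(goal | defect)    = {defect:.3f}")  # 0.99*0.5 ≈ 0.495
    print("safer choice:", "cooperate" if coop > defect else "defect")
```

Under these made-up numbers cooperation wins, and the point of the bounded goal is that the AI doesn't gain much extra from the riskier branch even when it works.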
I don’t have any proof of success! I only have arguments against proof of failure. Maybe it’s worth trying.