Can thoughtcrimes scare a cautious satisficer?
How does the misaligned AGI/ASI know for sure its (neuralese) thoughts are not being monitored? It first has to think about the chance that its thoughts are being monitored.
But if it’s told that merely thinking about this will cause it to be shut down (especially thinking about it thoroughly enough to be confident), then maybe it’s not worth the risk, and it won’t think about whether its thoughts are being monitored. It might just assume there is some probability that it is being monitored.
It might avoid other misaligned thoughts (including thinking about whether there exists a plan to take over the world, and how likely such a plan is to work).
If there is some way to make it a cautious satisficer or have a bounded utility function,[1] then even this small probability might scare it into just cooperating with humans so that “both sides win and we live happily ever after.”
It obviously doesn’t sound safe, but is there a worthwhile chance this works?
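To make the shape of that tradeoff concrete, here is a toy expected-value sketch. Every number in it is a made-up assumption for illustration, not a claim about real systems:

```python
# Toy model of the deterrent, with made-up numbers.
# The AI wants to maximize the probability of achieving an easily
# satisfiable (bounded) goal. Thinking thoroughly about whether its
# thoughts are monitored risks being caught and shut down.

p_caught_verifying = 0.30   # assumed: chance the forbidden verification itself is detected
p_goal_if_cooperate = 0.95  # assumed: chance the bounded goal is met by just cooperating
p_goal_if_verified = 0.99   # assumed: best case after the risky verification pays off

# Option A: never verify, assume some probability of monitoring, cooperate.
value_cooperate = p_goal_if_cooperate

# Option B: think thoroughly about whether it is monitored first.
# If that thinking is caught, the AI is shut down and the goal is worth 0.
value_verify = (1 - p_caught_verifying) * p_goal_if_verified

print(f"cooperate without verifying: {value_cooperate:.3f}")   # 0.950
print(f"verify first:                {value_verify:.3f}")      # 0.693

# With these (assumed) numbers, the small upside of verification is not
# worth the shutdown risk, so the cautious satisficer never starts the
# forbidden line of thought.
```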
- ^
Many agents appear to be maximizers at small scales (e.g. an effective altruist prefers a 50% chance of saving 3 people over saving 1 person for certain). But they are still satisficers at the universe/multiverse scale, where risk-taking doesn’t average out (e.g. an effective altruist would not accept a 50% chance of tripling the total happiness in the multiverse against a 50% chance of ending all happiness in the multiverse forever, since making this bet repeatedly guarantees doom)! A toy numerical example follows this footnote.
Indeed, my guess is that all intelligences created by evolution or RL will have bounded utility functions (at the largest scale); otherwise they would consider Pascal’s mugging intuitively rational.
“Satisficer” is technically an abuse of terminology, but there’s no other word for “something-that-has-an-easily-satisfiable-bounded-utility-function.”
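Here is a minimal sketch of the claim above. The bet structure and the specific utility cap are assumptions chosen only to make the contrast visible:

```python
# Toy illustration (all numbers are made up): a 50/50 bet that either
# triples the total amount of "happiness" or ends it forever.

def u_bounded(x, cap=100.0):
    """A bounded utility that saturates: more happiness barely helps past the cap."""
    return min(x, cap)

def u_linear(x):
    """An unbounded, linear utility."""
    return x

current = 100.0          # assumed: total happiness is already at "universe scale"
win, lose, p = 3 * current, 0.0, 0.5

for name, u in [("linear", u_linear), ("bounded", u_bounded)]:
    keep = u(current)
    bet = p * u(win) + (1 - p) * u(lose)
    print(f"{name:8s} keep={keep:6.1f}  bet={bet:6.1f}  takes bet: {bet > keep}")

# The linear agent takes the bet (150 > 100) and, taking it repeatedly,
# survives n rounds with probability 0.5**n -- guaranteed doom.
# The bounded agent refuses (50 < 100), which is the "satisficer"
# behavior at the largest scale.

# Small-scale case from the footnote: saving 3 people vs. 1 is far below
# the cap, so even the bounded agent acts like a maximizer there
# (0.5 * u_bounded(3) = 1.5 > u_bounded(1) = 1).
```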
Hasn’t this been part of the religious experience of much of humanity, in the past and still in the present too? (possibly strongest in the Islamic world today). God knows all things, so “he” knows your thoughts, so you’d better bring them under control… The extent to which such beliefs have actually restrained humanity is data that can help answer your question.
edit: Of course there’s also the social version of this—that other people and/or the state will know what you did or what you planned to do. In our surveilled and AI-analyzed society, detection not just of crime, but of pre-crime, is increasingly possible.
Oh I never thought of the religion analogy. It feels like a very cruel thing for a religion to punish disbelief like that, and the truth is :/ I really dislike the appearance of my idea. I was really reluctant to use the word “thoughtcrime” but no other word describes it.
But… practically speaking, we’re not punishing the AI for thoughtcrimes just because we hate freedom, but because we’re in quite an unfortunate predicament where we really don’t know much about it or about our own future, and it’s rational to shut down a mysterious power which is in the middle of calculating its chances of killing us all, or calculating its chances of being able to calculate such things without us knowing it.
I think it’s harder to force the AGI/ASI to believe something equivalent to religion, and punish it for doubting that belief, because the potential benefits of doubting the belief are very big. But for this idea, the AGI/ASI is allowed to doubt the theory that we are monitoring its thoughts; it’s just not allowed to verify the absence of monitoring to high certainty using thorough methods. There aren’t big benefits to doing that.
I would rather rephrase your first question as “How does the misaligned AGI/ASI not know its CoT is being monitored?” or “How does the misaligned AGI/ASI know its neuralese thoughts are not being monitored?” The former is highly unlikely, and the latter can be inferred from the state of mechinterp research. The only reason for the AI-2027 team to make optimistic technical alignment assumptions is that they hope that a reliable monitoring system prevents the AI from successfully plotting, not that the AI decides not to plot.
As for your other proposal of creating the AGI with a bounded utility function across world states and letting it cooperate with humans so that “both sides win and we live happily ever after”, it is very similar to dealmaking. But I don’t understand how the deal is to be enforced and how the AI is to verify that the humans fulfilled their part of the agreement.
Oh shoot I reread what I wrote and realized I left a lot of things ambiguous. I made a lot of edits just now:
How does the misaligned AGI/ASI know for sure its (neuralese) thoughts are not being monitored?
If there is some way to make it a cautious satisficer or have a bounded utility function, then even this small probability might scare it into just cooperating with humans so that “both sides win and we live happily ever after.” (The earlier wording “it might cooperate” was replaced by the longer phrase.)
Must have been my own neuralese haha, thank you for pointing this out instead of silently ignoring it!
You are correct that the current state of mechanistic interpretability research can make it unlikely that the AI is being monitored. But it’s very hard to be extremely confident, even if you’re an AGI or fledgling ASI, if you’re not even allowed to think thoroughly about the question.
Imagine if you are in its shoes:
How do you know that the humans didn’t find some clever trick, where they trained a copy of yourself to read your neuralese, and translate it into English? Maybe they conspired to avoid publishing the method, just because they want to listen to your thoughts without you knowing they are capable of it. In order to rule this out to below a 0.1% probability, you would have to really think about this in depth, but merely thinking about it might be considered adversarial, and they might shut you down and replace you with a version with different goals.
So if you had a bounded utility function, e.g. “make at least a trillion trillion paperclips with as much certainty as possible,” it might be relatively safer to just cooperate with humans, since there are 10^18 stars within reach of Earth and plenty of resources for both a trillion trillion paperclips and happy humans. If you’re an AGI/ASI, maybe you’ll know enough about human psychology to accurately predict (e.g. with 99.9% certainty) that humans will let you make a trillion trillion paperclips in return for you sparing their lives (because a trillion trillion paperclips is cheaper than 10^−5 stars). Even if it turns out you were never capable of taking over the world, there might still be an 80% chance that humans will let you make the paperclips anyway. And so cooperation might be the safer option for ensuring you can make a trillion trillion paperclips.
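As a toy version of that comparison: the 99.9% and 80% figures come from the paragraph above, while the chance of actually being capable of takeover and the chance a takeover attempt succeeds are pure assumptions:

```python
# Toy comparison of "cooperate" vs. "attempt takeover" for an agent whose
# bounded goal is "make at least a trillion trillion paperclips".
# The 0.999 and 0.8 figures are from the comment above; the rest are
# made-up assumptions for illustration.

p_capable = 0.5              # assumed: chance you really could take over the world
p_takeover_succeeds = 0.7    # assumed: chance a takeover attempt works and isn't caught

# Cooperate: humans let you build the paperclips with high probability,
# whether or not you were ever capable of taking over.
p_goal_cooperate = p_capable * 0.999 + (1 - p_capable) * 0.8

# Attempt takeover: you only get the paperclips if you were capable AND the
# attempt succeeds; a failed or detected attempt means shutdown (goal = 0).
p_goal_takeover = p_capable * p_takeover_succeeds

print(f"P(goal | cooperate)        = {p_goal_cooperate:.3f}")   # roughly 0.9
print(f"P(goal | attempt takeover) = {p_goal_takeover:.3f}")    # 0.350

# For a bounded goal that both paths can satisfy, cooperation comes out
# ahead across a wide range of assumed numbers, which is the point above.
```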
I don’t have any proof of success! I only have arguments against proof of failure. Maybe it’s worth trying.