Can thoughtcrimes scare a cautious satisficer?

How does a misaligned AGI/ASI know for sure that its (neuralese) thoughts are not being monitored? It first has to think about the chance that its thoughts are being monitored.

But if it’s told that merely thinking about this will cause it to be shut down (especially thinking about it thoroughly enough to be confident), then maybe it’s not worth the risk, and it won’t think about whether its thoughts are being monitored. It might just assume there is some probability that it is being monitored.

It might avoid other misaligned thoughts as well (including thinking about whether a plan to take over the world exists, and how likely such a plan would be to work).

If there is some way to make it a cautious satisficer, or to give it a bounded utility function,[1] then even this small probability of being monitored might scare it into just cooperating with humans, so that “both sides win and we live happily ever after.”
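
To make the bet concrete from the AI’s side, here is a toy expected-utility sketch in Python (all the numbers are made up for illustration): if the utility function is bounded and cooperation already nearly satisfies it, the upside of a takeover is small while getting caught forfeits everything, so even a modest monitoring probability favors cooperating.

```python
def defect_eu(p_monitored: float, u_takeover: float, u_shutdown: float) -> float:
    # Expected utility of pursuing the takeover plan: if the AI's thoughts are
    # monitored it gets shut down; otherwise (in this toy model) the plan succeeds.
    return p_monitored * u_shutdown + (1 - p_monitored) * u_takeover

# Made-up utilities on a bounded scale: cooperation already nearly satisfies
# the AI, takeover is only slightly better, and shutdown forfeits everything.
u_cooperate, u_takeover, u_shutdown = 0.99, 1.0, 0.0

for p in (0.005, 0.02, 0.1):
    print(f"p(monitored)={p}: defect EU={defect_eu(p, u_takeover, u_shutdown):.3f} "
          f"vs cooperate EU={u_cooperate}")
# Because cooperation is already close to the utility cap, any monitoring
# probability above 1% makes cooperating the better bet in this toy model.
```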

This approach obviously doesn’t sound safe, but is there a worthwhile chance that it works?

  1. ^

    Many agents appear to be maximizers at small scales (e.g. an effective altruist prefers a 50% chance of saving 3 people over saving 1 person for certain). But they are still satisficers at the universe/multiverse scale, where risk-taking doesn’t average out (e.g. an effective altruist would not take a bet with a 50% chance of tripling the total happiness in the multiverse and a 50% chance of ending all happiness in the multiverse forever, since taking this bet repeatedly guarantees doom)!
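
    To spell out the arithmetic behind “taking this bet repeatedly guarantees doom”, here is a small Python sketch (my own illustration, stylizing the bet as triple-or-nothing on total happiness): the expected value per bet is 1.5x, yet the probability that any happiness is left after n bets is (1/2)^n.

    ```python
    def expected_happiness(n_bets: int, initial: float = 1.0) -> float:
        # Each bet triples happiness with probability 0.5 and zeroes it otherwise,
        # so the expected value is 1.5x per bet, a favorable gamble for a maximizer.
        return initial * 1.5 ** n_bets

    def p_anything_left(n_bets: int) -> float:
        # Happiness survives only if every single all-or-nothing bet succeeds.
        return 0.5 ** n_bets

    for n in (1, 10, 50):
        print(f"after {n} bets: E[happiness] = {expected_happiness(n):.3g}, "
              f"P(happiness > 0) = {p_anything_left(n):.3g}")
    # The expected value explodes while the chance of not having destroyed
    # everything shrinks toward zero; repeated betting almost surely ends in doom.
    ```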

    Indeed, my guess is that all intelligences created by evolution or RL will have bounded utility functions (at the largest scale); otherwise they would consider Pascal’s mugging intuitively rational.
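
    As a toy illustration (the numbers below are invented): with an unbounded utility function, a mugger can always quote a promised payoff large enough that paying has positive expected utility no matter how tiny the probability, whereas a bounded utility function caps what any promise can be worth.

    ```python
    def pay_the_mugger(p_claim_true: float, promised_utility: float,
                       cost_of_paying: float, utility_cap: float | None = None) -> bool:
        # Expected gain from paying: a tiny probability times the promised payoff,
        # optionally capped if the agent's utility function is bounded.
        payoff = promised_utility if utility_cap is None else min(promised_utility, utility_cap)
        return p_claim_true * payoff > cost_of_paying

    p, promise, cost = 1e-12, 1e30, 1.0  # invented numbers, purely for illustration

    print(pay_the_mugger(p, promise, cost))                   # True: unbounded EU says pay up
    print(pay_the_mugger(p, promise, cost, utility_cap=1e3))  # False: bounded utility shrugs it off
    ```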

    “Satisficer” is technically an abuse of terminology, but there’s no other word for “something-that-has-an-easily-satisfiable-bounded-utility-function.”