Shannon’s information theory tells us how much information is sent from a sender to a receiver via a channel. But the implications of that information can be very different depending on how much the receiver trusts the sender, in a way which seems like it might point to some interesting theoretical concepts.
Imagine you’ve just received a message M. There are two possibilities for who it came from: your best friend, or your worst enemy. The message might contain an equal amount of information either way—however, you’ll react to it very differently. In the former case, you want to propagate the contents of the message directly into your world-model: whatever your friend says, you fully believe.
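To make that concrete: Shannon’s measure depends only on how surprising the string is, not on who sent it. A minimal sketch, with an invented prior probability for the string:

```python
import math

def self_information_bits(p_message: float) -> float:
    """Shannon self-information, in bits, of an event with probability p_message."""
    return -math.log2(p_message)

# Invented prior: you judge the exact string M equally (im)probable from
# either sender, so the channel delivers the same number of bits both ways.
p = 1 / 1024
print(self_information_bits(p))  # 10.0 bits, whether from friend or enemy
```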
By contrast, you definitely don’t want to just believe what your enemy tells you. Nor should you actively disbelieve it, though; otherwise they could still predictably mislead you (for instance, by telling you the truth). Instead, the main update you want to make is just “my enemy wants me to see the particular string of characters that constitutes message M”.
Another way of putting this: in the former case you update your world-model by conditioning on the semantics of the message, whereas in the latter case you update your world-model by conditioning on the syntax of the message. Why? Because you don’t trust your enemy enough for communication with them to support semantic meaning.
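Here’s a toy Bayesian version of that contrast, with made-up numbers: for the friend you condition on the likelihood that an honest sender asserts M, while for the enemy you condition only on the likelihood that this sender chooses to send the string M.

```python
# All numbers are illustrative assumptions, not derived from anything above.

prior_h = 0.5  # P(H): "the bridge is safe"

# Conditioning on semantics: a truthful friend asserts M = "the bridge is safe"
# far more often when it's actually true.
p_friend_says_m = {True: 0.9, False: 0.05}

# Conditioning on syntax: all you learn is that your enemy *chose* this string.
# If they benefit from you crossing an unsafe bridge, they send M more often
# when H is false.
p_enemy_sends_m = {True: 0.2, False: 0.6}

def posterior(prior: float, likelihood: dict[bool, float]) -> float:
    """P(H | evidence) by Bayes' rule over the binary hypothesis H."""
    joint_true = prior * likelihood[True]
    joint_false = (1 - prior) * likelihood[False]
    return joint_true / (joint_true + joint_false)

print(posterior(prior_h, p_friend_says_m))   # ~0.947: believe the friend
print(posterior(prior_h, p_enemy_sends_m))   # 0.25: enemy's M is evidence against H
```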
—
Now let’s push it a step further. Instead of a message from your friend or enemy, the message is either from an omnibenevolent god or an omnimalevolent devil. Here we also have a disparity in how much you trust them—but this time, both possible senders are (we’ll assume) far more intelligent than you, and could design messages which manipulate you in all sorts of ways.
So for the message from the devil, you probably don’t want to update “the devil wants me to see M”, because that still involves representing M in your mind, which could allow them to manipulate you. Instead you want to firewall yourself from their message, ideally by not even reading it at all.
The opposite is true for the message from god. In that case, even injecting the message directly into your world-model isn’t sufficient. Ideally you’d inject it straight into your policy—you’d meditate on the message, letting the words echo like poetry through your mind, until you’d internalized it on such a deep level that it was able to directly shape your instincts and impulses. After that point, you’d trust those instincts much more than by default—because they’d been rewired by someone who knows exactly what the right instincts for you to have are.
(This might be how an animal whose nervous system was finely-honed by evolution feels, by the way. Their evolved instincts are way smarter than they are, and so they have to trust them implicitly. And see also Sahil’s discussion of how cells can call for help, and get it, without any understanding of how or why that help arrived.)
I don’t know where to go with all of this but there’s something that seems very interesting here.
Another way of putting this: in the former case you update your world-model by conditioning on the semantics of the message, whereas in the latter case you update your world-model by conditioning on the syntax of the message. Why? Because you don’t trust your enemy enough for communication with them to support semantic meaning.
I don’t think that’s right. The syntax of the message doesn’t seem like a relevant subvariable. Saying “update on the syntax” feels like saying “update on the morphemes”.
It seems to be more about which simulacrum level you apply as a wrapper to the semantic content. Suppose the message is X. For a friend, you apply level 1 and obtain “My friend Alice honestly/truth-seekingly believes X.” For an adversary, you apply level 2 and obtain “My enemy Bob wants me to believe X.”
In the God example, you don’t apply anything; you take the unwrapped/raw semantics at face value (equivalently, apply a trivial wrapper). In the devil example, you apply an impenetrable wrapper that shields your mind from the message. Maybe there’s some interesting structure/taxonomy, within which all of those sorts of semantic wrappers can be embedded.
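A minimal sketch of what that taxonomy might look like, with hypothetical names, treating each wrapper as a function from raw content to the proposition you actually update on:

```python
from typing import Callable, Optional

# Hypothetical wrapper taxonomy: each wrapper maps raw semantic content X
# to the proposition the receiver updates on (or None if the message is
# never represented at all).
Wrapper = Callable[[str], Optional[str]]

def trivial(x: str) -> Optional[str]:    # god: take raw semantics at face value
    return x

def level_1(x: str) -> Optional[str]:    # friend: honest assertion
    return f"Alice honestly believes that {x}"

def level_2(x: str) -> Optional[str]:    # enemy: strategic assertion
    return f"Bob wants me to believe that {x}"

def firewall(x: str) -> Optional[str]:   # devil: refuse to represent the content
    return None

for wrap in (trivial, level_1, level_2, firewall):
    print(wrap.__name__, "->", wrap("the bridge is safe"))
```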
Saying “update on the syntax” feels like saying “update on the morphemes”. It seems to be more about which simulacrum level you apply as a wrapper to the semantic content.
I hadn’t thought of the simulacra levels connection; that’s useful, thanks. Though for what it’s worth, all of these things feel closely related to me. At simulacrum level 4 you should think of words as similar to other actions, i.e. you calculate the expected utility of uttering some sequence of words/morphemes, rather than distinguishing “true” from “false” from “meaningless” sentences. (I was intending the “worst enemy” thing to gesture at the idea that, given that you know this person is trying to hurt you with their words, there’s not even really a sense in which you can find their words meaningful.)
I think you like riffing, so I’ll riff.

Verbal communication is a game, in the game-theoretic sense. In a game with your enemy, like zero-sum heads-up poker, you’re trying to minimize how much value you communicate with your words, in the sense that when you say “call”, you want your opponent to be indifferent between their replies (so they have no best option to choose from, no economic profit available). You want to “maximize the entropy” of their subjective state after you communicate, or minimize the information[1] communicated.
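To make the indifference condition concrete, here’s the standard river-bluffing toy model (my numbers, not anything from the comment itself): with pot P and bet B, the bettor bluffs just often enough that calling and folding have equal value, and the caller calls just often enough that bluffing breaks even.

```python
# Standard river-bluffing arithmetic, as an illustration of indifference.
P, B = 100, 100  # a pot-sized bet

# Bluffing fraction q of the betting range that makes a bluff-catcher
# indifferent between calling and folding: q*(P + B) - (1 - q)*B = 0.
q = B / (P + 2 * B)

# Calling frequency c that makes the bettor indifferent to bluffing:
# a bluff risks B to win P, so (1 - c)*P - c*B = 0.
c = P / (P + B)

print(f"bluff {q:.0%} of bets, call {c:.0%} of the time")  # bluff 33%, call 50%
```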
Now what if the other player has a compute advantage? If you’re playing a game with the devil, or an exploitative chess engine, they’re going to try to hack you. Your best strategy isn’t necessarily ignoring them, but it’s a little trickier. It’s something like: instead of using your raw value function to choose your actions, come up with an ensemble of value functions you think are plausible for an entity with greater compute, figure out your best action under each of those guessed greater-compute value functions, and then marginalize out the guesses and choose the best action that way?[2] Which may also be how you “maximize” the value a friend with more compute can offer you?
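Something like this rough sketch, with invented weights and payoffs:

```python
# Rough sketch of the proposal: score actions under an ensemble of plausible
# value functions for the stronger player, weight by plausibility, and pick
# the best action after marginalizing the guesses out. Numbers are invented.

actions = ["engage", "stall", "ignore"]

# Hypothetical ensemble: (plausibility weight, value function over actions).
ensemble = [
    (0.5, {"engage": -5.0, "stall": 1.0, "ignore": 0.0}),
    (0.3, {"engage": 2.0, "stall": 0.5, "ignore": 0.0}),
    (0.2, {"engage": -10.0, "stall": -2.0, "ignore": 0.0}),
]

def marginal_value(action: str) -> float:
    """Expected value of an action, marginalizing over the ensemble."""
    return sum(weight * values[action] for weight, values in ensemble)

best = max(actions, key=marginal_value)
print({a: round(marginal_value(a), 2) for a in actions}, "->", best)
# {'engage': -3.9, 'stall': 0.25, 'ignore': 0.0} -> stall
```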
[1] I think the type of thing you’re minimizing is more like “value of information” in utils rather than raw quantity of information in bits?

[2] This is related to why you can bluff in chess. You make a “scary”-looking move that’s actually bad, but the other player may still have to play as if you see something they don’t.
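A toy worked example of footnote [1]’s distinction, with invented payoffs: a signal can carry a full bit of Shannon information and still be worth zero utils, since value of information is the gain in expected utility from acting on it.

```python
# World state is heads/tails with probability 0.5 each; the signal reveals it
# exactly (1 bit). But if every available action pays the same in both states,
# learning the state changes no decision, so the signal is worth 0 utils.
payoff = {"act_a": {"heads": 1.0, "tails": 1.0},
          "act_b": {"heads": 0.0, "tails": 0.0}}

def eu_without_signal() -> float:
    """Best expected payoff when acting on the prior alone."""
    return max(0.5 * v["heads"] + 0.5 * v["tails"] for v in payoff.values())

def eu_with_signal() -> float:
    """Best expected payoff when the action can depend on the revealed state."""
    return (0.5 * max(v["heads"] for v in payoff.values())
            + 0.5 * max(v["tails"] for v in payoff.values()))

print(eu_with_signal() - eu_without_signal())  # 0.0 utils, despite 1 bit received
```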