Communication Prior as Alignment Strategy
Alice has one of three objects:
A red triangle
A blue square
A red circle
She wants Bob to learn which object she has. However, Alice may only send one of three messages:
“My object is round”
“My object is red”
“This is not a pipe”
The rules of the game (i.e. the available messages) are common knowledge before the game starts. What message should Alice send for each object, and what object should Bob deduce from each message?
Let’s think it through from Bob’s standpoint. A clever human might reason like this:
“My object is round” implies it’s the red circle, because that’s the only round object.
“My object is red” implies it’s the red triangle, because only the triangle and circle are red, and Alice could have perfectly conveyed the information with “My object is round” if it were the circle.
“This is not a pipe” implies it’s the blue square, because Alice could have perfectly conveyed the information with one of the other two messages otherwise.
If you’ve played the game CodeNames, then this sort of reasoning might look familiar: “well, ‘blue’ seems like a good hint for both sky+sapphire and sky+water, but if it were sky+water they would have said ‘weather’ or something like that instead, so it’s probably sky+sapphire...”.
Intuitively, this sort of reasoning follows from a communication prior—a prior that someone is choosing their words in order to communicate. In everyday life, this comes up in e.g. the difference between connotation and denotation: when someone uses a connotation-heavy word, the fact that they used that word rather than some more neutral synonym is itself important information. More generally: the implication of words is not the same as their literal content. A communication prior contains a model of how-and-why-the-words-were-chosen, so we can update on the words to figure out their implications, not just their literal meanings.
Communication priors suggest an approach to certain problems in AI alignment. Intuitively, rather than saying “I want X” and the AI taking that completely literally (as computers generally do), the AI instead updates on the fact that I said “I want X”, and tries to figure out what those words imply about what I actually want. It’s like pushing the “do what I mean” button—the AI would try to figure out what we mean, rather than just doing what we say. Indeed, we could even have the AI treat its own source code as a signal about what I mean, rather than as instructions to be taken literally—potentially recognizing when the program we wrote is not quite the program we intended, and doing what we intended instead. (Obviously the program itself would need some open-ended introspection/self-modification capabilities to allow this.) As long as the initial code and initial model of me is “close enough”, the AI could figure out what I meant, and we’d have a “basin of convergence”—any close-enough code/model would converge to what we actually intended.
Of course, that all requires formalizing communication priors. This post sketches out a relatively simple version based on the Alice/Bob example above, then talks about the more complicated version needed for alignment purposes, and about what the approach does and does not do.
Formalizing a Communication Prior
We’ll continue to use the Alice/Bob example with the colored shapes from earlier, though we’ll use more general formulas. We’ll call the message $M$ and the intended meaning (i.e. object) $X$.
Our receiver (i.e. Bob) starts with some naive guess at the meaning $X$, just based on the literal content of the message. For instance, “My object is red” would, taken literally, imply that it’s either the triangle or the circle. We’ll write this naive guess as
$$P_0[X \mid M] := P[X \mid \text{“}M\text{”}]$$
This is basically just a Bayesian update. The only subtlety is the quotes around “$M$”: they distinguish the message $M$ (i.e. the letters “My object is red” on a screen) from the literal meaning of the message “$M$” (the fact that the object is red). The formula says that the naive guess at the intended meaning given the message (i.e. $P_0[X \mid M]$) is just a Bayesian update on the literal meaning of the message.
At this stage, assuming a uniform prior on the three objects, Bob would say that:
“My object is round” means it’s the circle
“My object is red” gives ½ chance each to circle and triangle
“This is not a pipe” gives ⅓ chance to each object
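The literal update above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original argument: the names `OBJECTS`, `LITERAL` and `literal_posterior` are made up for this sketch, and the uniform prior is the one assumed in the text.

```python
from fractions import Fraction

OBJECTS = ["red triangle", "blue square", "red circle"]

# Literal truth conditions: the set of objects each message is literally true of.
LITERAL = {
    "My object is round": {"red circle"},
    "My object is red": {"red triangle", "red circle"},
    "This is not a pipe": set(OBJECTS),
}

def literal_posterior(message, prior=None):
    """Naive guess: condition the prior on the literal content of the message."""
    # Assumption from the text: uniform prior over the three objects.
    prior = prior or {x: Fraction(1, len(OBJECTS)) for x in OBJECTS}
    weights = {x: prior[x] if x in LITERAL[message] else Fraction(0) for x in OBJECTS}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

for message in LITERAL:
    print(message, {x: str(p) for x, p in literal_posterior(message).items()})
```

Exact fractions are used so that the ½ and ⅓ values from the list above come out without floating-point noise.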
But at this point, Bob hasn’t accounted for all his information. He also knows that Alice chose the message to maximize the chance that Bob would guess the right object. So, let’s do another Bayesian update on the assumption that Alice chose the message to maximize the probability assigned to $X$ under $P_0$.
$$P_1[X \mid M] := \frac{1}{Z} P\left[M = \operatorname{argmax}_m P_0[X \mid m] \,\middle|\, X\right] P_0[X \mid M]$$
(Side note: here $Z$ is a generic symbol for the normalizer in the update, which would normally be $P\left[M = \operatorname{argmax}_m P_0[X \mid m] \,\middle|\, \text{“}M\text{”}\right]$. I’ll continue to use it going forward, since the exact things we’re implicitly conditioning on can be a bit confusing in a way which doesn’t add anything.) This is another Bayesian update, but this time starting from $P_0$ rather than the original prior. At this stage, Bob would say that:
“My object is round” means it’s the circle
“My object is red” means it’s the triangle, since “My object is red” is not the message which gives highest $P_0$ when $X$ is the circle.
“This is not a pipe” means it’s the square, since “This is not a pipe” would not give the highest $P_0$ when $X$ is the circle or triangle.
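As a worked instance, here is the arithmetic for the message “My object is red”. It assumes the update multiplies Bob’s literal posterior for each object by a likelihood that is 1 when the message is the unique best message for that object under the literal posterior and 0 otherwise, with the uniform prior from above.

$$
\begin{aligned}
\text{triangle:}\quad & 1 \cdot \tfrac{1}{2} = \tfrac{1}{2} && \text{(“red” gives the triangle } \tfrac{1}{2} \text{, versus } 0 \text{ for “round” and } \tfrac{1}{3} \text{ for “not a pipe”)} \\
\text{circle:}\quad & 0 \cdot \tfrac{1}{2} = 0 && \text{(“round” gives the circle } 1 \text{, which beats } \tfrac{1}{2}\text{)} \\
\text{square:}\quad & 0 \cdot 0 = 0 && \text{(“red” is literally false of the square)}
\end{aligned}
$$

The normalizer is the sum $\tfrac{1}{2} + 0 + 0 = \tfrac{1}{2}$, so the triangle ends up with probability $\tfrac{1}{2} / \tfrac{1}{2} = 1$.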
Let’s do one more step, just to illustrate. Bob still hasn’t used all his information: it’s not just that Alice chose the message to maximize the probability assigned to $X$ under $P_0$, she also chose it to maximize the probability assigned to $X$ under $P_1$. How did she choose the message to maximize both of these simultaneously? Well, given our formulas above, if $M$ maximizes $P_1[X \mid M]$, then that implies that $M$ maximizes $P_0[X \mid M]$ as well. However, the implication does not go back the other way in general; the fact that $M$ maximizes $P_1[X \mid M]$ is stronger.
Intuitively, we’re “ruling out” messages for each $X$ at each stage. Any message not ruled out at stage 1 was also not ruled out at stage 0 - the messages “not ruled out” for $X$ are precisely those which assign maximal probability to $X$ at all earlier stages.
Upshot: by choosing $M$ to maximize $P_1[X \mid M]$, Alice also implicitly chose $M$ to maximize $P_0[X \mid M]$.
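Anyway, next step: we form $P_2$ by updating on the fact that $M$ maximizes the probability assigned to $X$ under $P_1$.
$$P_2[X \mid M] := \frac{1}{Z} P\left[M = \operatorname{argmax}_m P_1[X \mid m] \,\middle|\, X\right] P_0[X \mid M]$$
Note that we’re still using $P_0$ as our prior in this update; that’s to avoid double-counting the fact that Alice is maximizing $P_0$, while still accounting for the literal content “$M$”. If we continue the chain, each subsequent step will look like
$$P_{k+1}[X \mid M] := \frac{1}{Z} P\left[M = \operatorname{argmax}_m P_k[X \mid m] \,\middle|\, X\right] P_0[X \mid M]$$
In this case, we find that $P_2$ is exactly the same as $P_1$ - the calculation has converged in finite time. More generally, we can say that Bob’s final probabilities should be
$$P_\infty[X \mid M] := \lim_{k \to \infty} P_k[X \mid M]$$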
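The chain of updates can be run mechanically. The sketch below is one way to do it, under two assumptions that the text does not specify: Alice breaks ties uniformly among equally good messages, and a message that no object would optimally send keeps its literal reading. All names (`LITERAL`, `step`, `iterate`) are made up for illustration, and the message labels are shortened.

```python
from fractions import Fraction

OBJECTS = ["triangle", "square", "circle"]
# Literal truth conditions of each message (labels shortened for this sketch).
LITERAL = {
    "round": {"circle"},
    "red": {"triangle", "circle"},
    "not a pipe": {"triangle", "square", "circle"},
}
MESSAGES = list(LITERAL)

def literal_listener():
    """Stage 0: uniform prior conditioned on the literal content of each message."""
    return {m: {x: Fraction(int(x in LITERAL[m]), len(LITERAL[m])) for x in OBJECTS}
            for m in MESSAGES}

def step(listener, p0):
    """Stage k+1 from stage k, always using the stage-0 posterior as the prior term."""
    new = {}
    for m in MESSAGES:
        weights = {}
        for x in OBJECTS:
            top = max(listener[n][x] for n in MESSAGES)
            best = [n for n in MESSAGES if listener[n][x] == top]
            # Assumption: Alice picks uniformly among tied optimal messages.
            likelihood = Fraction(1, len(best)) if m in best else Fraction(0)
            weights[x] = likelihood * p0[m][x]
        z = sum(weights.values())
        # Assumption: a message no object would optimally send keeps its literal reading.
        new[m] = {x: w / z for x, w in weights.items()} if z else dict(p0[m])
    return new

def iterate(max_steps=20):
    """Return the list of distinct stages, stopping once a step changes nothing."""
    p0 = literal_listener()
    chain = [p0]
    for _ in range(max_steps):
        nxt = step(chain[-1], p0)
        if nxt == chain[-1]:
            break
        chain.append(nxt)
    return chain

chain = iterate()
for k, stage in enumerate(chain):
    print(f"stage {k}:", {m: {x: str(p) for x, p in d.items()} for m, d in stage.items()})
```

For this game the chain contains only two distinct stages: the literal one, and the one in which each message points at a single object.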
As a Fixed Point
The argument above is very meta, and hard to follow. We can simplify it by using a fixed point argument instead.
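Instead of the whole sequence of updates, we’ll just start from $P[X \mid \text{“}M\text{”}]$ (i.e. the literal content of the message), and update in a single step on the fact that Alice is optimizing the message: Alice chooses the message $M$ to maximize the final probability $P[X \mid M]$.
$$P[X \mid M] = \frac{1}{Z} P\left[M = \operatorname{argmax}_m P[X \mid m] \,\middle|\, X\right] P[X \mid \text{“}M\text{”}]$$
This is a fixed-point formula for $P[X \mid M]$. Formally, the “communication prior” itself is $P\left[M = \operatorname{argmax}_m P[X \mid m] \,\middle|\, X\right]$.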
This is intuitively simple, but unfortunately $P[X \mid M]$ is extremely underdetermined by the fixed-point formula; there are many possible $P[X \mid M]$ we could choose, and $P_\infty$ is just one of them. Intuitively: we could map messages to objects any way we want, as long as we respect the literal content of the message. As long as Alice and Bob both know the mapping, Alice chooses her message according to the mapping, Bob decodes it the same way, and everything works out.
The fixed point formula is a criterion which any winning strategy must satisfy, but there are still many winning strategies.
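To see the underdetermination concretely, consider a hypothetical variant of the game that is not in the example above: two objects and two messages, where both messages are literally true of both objects. The sketch below checks which candidate listeners satisfy the fixed-point criterion, assuming uniform tie-breaking among equally good messages. All names here are invented for illustration.

```python
from fractions import Fraction
from itertools import permutations

OBJECTS = ["A", "B"]
MESSAGES = ["foo", "bar"]  # hypothetical messages, literally true of every object

# Both messages are literally uninformative, so the literal posterior is uniform.
LITERAL_POSTERIOR = {m: {x: Fraction(1, 2) for x in OBJECTS} for m in MESSAGES}

def update(listener):
    """Apply the right-hand side of the fixed-point formula to a candidate listener."""
    new = {}
    for m in MESSAGES:
        weights = {}
        for x in OBJECTS:
            top = max(listener[n][x] for n in MESSAGES)
            best = [n for n in MESSAGES if listener[n][x] == top]
            # Assumption: ties among optimal messages are broken uniformly.
            likelihood = Fraction(1, len(best)) if m in best else Fraction(0)
            weights[x] = likelihood * LITERAL_POSTERIOR[m][x]
        z = sum(weights.values())
        new[m] = {x: w / z for x, w in weights.items()} if z else dict(LITERAL_POSTERIOR[m])
    return new

def deterministic(mapping):
    """Listener that reads each message as exactly one object."""
    return {m: {x: Fraction(int(x == mapping[m])) for x in OBJECTS} for m in MESSAGES}

# Every one-to-one assignment of messages to objects is a candidate convention.
candidates = [deterministic(dict(zip(MESSAGES, perm))) for perm in permutations(OBJECTS)]
fixed_points = [c for c in candidates if update(c) == c]
print(len(fixed_points), "of", len(candidates), "deterministic mappings are fixed points")
```

In this variant both one-to-one mappings pass the check, and so does the uniform literal listener itself, so iterating from the literal starting point never leaves it. Which convention gets used has to come from somewhere other than the fixed-point formula.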
Our particular choice of $P_\infty$ comes from iteratively expanding the fixed-point formula, with initial point $P_0$. If either Alice or Bob decides to use this model, and the other knows that they’re using it, then it’s locked in.
More generally: each player’s optimal choices depend heavily on their model of the other player. Alice wants to act like Bob’s model of Alice, and Bob wants to act like Alice’s model of Bob. Then there’s the whole tower of Alice’s model of Bob’s model of Alice’s model of…. Our sequence $P_k$ shows what that tower looks like for one particular model of Alice/Bob.
Beyond Idealized Agents
The communication prior is where Alice and Bob’s models of each other enter. In this case, we’re effectively assuming that Alice is a perfect agent—i.e. she picks her message to perfectly optimize Bob’s posterior. This is an idealized communication prior for idealized agents.
For alignment, we instead want a model of how humans communicate. As people who’ve played CodeNames can confirm, humans do not reliably think through many levels of implications of their word-choices! We really want to update on something like (<rough-model-of-human> thinks $M$ results in high $P[X \mid M]$). The better the model of how the human chose $M$ based on what they want, the better the AI will be able to guess what we want (i.e. $X$) from our “messages”.
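One simple stand-in for a rough human model is a noisy speaker who favors messages in proportion to a power of the listener’s probability, rather than always picking the exact maximizer. This is an assumption made purely for illustration, not a model the argument above commits to; the `rationality` parameter and all function names are hypothetical. The sketch reuses the three-object game with shortened message labels and performs a single update on top of the literal listener.

```python
OBJECTS = ["triangle", "square", "circle"]
# Literal truth conditions of each message (labels shortened for this sketch).
LITERAL = {
    "round": {"circle"},
    "red": {"triangle", "circle"},
    "not a pipe": {"triangle", "square", "circle"},
}
MESSAGES = list(LITERAL)

def literal_listener():
    """Uniform prior conditioned on the literal content of each message."""
    return {m: {x: (1.0 / len(LITERAL[m]) if x in LITERAL[m] else 0.0) for x in OBJECTS}
            for m in MESSAGES}

def noisy_speaker(listener, rationality):
    """Hypothetical human model: chance of sending m for x is proportional to listener[m][x] ** rationality."""
    speaker = {}
    for x in OBJECTS:
        scores = {m: listener[m][x] ** rationality for m in MESSAGES}
        z = sum(scores.values())
        speaker[x] = {m: s / z for m, s in scores.items()}
    return speaker

def pragmatic_listener(rationality):
    """One update of the literal listener on the noisy speaker's message choice."""
    p0 = literal_listener()
    speaker = noisy_speaker(p0, rationality)
    posterior = {}
    for m in MESSAGES:
        weights = {x: speaker[x][m] * p0[m][x] for x in OBJECTS}
        z = sum(weights.values())
        posterior[m] = {x: w / z for x, w in weights.items()}
    return posterior

for rationality in (0.0, 1.0, 50.0):
    print(rationality, pragmatic_listener(rationality)["red"])
```

With `rationality` at 0 the speaker’s choice carries no information and the listener falls back to the literal reading. At 1, “red” already favors the triangle at about 0.69, and large values approach the idealized argmax speaker. How sensitive the listener’s conclusions are to this parameter is a small example of the kind of question one can ask about errors in the human model.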
To the extent that the AI is modelling the human modelling the AI, we still get the meta-tower and possibly a fixed point formula (depending on how good the model of the AI in the human’s head is). The AI can treat both its own code and the human-model as “messages”, and so potentially correct sufficiently-small errors in them.
What This Does And Does Not Do
In some sense, this idea solves basically none of the core problems of alignment. We still need a good-enough model of a human and a good-enough pointer to human values. We’d still like an AI architecture with goals stable under successor-construction. For maximum safety, we’d still ideally like some good-enough scaled-down tests and/or proofs that some subcomponents actually work the way we intuitively expect. Etc.
What this does buy us is a basin of convergence. On all of the key pieces, we just need to be “close enough” for the whole thing to work. Potentially being able to recover even from small bugs in the source code is a pretty nice selling point. Of course, there are probably basins of convergence for many approaches, but this one offers at least the possibility of being able to explicitly model the basin. How sensitive is the end result to errors along different dimensions of the human-model? That’s the sort of question which could be addressed (either theoretically or empirically) in toy models along these lines, and potentially lead to generalizable insights about which pieces matter more or less. In other words: we could potentially say things about how big the basin of convergence is, and along which directions it’s wide/narrow.
That said, I still think the biggest blocker—both for this approach and many others—is figuring out pointers to human values, and how pointers to real-world abstract objects/concepts work more generally. Right now, we don’t even understand the type-signature of a “pointer” in this sense, so it’s rather difficult to talk about a basin-of-convergence for human-value-pointers.