Decision Theory with the Magic Parts Highlighted

I. The Magic Parts of Decision Theory

You are throwing a birthday party this afternoon and want to decide where to hold it. You aren’t sure whether it will rain or not. If it rains, you would prefer not to have committed to throwing the party outside. If it’s sunny, though, you will regret having set up inside. You also have a covered porch which isn’t quite as nice as being out in the sun would be, but confers some protection from the elements in case of bad weather.

You break this problem down into a simple decision tree. This operation requires magic[1], to avert the completely intractable combinatorial explosion inherent in the problem statement. After all, what does “Rain” mean? A single drop of rain? A light sprinkling? Does it only count as “Rain” if it’s a real deluge? For what duration? In what area? Just in the back yard? What if it looks rainy but doesn’t rain? What if there’s lightning but not rain? What if it’s merely overcast and humid? Which of these things count as Rain?

And how crisply did you define the Indoors versus Porch versus Outdoors options? What about the option of setting up mostly outside but leaving the cake inside, just in case? There are about ten billion different permutations of what “Outdoors” could look like, after all—how did you determine which options need to be explicitly represented? Why not include Outside-With-Piñata and Outside-Without-Piñata as two separate options? How did you determine that “Porch” doesn’t count as “Outdoors” since it’s still “outside” by any sensible definition?

Luckily you’re a human being, so you used ineffable magic to condense the decision tree with a hundred trillion leaf nodes down into a tree with only six.
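
For concreteness, here is a minimal sketch in Python of the six-leaf tree you end up with. The labels are just my guess at how the magic carved things up.

```python
# The condensed decision tree: three options crossed with two weather
# states, six leaves in all. The specific labels are illustrative assumptions.
OPTIONS = ["Indoors", "Porch", "Outdoors"]
WEATHER = ["Rain", "Sun"]

decision_tree = {
    option: {weather: None for weather in WEATHER}  # leaves get utilities later
    for option in OPTIONS
}

for option, branches in decision_tree.items():
    for weather in branches:
        print(f"{option} / {weather}")
```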

You’re a rigorous thinker, so the next step, of course, is to assign utilities to each outcome, scaled from 0 to 100, in order to represent your preference ordering and the relative weight of those preferences. Maybe you do this explicitly with numbers, maybe you do it by gut feel. This step also requires magic; an enormously complex set of implicit understandings comes into play, allowing you to simply know how and why the party would probably be a bit better on the Porch in Sunny weather than Indoors in Rainy weather.

Be aware that there is not some infinitely complex True Utility Function that you are consulting or sampling from; you are simply served with automatically arising emotions and thoughts upon asking yourself these questions about relative preference, resulting in a consistent ranking and utility valuation.

Nor are these emotions and thoughts approximations of a secret, hidden True Utility Function; you do not have one of those, and if you did, how on Earth would you actually use it in this situation? How would you use it to calculate relative preference of Porch-with-Rain versus Indoors-with-Sun unless it already contained exactly that comparison of world-states somewhere inside it?
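
Still, to make the step concrete, here is one way that gut-feel valuation might cash out on the 0-to-100 scale. Every number below is an assumption of mine, standing in for the automatically arising emotions and thoughts described above.

```python
# Illustrative 0-100 utilities for the six leaves. The numbers are invented,
# standing in for the gut-feel valuations described in the text.
utility = {
    ("Outdoors", "Sun"): 100,  # the party you were hoping for
    ("Porch",    "Sun"):  75,  # pleasant, but not quite the full sun experience
    ("Porch",    "Rain"): 60,  # sheltered, still some atmosphere
    ("Indoors",  "Rain"): 55,  # cozy enough, no regret
    ("Indoors",  "Sun"):  40,  # watching the sunshine through a window
    ("Outdoors", "Rain"):  0,  # soggy cake, soggy guests
}

for outcome, value in sorted(utility.items(), key=lambda kv: -kv[1]):
    print(outcome, value)
```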

Next you perform the trivial-to-you act of assigning probabilities to Rain versus Sun, which of course requires magic. You have to rely on your previous, ineffable distinction of what Rain versus Sun means in the first place, and then aggregate vast amounts of data into a relatively reliable probability estimate: what the sky looks like and what the air feels like (with your lifetime of experience guiding how you interpret what you see and feel), what three different weather reports say, weighted by ineffable assignments of credibility, and what all of that implies for your specific back yard and the timing of the party.

What’s that, you say? An ideal Bayesian reasoner would be able to do this better? No such thing exists; it is “ideal” because it is pretend. For very simple reasons of computational complexity, you cannot even really approach “ideal” at this sort of thing. What sources of data would this ideal Bayesian reasoner consult? All of them? Again you have a combinatorial explosion, and no reliable rule for constraining it. Instead you just use magic: You do what you can with what you’ve got, guided by heuristics and meta-heuristics baked in by Natural Selection and life experience, projected onto a simplified tree-structure so that you can make all the aforementioned magical operations play together in a convenient way.
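
If you squint, you can caricature that aggregation as a weighted blend, though the real process is nothing so tidy. All of the numbers below are placeholders I made up.

```python
# A caricature of the aggregation: three forecast probabilities of Rain,
# blended by (ineffable) credibility weights, then nudged by how the sky
# looks right now. Every number is an invented placeholder.
forecasts   = [0.60, 0.40, 0.50]   # P(Rain) from three weather reports
credibility = [0.50, 0.20, 0.30]   # how much you trust each source

p_rain = sum(p * w for p, w in zip(forecasts, credibility)) / sum(credibility)
p_rain = 0.8 * p_rain + 0.2 * 0.7  # the sky looks ominous, so nudge upward

p_sun = 1 - p_rain
print(f"P(Rain) ~ {p_rain:.2f}, P(Sun) ~ {p_sun:.2f}")
```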

Finally you perform the last step, which is an arithmetic calculation to determine the option with the highest expected value. This step does not require any magic.
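
Using the illustrative utilities and probabilities from the sketches above, the arithmetic looks like this:

```python
# Plain expected-value arithmetic over the six leaves: the one step that
# needs no magic. Probabilities and utilities come from the sketches above.
p = {"Rain": 0.56, "Sun": 0.44}
utility = {
    ("Outdoors", "Sun"): 100, ("Outdoors", "Rain"):  0,
    ("Porch",    "Sun"):  75, ("Porch",    "Rain"): 60,
    ("Indoors",  "Sun"):  40, ("Indoors",  "Rain"): 55,
}

expected_value = {
    option: sum(p[w] * utility[(option, w)] for w in p)
    for option in ("Indoors", "Porch", "Outdoors")
}

best = max(expected_value, key=expected_value.get)
print(expected_value)            # the Porch comes out ahead, at roughly 67
print(f"Set up on the {best}.")
```

With these made-up numbers the Porch wins, which matches the intuition that it hedges against both kinds of weather.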

II. Alignment by Demystifying the Magic

It was never clear to me how people ever thought we were going to build an AI out of the math of Logic and Decision Theory. There are too many places where magic would be required. Often, in proposals for such GOFAI-adjacent systems, there is a continual deferral to the notion that there will be meta-rules that govern the object-level rules, which would suffice for the magic required for situations like the simple one described above. These meta-rules never quite manage to work in practice. If they did, we would already have powerful and relatively general AI systems built out of Logic and Decision Theory, wouldn’t we? “Graphs didn’t work, we need hypergraphs. No we need meta-hypergraphs. Just one more layer of meta, bro. Just one more layer of abstraction.”

Of course, now we do have magic. Magic turned out to be deep neural networks, providing computational units flexible enough to represent any arbitrary abstraction by virtue of their simplicity, plus Attention, to help curtail the combinatorial explosion. These nicely mirror the way we humans do our magic. This might be enough; GPT-4 can do all of the magic required in the problem above. It can break the problem down into reasonable options and outcomes, assign utilities based on its understanding of likely preferences, and do the arithmetic. GPT-4, unlike any previous AI, is actually intelligent enough to do the required magic.

To review, the magical operations required to make a decision are the following:

  • Breaking the problem down into sensible discrete choices and outcomes, making reasonable assumptions to curtail the combinatorial explosion of potential choices and outcomes inherent in reality.

  • Assigning utility valuations to different outcomes, applying latent preference information to a new and never-before-seen situation.

  • Assigning probabilities to different outcomes in a situation that has no exact precedent, reflecting expectations gleaned from a black-box predictive model of the universe.

One could quibble over whether GPT-4 does these three things well, but I would contend that it does them about as well as a human, which is an important natural threshold for general competency.
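
To make that concrete, here is a hypothetical sketch of what it looks like to hand the three magical operations to a language model and keep only the arithmetic for yourself. The ask_llm function is a stand-in stub rather than any real API, and the prompts and JSON formats are inventions of mine.

```python
# Hypothetical sketch: delegate the three magical operations to a capable
# language model and keep only the final arithmetic. ask_llm is a stub,
# not a real API; the prompts and JSON formats are invented for illustration.
import json


def ask_llm(prompt: str) -> str:
    """Placeholder for a call to a capable model such as GPT-4."""
    raise NotImplementedError("wire this up to the model of your choice")


def decide(problem: str) -> str:
    # Magic 1: carve the problem into discrete options and uncertain states.
    structure = json.loads(ask_llm(
        f"{problem}\nList the sensible options and uncertain states as JSON: "
        '{"options": [...], "states": [...]}'))

    # Magic 2: a 0-100 utility for every option/state leaf.
    utilities = json.loads(ask_llm(
        f"Given {structure}, assign a 0-100 utility to each option/state pair "
        'as JSON: {"Option|State": number, ...}'))

    # Magic 3: a probability for each uncertain state.
    probabilities = json.loads(ask_llm(
        f"Given {structure} and this context: {problem}\n"
        "Estimate a probability for each state, as a JSON object mapping state to number."))

    # The non-magical step: plain expected-value arithmetic.
    expected_value = {
        option: sum(probabilities[s] * utilities[f"{option}|{s}"]
                    for s in structure["states"])
        for option in structure["options"]
    }
    return max(expected_value, key=expected_value.get)
```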

These three magical operations map onto certain parts of the AI Alignment landscape:

  • State Space Abstraction is an open problem even before we consider that we would prefer that artificial agents cleave the state spaces in ways that are intuitive and natural for humans. GPT-4 does okay at this task, though the current short context window means that it’s impossible to give the model a truly complete understanding of any situation beyond the most simplistic.

  • Value Learning was originally the idea that AI would need to be taught human values explicitly. It turns out that GPT-4 got us surprisingly far in the direction of having a model that can represent human values accurately, without necessarily being compelled to adhere to them.

  • Uncertainty Estimation is, allegedly, something that GPT-4 was better at before it was subjected to Reinforcement Learning from Human Feedback (RLHF). This makes sense; it is the sort of thing I would expect an AI to be better at than a human by default, since its expectations are stored as explicit numbers instead of inarticulable hunches. The Alignment frontier here would be getting the model to explain why it provides the probability that it does.

Obviously the Alignment landscape is bigger than what I have described here. My aim is to provide a useful mental framework for organizing the main ideas of alignment in the context of the fundamentals of decision theory and of human decision-making.

Thanks to the Guild of the Rose Decision Theory Group for providing feedback on drafts of this article.

  1. ^

    The intent of this usage is to illustrate that we don’t know how this works at anything like sufficient granularity. At best, we have very coarse conceptual models that capture small parts of the problem in narrow, cartoon scenarios.