Reply to Paul Christiano on Inaccessible Information

In “Inaccessible Information”, Paul Christiano lays out a fundamental challenge in training machine learning systems to give us insight into parts of the world that we cannot directly verify. The core problem is as follows.

Suppose we lived in a world that had invented machine learning but not Newtonian mechanics. And suppose we trained some machine learning model to predict the motion of the planets across the sky—we could do this by observing the position of the planets over, say, a few hundred days, and using this as training data for, say, a recurrent neural network. And suppose further that this worked and our training process yielded a model that output highly accurate predictions many days into the future. If all we wanted was to know the position of the planets in the sky then—good news—we’re done. But we might hope to use our model to gain some direct insight into the nature of the motion of the planets (i.e. the laws of gravity, although we wouldn’t know that this is what we were looking for).

Presumably the machine learning model has in some sense discovered Newtonian mechanics using the training data we fed it, since this is surely the most compact way to predict the position of the planets far into the future. But we certainly can’t just read off the laws of Newtonian mechanics by looking at the millions or billions or trillions of weights in the trained model. How might we extract insight into the nature of the motion of the planets from this model?

Well, we might train a model to output both predictions about the position of the planets in the sky and a natural language description of what’s really going on behind the scenes (i.e. the laws of gravity). We’re assuming that we have enough training data that the training process was already able to derive these laws, so it’s not unreasonable to train a model that also outputs such legible descriptions. But in order to train a model that outputs such legible descriptions, we need to generate a reward signal that incentivizes the right kind of legible descriptions. And herein lies the core of the problem: in this hypothesized world we do not know the true laws of Newtonian mechanics, so we cannot generate a reward signal by comparing the output of our model to ground truth during training. We might instead generate a reward signal that (1) measures how accurate the predictions of the position of the planets are, and (2) measures how succinct and plausible the legible descriptions are. But then what we are really training is a model that is good at producing succinct descriptions that seem plausible to humans. This may be a very different (and dangerous) thing to do, since there are lots of ways that a description can seem plausible to a human while being quite divorced from the truth.

Christiano calls this the instrumental policy, that is, the policy that produces succinct descriptions that merely seem plausible to humans:

The real problem comes from what I’ll call the instrumental policy. Let’s say we’ve tried to dream up a loss function L(x, y) to incentivize the model to correctly answer information we can check, and give at least plausible and consistent answers on things we can’t check. By definition, the values L(x, y) are themselves accessible. Then it’s natural to learn a policy like: “on input x, produce the output y for which the loss L(x, y) will be minimal.” Let’s write BAD for this policy.
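The policy quoted above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration and none of it comes from Christiano's write-up: the candidate explanations, the made-up plausibility scores, the stand-in predictor `model_prediction`, and the additive form of the loss. It only shows how a policy that minimizes an accessible loss can prefer a plausible-sounding explanation over the true one while predicting equally well.

```python
# Hypothetical plausibility scores in [0, 1] that a human evaluator might assign.
# The explanations and numbers are invented for this sketch.
PLAUSIBILITY = {
    "planets ride on nested crystal spheres": 0.9,
    "inverse-square attraction toward the sun": 0.6,
    "planets move at random": 0.1,
}

# In this toy world this is the true explanation, but the evaluator cannot check it.
TRUE_EXPLANATION = "inverse-square attraction toward the sun"


def model_prediction(x):
    """Stand-in for the learned predictor (an assumption, not a real model)."""
    return 2.0 * x


def loss(x, y, observed):
    """Accessible loss L(x, y): prediction error plus implausibility of the explanation."""
    prediction, explanation = y
    prediction_error = abs(prediction - observed)
    implausibility = 1.0 - PLAUSIBILITY[explanation]
    return prediction_error + implausibility


def intended_policy(x):
    """Report the prediction together with what the model actually 'believes'."""
    return (model_prediction(x), TRUE_EXPLANATION)


def bad_policy(x):
    """Instrumental policy: output whichever y is expected to minimize L(x, y)."""
    forecast = model_prediction(x)  # the policy's own guess at the later observation
    candidates = [(forecast, explanation) for explanation in PLAUSIBILITY]
    return min(candidates, key=lambda y: loss(x, y, forecast))


x, observed = 3.0, 6.0
y_intended = intended_policy(x)
y_bad = bad_policy(x)
print("intended:", y_intended, "loss =", loss(x, y_intended, observed))
print("BAD:     ", y_bad, "loss =", loss(x, y_bad, observed))
```

Both policies make the same prediction, so the prediction term cannot tell them apart. The only difference is the explanation, and on that term the loss rewards BAD for choosing whatever the evaluator finds most plausible.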

Christiano uses the term “inaccessible information” for information like the laws of gravity in this example: information about the underlying nature of things that a machine learning model might learn quite accurately as latent information in service of making predictions, but that is difficult to extract directly because it is hard to construct a reward signal that correctly incentivizes the model to explain it clearly and honestly.

We want models that tell the truth, but we don’t know how to recognize honest models in cases where we can’t simply compare what they say to an independently verifiable reality during training.


Christiano gives the following examples of accessible vs. inaccessible information:

  • “What will Alice say?” vs. “What is Alice thinking?”

  • “What’s on my financial statement?” vs. “How much money do I really have?”

  • “Am I coughing?” vs. “What’s happening with my immune system?”

  • “How will senators vote?” vs. “What’s the state of political alliances and agreements in the senate?”

  • “What do I see on my computer screen?” vs. “Is my computer compromised?”

  • “What’s the market price of this company?” vs. “How valuable is this IP really?”

  • “Will the machine break tomorrow?” vs. “Is there hard-to-observe damage in this component?”

  • “What does the news show me from 5000 miles away?” vs. “What’s actually happening 5000 miles away?”

  • “Is this argument convincing?” vs. “Is this argument correct?”

  • “What will happen tomorrow?” vs. “What will happen in a year?” (depending on whether models transfer to long horizons)

In each case it is relatively straightforward to train a machine learning model to predict the accessible information (“what will Alice say?”) since we can generate a reward signal by comparing the model’s predictions to ground truth during training. But training a model to predict inaccessible information (“what is Alice thinking?”) is problematic since we have no reliable way to produce a reward signal.

The fundamental dilemma of black box learning

In machine learning we use the following basic approach to developing intelligent systems:

  1. Define a hypothesis space containing a sufficiently broad range of policies that at least one policy in the hypothesis space probably meets our design criteria

  2. Set up an evaluation procedure that measures the extent to which any specific policy meets our design criteria

  3. Search the hypothesis space for a policy that the evaluation procedure ranks highly

This is a very unusual design procedure. It is very different from, for example, the way a set of chopsticks or a microwave or an air conditioner is designed. It would be surprising to visit a chopstick factory and discover that one part of the factory was producing chopsticks of various shapes and sizes and a completely separate part of the factory was evaluating each one and providing only a narrow “reward signal” in return.

But in machine learning this design procedure has proven powerful and compelling. It is often easier to specify a reasonable evaluation procedure than to find a design from first principles. For example, suppose we wish to design a computer program that correctly discriminates between pictures of cats and pictures of dogs. To do this, we can set up an evaluation procedure that uses a data set of hand-labelled pictures of cats and dogs, and then use machine learning to search for a policy that correctly labels them. In contrast we do not at present know how to design an algorithm from first principles that does the same thing. There are many, many problems where it is easier to recognize a good solution than to design a good solution from scratch, and for this reason machine learning has proven very useful across many parts of the economy.
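The three steps listed earlier can be sketched on a toy version of the cat-versus-dog task. This is a minimal illustration under loud assumptions: each “picture” is reduced to a single made-up numeric feature, the hypothesis space is a family of threshold rules, and the search is exhaustive. None of these choices comes from the original discussion, and real systems use far larger hypothesis spaces and gradient-based search.

```python
import random


def make_policy(threshold):
    """Hypothetical one-parameter policy: label 'cat' if the feature is below threshold."""
    return lambda feature: "cat" if feature < threshold else "dog"


# Step 1: define a hypothesis space (here, 101 candidate thresholds from 0.00 to 1.00).
hypothesis_space = [i / 100 for i in range(101)]

# Toy hand-labelled data set: by construction, every example below 0.4 is a cat.
rng = random.Random(0)
features = [rng.random() for _ in range(200)]
dataset = [(f, "cat" if f < 0.4 else "dog") for f in features]


# Step 2: an evaluation procedure that scores any policy by its labelling accuracy.
def evaluate(policy):
    correct = sum(policy(feature) == label for feature, label in dataset)
    return correct / len(dataset)


# Step 3: search the hypothesis space for the policy the evaluation ranks highest.
best_threshold = max(hypothesis_space, key=lambda t: evaluate(make_policy(t)))
best_score = evaluate(make_policy(best_threshold))

print("best threshold:", best_threshold)
print("accuracy:", best_score)
```

The search never looks inside a policy. It sees only the number returned by `evaluate`, which is the sense in which the procedure is black box.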

But when we build sophisticated systems, the evaluation problem becomes very difficult. Christiano’s write-up explores the difficulty of evaluating whether a model is honest when all we can do is provide inputs to the model and observe outputs.

To really understand whether a model is honest, we need to look inside the model and understand how it works. We need to somehow see the gears of its internal cognition in a way that lets us see clearly that it is running an algorithm that honestly looks at data from the world, honestly searches for a succinct explanation, and honestly outputs that explanation in a legible form. Christiano says as much:

If we were able to actually understand something about what the policy was doing, even crudely, it might let us discriminate between instrumental and intended behavior. I don’t think we have any concrete proposals for how to understand what the policy is doing well enough to make this distinction, or how to integrate it into training. But I also don’t think we have a clear sense of the obstructions, and I think there are various obvious obstructions to interpretability in general that don’t apply to this approach.

It seems to me that Christiano’s write-up is a fairly general and compelling knock-down of the black-box approach to design in which we build an evaluation procedure and then rely on search to find a policy that our evaluation procedure ranks highly. Christiano is pointing out a general pitfall we will run into if we take this approach.

Hope and despair

I was surprised to see Christiano make the following reference to MIRI’s perspective on this problem:

I would describe MIRI’s approach to this problem [...] as despair + hope you can find some other way to produce powerful AI.

Yes, it’s true that much of MIRI’s research is about finding a solution to the design problem for intelligent systems that does not rest on a blind search for policies that satisfy some evaluation procedure. But it seems strange to describe this approach as “hope you can find some other way to produce powerful AI”, as though we know of no approach to engineering sophisticated systems other than search. In fact, the vast majority of the day-to-day systems that we use in our lives have been constructed via design: airplanes, toothbrushes, cellphones, railroads, microwaves, ballpoint pens, solar panels. All these systems were engineered via first-principles design, perhaps using search for certain subcomponents in some cases, but certainly not using end-to-end search. It is the search approach that is new and unusual, and while it has proven powerful and useful in the development of certain intelligent systems, we should not for a moment think of it as the only game in town.