Teaching ML to answer questions honestly instead of predicting human answers

Link post

(Note: very much work in progress, unless you want to follow along with my research you’ll probably want to wait for an improved/​simplified/​clarified algorithm.)

In this post I consider the particular problem of models learning “predict how a human would answer questions” instead of “answer questions honestly.” (A special case of the problem from Inaccessible Information.)

I describe a possible three-step approach for learning to answer questions honestly instead:

  1. Change the learning process so that it does not have a strong inductive bias towards “predict human answers,” by allowing the complexity of the honest question-answering to “pay for itself” by constraining the space of possible human-models.

  2. Introduce a bias towards the intended model by using a more complex labeling process to answer questions where a human answers incorrectly.

  3. Be really careful to avoid penalizing honest answers, by only judging comparisons between two answers where we are confident one is better than the other and getting the model to help us.

I don’t know whether this problem is a relatively unimportant special case of alignment, or one of the core difficulties. In any case, my next step will be trying to generate failure stories that definitely cannot be addressed by any of the angles of attack I know so far (including the ones in this post).

I think it’s relatively unlikely that almost anything specific I said here will really hold up over the long term, but I do think I’ve learned something about each of these steps. If the ideas end up being important then you can expect a future post with a simpler algorithm, more confidence that it works, clearer definitions, and working code.

(Thanks to Ajeya Cotra, David Krueger, and Mark Xu for discussions about this post that helped clarify it.)

The problem

Suppose that we train a model to answer questions in natural language about what will happen in the future (“Will Alice take the train home tonight?”) given a data about what’s happened in the recent past (e.g. video recorded over the last few days).

To generate training data we take the video from time t₀ to t₁ and get humans to answer questions about what happened between time t₁ and t₂. Then we do supervised learning on these (video, question, answer) triples.

The intended model has two parts: (i) a model of the world (and inference algorithm), (ii) a translation between the world-model and natural language. The intended model answers questions by translating them into the internal world-model.

We want the intended model because we think it will generalize “well.” For example, if the world model is good enough to correctly predict that someone blackmails Alice tomorrow, then we hope that the intended model will tell us about the blackmail when we ask (or at least carry on a dialog from which we can make a reasonable judgment about whether Alice is being blackmailed, in cases where there is conceptual ambiguity about terms like “blackmail”).

We want to avoid models that generalize “badly,” e.g. where the model “knows” that Alice is being blackmailed yet answers questions in a way that conceals the blackmail.

One source of generalization failures is the instrumental model, which involves: (i) a world-model similar to the intended one, (ii) knowledge about the training-data-generating process, within that world model. The instrumental model answers questions by predicting what answers the training-data-generating process would produce. This can generalize badly to any situation where the humans generating the training data are wrong (e.g. because they are being deceived or manipulated).

Why might a neural net end up learning the instrumental model? There are two big reasons:

  1. The instrumental model may be much simpler than the intended model. Both of them contain the same world model. If that world model is good enough, then it also knows how to make predictions about humans and understands a lot about the training process. From there, the instrumental policy just needs to point to a relatively simple part of the model: what gets inserted into the training set? In contrast, the intended policy needs to bake in a ton of rules for translating between the internal model and natural language. Doing this naively requires encoding all of the meaning of natural language, which has a lot of incompressible data. It’s not clear if there is a more economical way to represent the intended model (without recreating the pathologies of the instrumental model).

  2. The instrumental model is guaranteed to accurately fit the training distribution, while the intended model might disagree if the humans make a mistake. If we keep collecting training data indefinitely, sometimes make mistakes, and don’t do something clever to fix or ignore them, then we will end up converging to the instrumental model no matter how good our inductive bias is.

Aside on imitative generalization

Imitative generalization suggests a possible strategy for overcoming problem 1: learn a neural net which is only a world-model, a then use a human prior over “translations” that relate the world model to human concepts (instead of training a neural net that also does the translating, and so implicitly using the neural net prior over translations).

In this post I’m going to mostly ignore that idea and think about neural nets that implement both world-model and translation, allowing them to be integrated tightly together. That means we need a different solution to problem 1. I still think imitative generalization is part of the solution, but in the context of this particular problem I think we just need to improve the prior without changing the representation. So throughout this post you should completely ignore imitative generalization and read me as taking a more traditional perspective.

Step 1: make the intended model pay for itself

(I think this is the most confusing/​ambitious/​uncertain step.)


The intended model is complex because it needs to specify how to use natural language to talk about the world model.

The intended model loses to the instrumental model because in some sense this information was redundant — there are already humans in the world model, and knowing enough about humans to predict their behavior means the world model already knows how to ground out natural language.

Put differently, the intended model redundantly specifies (humans) and (language). Whereas the instrumental model specifies (humans) and infers (language) from (humans). This is bad because we get out a predictive model of humans, who generalize poorly.

What I want to do instead is specify (language) and then infer much of (humans) from (language). The hope is that we then get a simple model of language use, which generalizes better.

My hope in step 1 is to remove the inductive bias in favor of the instrumental model; we’re not going to introduce any kind of positive bias towards the intended model.

Rough plan

I want to consider models that learn to predict both “how a human will answer question Q” (the instrumental model) and “the real answer to question Q” (the intended model). These two models share almost all of their computation — which is dedicated to figuring out what actually happens in the world. They differ only when it comes time to actually extract the answer. I’ll describe the resulting model as having a “world model,” an “instrumental head,” and an “intended head.”

We start by collecting a high-quality dataset, in which we believe that the intended and instrumental models agree.

We’re then going to imagine filling in the parameters of our learned model in two stages:

  • In the first stage we will learn both the intended and instrumental heads, as well as a small part of the world model (but none of the facts about how humans answer questions).

  • In the second stage we will fill in the remainder of the world model (including all of the facts about how humans answer questions). In this stage, we condition on the instrumental head and the intended head agreeing on the training set.

You can think of this as a generative model where we first fill in stage 1 randomly, then we fill in stage 2 randomly but do it over and over again until we happen to get a set of parameters for which the two heads agree. Then the learning process tries to infer the parameters that were most likely to give rise to the training data.

A simplicity prior could potentially have learned this two-stage model directly. Namely, it could specify the bits of the first stage, and then it could use the “heads agree” assumption to compress the second stage. The problem is that the decompression is very expensive (it involves solving a hard optimization problem in stage 2, which is now done at test time), and so this involves learning a computationally complex model. That’s ineffective at test time, and it also means that the learning process will be slow.

(There might be some way that a simplicity prior could have achieved this compression even with test-time compute constraints, but I can’t see a way in general and so I wouldn’t want to bank on it.)

This two-stage learning algorithm fixes the test-time slowness, but learning is still computationally expensive. But I think it’s good to move the hardness out of the learned model and into the learning algorithm — that lets us treat the whole thing as an algorithmic problem, of finding an efficient way to compute the posterior for this kind of generative model. I’m quite optimistic about this problem, which I’ll discuss in a few sections (where we’ll also define the algorithm more precisely in the context of SGD over neural nets rather than random program search). But first, let’s talk about why this whole thing might work if you could actually implement it.

Why might this work?

For concreteness, let’s say that the world model requires a trillion (“N”) bits to specify, the intended head costs 10,000 bits, and the instrumental head costs 1,000 bits. If we just applied a simplicity prior directly, we expect to spend N + 1,000 bits to learn the instrumental model rather than N + 10,000 bits to learn the intended model. That’s what we want to avoid.

Under the alternative proposal, suppose we spend N/​2 + 11,000 bits in the first stage to specify half of the world model and both heads. If we didn’t do rejection sampling in the second stage we’d need to spend N/​2 more bits, for a total cost of N + 11,000 bits. That’s even worse.

However, if our dataset was rich and high-quality then we are able to save many bits in the second stage. If the dataset is high-quality then the intended model results in the two heads agreeing. But if the dataset is rich, then virtually all models result in the two heads disagreeing.

The magnitude of savings in the second stage is basically equal to the (log) fraction of possible world models that cause the instrumental head to agree with the intended head.

How common is agreement? If the instrumental head is basically reporting human answers to questions, and if the human model is basically unspecified in the first stage, then this is the fraction of human models that use natural language “correctly” on the training set. If the training set is rich enough, this is quite close to the fraction of human models that use natural language “correctly” overall.

Now we have two priors over ways to use natural language: we can either sample the intended head at random from the simplicity prior (which we’ve said has probability 2^{-10,000} of giving correct usage), or we can sample the environment dynamics from the simplicity prior and then see how humans answer questions. If those two are equally good priors, then only 2^{-10,000} of the possible humans would have correct usage, so conditioning on agreement saves us 10,000 bits.

So now we spend (N/​2 + 11,000) + (N/​2 − 10,000) bits altogether, for a total of N + 1,000. This is comparable to the instrumental model. There may be other ways to learn even more efficiently, but prima facie this suggests that our two-headed model is roughly as easy to learn as the instrumental model.

We’ve learned a model with two (symmetrical) heads. On the training set they agree, but in order to generalize well it’s important to us that we use the “right answer” rather than the “predict what a human would say.” In section 2 we’ll describe how to provide a small boost to the intended model — this both breaks the symmetry between the two heads, and allows the two-headed model to be actively preferred to the instrumental model.

This argument doesn’t work if random world models actually induce a better prior over language use than the simplicity prior. (As a silly example: consider the limit where we follow all of physics from the big bang and wait for the evolution of language + language-users to take over the world. Then the prior from a random world model seems likely to be much better.) I think those cases are totally fine when we combine this idea with iterated amplification, but I don’t want to get into that complexity here since I think it’s mostly orthogonal to the novel algorithm.

The actual algorithm

We’re trying to learn a neural network that takes some data (e.g. video) as input and has two question-answering heads. We’ll parameterize these networks by a vector of real numbers θ. We can think of SGD as doing regularized regression where the regularizer is the L2 distance from the initialization θ₀ (the only difference is that SGD linearizes the loss). Write C(θ) for the average difference between the two heads on the training set, such that we’ll want ot condition on C being small.

Then we can describe the kind of two-stage learning process as the following generative model:

Where λ is some large parameter (that needs to be finite only to keep the optimization smooth so that we can use SGD).

Basically this says that you draw θ₁ from the prior, then you draw θ₂ from the prior after conditioning on the two heads agreeing.

Write L(θ) for the log probability of the answers using the first head given parameters θ. Then the unnormalized posterior log probability of (θ₁, θ₂) is:

Intuitively: the first term L is fitting the data, the next two terms are the prior, and the fourth term C is expressing the agreement condition. The final term with the integral represents the bits saved by rejecting all the inconsistent models — if there are very few consistent models then the integral will be small and many bits will be saved.

We’re interested in either finding the parameters with maximum posterior probability (MAP) or sampling from the posterior. The first four terms are straightforward enough to handle, but naively the last term requires two nested loops of optimization — for each value of θ₁ that we consider, we need to do an expensive optimization to figure out the value of the integral.

I’ll write a longer post about this algorithmic issue in the future, but I’m actually fairly optimistic that we can do inference despite the nested integrals. The basic reason to have hope is that the optimal values of θ₂′ change slowly as a function of θ₁, so we can do alternative gradient steps instead of running SGD again to compute each update to θ₁.

Some remarks:

  • This two-stage optimization process is basically the same as the GAN situation, where you also use the assumption that the optimal discriminator changes slowly in order to avoid nested optimization. I’m optimistic that optimization in this case can be made significantly easier/​stabler and can be done while remaining competitive with traditional SGD. I think we can almost optimize this quantity by updating θ₂ by SGD as normal and updating θ₁ at each step to minimize the difference C(new θ₂) − C(old θ₂).

  • Most of the difficulty of the integral comes from the need to integrate over the entire space rather than just a neighborhood of the current parameters. But what we actually care about is explanatory power along the trajectory of SGD—we can view this as compressing the model by compressing the individual gradient descent steps, rather than trying to compress the full model. This makes me significantly more confident that there is a fast/​simple way to do what we want to do, and I think it is going to lead to much more robust algorithms. Overall I don’t feel too concerned about the algorithmic difficulty posed by this setup.

  • When differentiating the consistency test C we should treat the intended head as fixed rather than differentiating through it. This removes SGD’s incentive to achieve consistency by e.g. making sure the world is simple and so all questions have simple answers. I currently feel very uneasy about the situation. The algorithm looks a lot more likely to work when we make this change, but it makes it harder to reason about the algorithm globally and I’m very scared about the fact that we can’t do an analogous trick in the program search case. I think the best hope may be to take the gradient-compression perspective suggested in the last bullet point. Overall this may amount to a significant change in this algorithm.

  • There seems to be a hard limit on the bits we can “save” — we could always make the two heads equal by simply changing the instrumental head to be equal to the intended head (or vice versa). Even if we tried to rule this out by preventing θ₂ from messing with the intended head, I think that SGD could probably route around this (e.g. by physically implementing the intended head within the world model). So I think we should at best expect to recoup all the bits we spent specifying the intended head. I don’t believe this algorithm can introduce an inductive bias in favor of the intended model, it can just level the paying field.

Step 2: Give the intended model a leg up

At best, step 1 merely removes one inductive bias in favor of the instrumental model. But we are still left with a massively underspecified learning problem; we could end up with many different functions that all agree on the training set and have approximately equal prior probability. In order to learn the intended model, we need to somehow specify which one we really wanted. In this section I’ll describe a possible approach.

Rather than using a single labeling process, we will have two labeling processes, a “simple” one and a more reliable “complex” one. The idea is to use the “complex” labeling process to identify examples where the “simple” process errs. This hopefully rules out the “simple” labeling process as a good model of the data, leaving us with the intended model as the best one.

More precisely:

  • In the “simple” labeling process we simply ask a human (embedded in the world) to answer questions about what is happening and put it into the dataset.

  • In the “complex” data-generating process we record a bunch of data about the world (camera footage, answers from many different humans, random traces from computer programs…) and then we make our best effort to figure out the right answer. This may involve consulting ML systems for help (as in iterated amplification). We can also use this process to collect hypothetical data, by using a generative model to make predictions and then applying our best efforts to the hypothetical.

  • We have a “small” dataset where we believe the simple labeling process arrives at the correct answer.

  • We have a “large” dataset that includes the small dataset as well as at least a few cases (say 10% of the size of the small dataset) where the simple labeling process makes a mistake that is identified by the complex labeling process. This may include some hypothetical data, e.g. showcasing situations in which the simple labeling process could make an error.

  • We use the “small” dataset in order to define the consistency predicate C needed in step 1 (i.e. we condition on the two heads agreeing on the small dataset). But we use the “large” dataset in order to define the likelihood L (i.e. we condition on the first head correctly predicting the large dataset).

Why think this would work? Imagine that there are now three possible models — the intended model, the simple model (that predicts the simple labeling process), and the complex model (that predicts the complex labeling process). Suppose that these models satisfy the following properties:

  • All models agree on the small dataset. The complex model and the intended model agree on the large dataset. There are inputs in the large dataset where the simple model disagrees with the other two.

  • The simple model is simpler (i.e. higher prior under SGD) than the complex model. There are a few reasons this could happen (and we just need one), but I am most optimistic about achieving this when the complex labeling process itself uses ML systems as assistants or to generate hypothetical situations. It’s not easy for the ML system to represent or simulate another model as large as itself, and so it seems like it needs to do something more tricky (e.g. figuring out the equilibria of a game played by multiple peer models). I don’t think that it necessarily takes a huge number of bits to do this well, but I think it may well be more than the simple model. (My biggest concern is that the “simple” model may also need to use similarly-complex tricks to be computationally efficient.)

  • (For concreteness, I’ll say that the simple model takes 1,000 bits, the complex model takes 2,000 bits, and the intended model takes 10,000 bits.)

  • The small dataset is still rich enough that you could infer correct language usage from it, i.e. the consistency condition on the small dataset alone suffices to recover all 10,000 bits required to specify the intended model.

  • Agreement between the complex and simple heads on the small dataset isn’t nearly that surprising. Under typical conditions the complex labeling process will basically just defer to the simple labeling process; knowing what counts as an “error” tells you something about humans but it’s not very much relative to the complexity of natural language. (This may also be a dicey assumption.)

In the two-headed model, each of the heads could independently learn any of the (simple, complex, or intended) heads. This yields 9 models; which of them could we end up learning?

  • (intended, simple) is what we’re hoping for. It costs 11,000 bits to specify both heads, and we save 10,000 bits. So the total cost is N+1,000.

  • (intended, complex) or (intended, intended) would also be fine to learn, but cost N+2,000 or N+10,000 bits.

  • (simple, X) doesn’t match the data well, since there are datapoints in the large dataset where the simple labeling process goes wrong.

  • (complex, simple) costs at least 2,000 bits to specify. This would be a bad model to learn, but we will avoid it as long as agreement between the complex and simple models has probability at least 2^{-1,000} over a random world model. Intuitively it’s plausible that this would be much more likely than the simple model agreeing with some very specific ways of using language encoded in the intended model.

  • (complex, complex) costs 2,000 bits to specify the heads, for a total cost of N+2,000.

  • (complex, intended) costs 12,000 bits and saves 10,000 for a total cost of N+2,000. (This calculation technically rests on a further claim about the complex head being harder to specify than the simple head even conditioned on the intended head.)

So under this somewhat-heroic set of assumptions, it seems like the (intended, simple) pair would have the best prior probability amongst the models that actually fit the data. Pairs other than those 9 may achieve better losses still, and several of those assumptions could easily fail, but I think this approach is still promising and gives some traction on a tricky problem.

I think this approach basically executes on a common intuition — in order to avoid generalizing poorly because the question-answerer gets blackmailed, just add some training points involving blackmail (and make extra sure to label them correctly, e.g. by generating them hypothetically). But other implementations of this idea seem to run into the problem that the model will just learn whatever process you actually use to generate training data, and you can’t generate datapoints where that process goes wrong.

Step 3: Make the training set good enough

Steps 1 and 2 relied critically on the assumption that the training set always agrees with the intended model. If they disagree, then it doesn’t matter whether the intended model has a good prior probability, you are still going to learn the instrumental model because it has better likelihood. Even a small advantage for the instrumental model will eventually dominate in a large enough dataset.

Perfect agreement seems impossible. The question becomes: can we get close enough, and how can we cope with the errors that remain?

Rather than aiming at “perfect” labels, or trying to be robust to general labeling errors, my current hope is to try to generate labels that aren’t wrong in any way that the model can identify.

For example, suppose that a human is answering a question about how many chairs are in a room. They look at the room briefly, and fail to notice an occluded chair. Moreover, even when humans look more carefully at a video later (the “complex” labeling process), they also don’t see the occluded chair and so believe (incorrectly) that the original label was correct.

We are only concerned about the case where this bad label leads to a bad gradient update, and especially where it pushes us from the intended model towards the instrumental model. We aren’t concerned about cases where our model has no idea about the chair either, or where our model has no chance of correctly reporting the presence of the chair. I think that gives us an important leg up in avoiding the problem, if we are able to use the existing model to help make sure our data is good.

I don’t have a clean proposal for making the dataset good enough, but I do have various thoughts that leave me feeling more optimistic:

  • I don’t think we should use supervised learning. Instead, we should get pairs of answers from the model and then only add them to the dataset when we are confident that one of them is better than the other. So we are only in trouble if the intended model actually generated a good answer, but we incorrectly said we were confident that it was worse. (Note that we may still need to use likelihood for pre-training or joint training for competitiveness reasons, and we’ll need to adapt the whole scheme to that case.)

  • Moreover, we are only worried if the model understands the dynamics of the data-generating process and why there was an error, since otherwise the instrumental model won’t actually perform well either. If our model answers a rich enough set of questions, we can effectively interact with it to extract an explanation for why we erred. We just need to be able to recognize this kind of story-of-error as plausible rather than confidently dismissing them. Tons of details would be needed to make anything like this work, but overall it seems promising.

  • We might be concerned that the model will instead learn to make up complex stories about how we are wrong. And indeed this training process won’t actively penalize such models as long as they are able to make the stories plausible enough. But such models also aren’t favored, and if steps 1 and steps 2 are successful then the inductive bias will favor the intended model over this kind of instrumental model (assuming that the dataset is rich enough, even after filtering for confidence, that you can’t save a ton of description complexity by this kind of bullshitting).

  • We can take simple precautions — like “don’t collect training data in situations where strong adversaries are trying to mislead us.” Many other alignment approaches make this a bit of a circular dependency — our AI predicts well, and so can help protect us from manipulation, only as long as the AI system predicts that no manipulation would occur in the future. But in this setting we have no such dependence — in order to have the model work correctly at time t we just have to make sure we didn’t train on corrupted data at times prior to t.

  • Reducing errors seems to require doing reliability amplification (though maybe just ensembling is sufficient) and security amplification (or some other approach to being secure whenever the intended model is). Those are more like hopes than algorithms; we’d need to do a lot of work, and think a lot about concrete examples of possible errors, to find something that might actually work.

  • These bullets are all gesturing at one approach to this problem, but I think there are a ton of ways you could perform machine learning with “potentially wrong” data to prevent a small number of errors from causing trouble. This feels closer to a traditional problem in AI. I haven’t thought about this problem much because I’ve been more focused on the fear that we wouldn’t learn even with perfect data, but I feel relatively optimistic that there are a lot of approaches to take to dataset errors if that’s actually the crux of the problem.