Opinion merging for AI control

Thanks to Justis for proofreading and feedback.

This is a simplified follow-up to my post on “mimics”. I think the basic idea there is valuable and still does not appear to be widely appreciated, so I’m trying to explain it again.

Introduction: the misjudgement problem

Suppose I try to make a very powerful AI system, and I plan to ask it to make me rich. If it works well, this seems like it would be a great outcome to me, but maybe I’m wrong about this, in which case developing this system might be a bad decision on my part.

Assume I make a probabilistic forecast about possible futures. My forecast can be illustrated as follows:

For the moment, let’s assume this is a naive forecast; I’m not trying very hard to account for the fact that I’m delegating to an AI system to make me rich, I’m just imagining what it would be like to be rich in the future on base rates[1]. It looks good, and I think I can pull off the AI-make-me-rich project, so I go for it. However, Cassandra, who is much smarter and better calibrated than me, has a different view. She has taken into account the fact that I’m delegating to an AI system to do the job, and she sees a different set of possible outcomes:

On the far left we have “specification gaming”: the AI makes me rich, but my life actually gets substantially worse as a result. On the far right, we have “goal misgeneralisation”: the AI learned how to make money, but not to make money for me. Cassandra advises me not to pursue my project (but for some reason I don’t listen).

So the basic story is:

  • I forecast the consequences of deploying my AI according to some naive scheme, and they look good so I go for it

  • If I had much better forecasting ability, I would have seen that this was actually a bad idea

I’m going to call this the misjudgement problem. It is far from the only problem with developing advanced AI, but it seems like one of the bigger ones to me[2]. The topic of this post is how merging of opinions might be used to address this problem.

Addressing the misjudgement problem with opinion merging

Background

For this post, I will be talking about generative AI systems that generate output without side-effects (in my previous post, I called such systems “mimics”).

Suppose I have a text generating AI system. Anywhere I would normally create my own text (broadly defined—including code, messages, filling out forms etc.), I can substitute my AI system. My first assumption is that the only way to distinguish the consequences achieved by running the system versus by doing it myself is via the generated text. That is, if I were to generate a text and the system were to generate exactly the same text, then in either case exactly the same consequences would transpire[3].

This assumption seems roughly correct for today’s text generators. It does not seem like it matters much for the overall outcome if I write these exact words myself, or if ChatGPT happened to generate them (supposing, at least, that no-one knew ChatGPT was responsible in the latter case). On the other hand, this assumption might not be true for a future system that might conduct experiments in order to generate the text we want—the experiments might give rise to side effects.[4]

This means that I can turn the problem of forecasting the consequences of delegating to the AI into two subproblems:

  1. Forecasting the text that it produces

  2. Forecasting the consequences of producing that text

The second assumption is that the generative system creates output by sampling from its own internal probability distribution. Thus problem 1 can be reduced to the problem of analysing the AI’s own internal probability distribution. For problem 2, we may still do well to consult Cassandra (an unfortunately fictional entity). However, if all we want is a piece of text with particular properties, we may set problem 2 aside. Let’s do so for now.

A simpler problem: generate the right kind of text

Instead of trying to control AIs that are achieving goals “in the real world”, we’re going to worry about the simpler problem of generating texts with particular features. In particular, this problem:

I have a text generator which is some arbitrary machine learning system (it may be something quite different to a pretrained language model). Suppose that I also have a general intuition about whether a text I’m reading is true or false. Suppose I’m really good at this when I’m reading normal text: I have a rate of 1% false positives and 1% false negatives. I want my text generator to produce text that makes true statements.

Suppose I try to fine-tune my generator against my intuitive sense of true and false (actually overseeing this would be too demanding, but let’s leave that aside for now). I may get many different outputs from this tuned system:

  • Outputs that are diverse and 99% true; this is exactly what I want

  • Outputs that are 99% true but trivial

  • Outputs that are frequently false, but I think they are true

  • Outputs that aren’t true and I don’t even think they’re true

These problems are rather similar to the original problem: true but trivial, or false but good according to the truth-detector, are examples of specification gaming, and outputs that don’t even score well are (arguably) a case of goal misgeneralisation. The original problems arose due to my prior disagreements with Cassandra, but thanks to the simplifying assumptions we can trace these problems to disagreements with the generative model:

A solution to this problem that would obviously work is to make “texts I think are possible” and “texts my generator might produce” identical as probability distributions. Then, if conditioning my text distribution on “seeming true” usually produces true text, so will conditioning my generator on “seeming true”. It is not, however, easy to see how to create a generative model that matches my distribution over texts; such an endeavour would take science much more advanced than we have today. Fortunately, I may not actually need to build a generator with an identical distribution directly.

Opinion merging to the rescue

There’s a theorem called “merging of opinions” which says that, given two agents with different initial distributions over the same set of outcomes, if both assign positive probability to exactly the same events (their distributions are mutually dominated) and both observe the same sequence of observations, then their predictions will eventually agree (with probability 1, according to both agents). Now, suppose my own distribution over texts is formed by:

  • Starting with some very broad guesses over what might or might not be meaningful text (my “infant brain”)

  • Gradually refining this into an actual understanding of how to write by learning from text, speech etc.

Pretrained language models are also produced in a very similar manner:

  • They begin with very broad guesses about what might or might not be meaningful text (the randomly initialised transformer)

  • Gradually, they refine this into an understanding of how to write from learning on a huge amount of text scraped from the internet (pretraining)

Thus, in order to produce a generative model that agrees with my distribution over text, it may be sufficient to initialise a neural net that agrees with my infant brain about what kind of “text” might possibly be meaningful, even if these two structures disagree substantially about what kind of text is likely to be meaningful. Diagrammatically, it might look like this:
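
To give a concrete feel for the merging phenomenon itself, here is a minimal toy sketch (my own illustration, not part of the original argument): two Bayesian agents start with different priors over a coin’s bias, but both priors put positive weight on every candidate bias on a shared grid, so each dominates the other. Watching the same flips, their predictions converge.

```python
# A minimal sketch of merging of opinions: two agents with different priors
# over a coin's bias, both with full support on the same grid (so mutually
# dominated), update on the same flips and end up predicting alike.
import random

random.seed(0)

GRID = [i / 100 for i in range(1, 100)]  # candidate biases, shared support

def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]

# Agent A: roughly uniform prior. Agent B: prior skewed towards low biases.
prior_a = normalise([1.0 for p in GRID])
prior_b = normalise([(1 - p) ** 3 for p in GRID])

def predict(posterior):
    """Predictive probability that the next flip is heads."""
    return sum(w * p for w, p in zip(posterior, GRID))

def update(posterior, flip):
    """Bayesian update on one observed flip (1 = heads, 0 = tails)."""
    likelihood = [p if flip else (1 - p) for p in GRID]
    return normalise([w * l for w, l in zip(posterior, likelihood)])

true_bias = 0.7
post_a, post_b = prior_a, prior_b
for n in range(1, 2001):
    flip = 1 if random.random() < true_bias else 0
    post_a, post_b = update(post_a, flip), update(post_b, flip)
    if n in (1, 10, 100, 1000, 2000):
        print(f"n={n:4d}  A predicts {predict(post_a):.3f}  B predicts {predict(post_b):.3f}")
```

The only thing the sketch is meant to illustrate is the shared-support condition: if agent B’s prior had assigned zero weight to biases above 0.5, no amount of shared data would guarantee agreement.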

Mutually dominated priors still seem hard

This seems fine if we can get it to work, but can we? The key requirements for opinion merging are:

  • Priors are mutually dominated—that is, my machine assigns positive probability to precisely the same things that I do

  • Sufficient data

How can I ensure that a machine learning system’s prior and my infant brain’s prior mutually dominate each other? This might be an easier problem than precisely emulating the brain of an adult human, but it still seems quite difficult. A range of possibilities:

  • It’s actually easy: if we come up with any arbitrary architecture that’s vaguely inspired by the brain and pick a learning rule that seems to empirically deliver good results with this architecture, this turns out to be close enough to an actual brain for the main result to hold

  • It’s not too difficult: the architecture needs to be a rough approximation of an actual brain, and the learning rule a rough approximation of the brain’s learning, but today’s science is good enough to deliver the necessary approximations

  • It’s difficult: the architecture and/or learning rule used needs to be a much better approximation than today’s science can deliver

We know the first option here isn’t right. We don’t achieve precise merging of opinions with modern language models—language models are much better than people at next word prediction, while at the same time they’re worse at producing long, correct & coherent texts compared to skilled writers.

However, a lack of precise merging does not defeat the overall scheme. Even if my generator doesn’t agree with me about the exact choice of words, we might still agree on the relevant textual features. That’s the subject of the next section.

Convergence over higher level features

A “feature” of a text is something we can compute from that text. We have assumed that I have a general sense of the truth of a piece of text, so “seems right” is a feature of a piece of text. In addition, suppose that “is actually true” is also a feature of a piece of text (suppose that our dataset only contains texts where this property is unambiguous; we could implement the function by hardcoding the relevant facts about the world and comparing what the text says to them).

From a distribution over texts, we can compute a distribution over features. We can take a distribution over texts and pass every text t through the function that sends t to the pair (my opinion(t), truth(t)), and thereby construct a distribution over these features.

The key point here is: even if my generator’s prior over text is substantially broader than my own, it may be rather easy for my prior over features to dominate the generator’s prior. This is because the feature space is very simple: it is just the four possible pairs of values of (seems right, actually true), and priors over distributions on this space can also be relatively simple; it is easy to specify one with full support. In order to learn a decent understanding of language we either have to do whatever GPT-4 does or do whatever a person does—which, whatever they are, are not simple at all. Learning a joint distribution over a pair of binary variables is a much easier task.
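
As an illustration of how little machinery this takes, here is a toy sketch (the feature names, pseudo-counts and observations are my own invented placeholders): a symmetric Dirichlet-style prior over the four joint outcomes of (seems right, actually true) has full support over this feature space and can be updated by simple counting.

```python
# A minimal sketch, assuming the feature space really is just pairs of binary
# values (seems_right, actually_true). A symmetric Dirichlet prior over the
# four joint outcomes gives positive density to every possible joint
# distribution, so a full-support prior at the feature level is trivial.
from collections import Counter

OUTCOMES = [(sr, tr) for sr in (True, False) for tr in (True, False)]

# One pseudo-count per cell: the uniform Dirichlet(1, 1, 1, 1) prior.
PSEUDO = {o: 1.0 for o in OUTCOMES}

def posterior_mean(observations):
    """Posterior mean of the joint feature distribution after observed pairs."""
    counts = Counter(observations)
    total = sum(PSEUDO.values()) + len(observations)
    return {o: (PSEUDO[o] + counts[o]) / total for o in OUTCOMES}

# Hypothetical labelled (seems_right, actually_true) pairs from generated texts.
observed = [(True, True)] * 80 + [(True, False)] * 10 + [(False, False)] * 10
for outcome, prob in posterior_mean(observed).items():
    print(outcome, round(prob, 3))
```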

On the other hand, it might be rather difficult for the generator to achieve full support over the feature space, because it actually has to generate texts with the relevant features, not just learn the feature distributions. For example, the generator might never be able to reliably generate true text, and so its prior over the feature space might lack support for distributions which put a very high probability on truth. This arguably happens in practice—LLM text generators are known to make up nonsense where it is genre-inappropriate to do so, and for someone unaccustomed to LLMs, they do this surprisingly often[5].

In this regime—where the feature space is simple—the problem of ensuring mutual domination seems a lot easier. It doesn’t matter if the generator has a much broader prior over text than me, or if it can learn to generate text much better than me, because I can always pick a prior that covers the whole space of features. I basically want to make the generator as good as possible, and don’t need to worry at all about whether this should go via brain-inspired architectures or something else.[6]

In practice, we’re often interested in somewhat more complicated feature spaces than this. For example, the Conditioning Predictive Models program hopes to find some rather complicated set of conditions such that a text generator outputs a plan to create an aligned superintelligence. Even for less ambitious purposes, we don’t just want the generator to produce something true—we probably want it to be clearly written, about a particular topic, perhaps having a particular title and so forth.

With more complicated feature distributions, I basically think that a better theory is needed than the crude one sketched here. We have two basic ideas:

  • My text distribution might (approximately?) merge with my generator’s, given shared training data, but setting up the right conditions for precise merging seems difficult

  • I can learn feature distributions “from scratch”, and have an advantage over the generator if this is an easier problem than learning text distributions, but if the feature distribution learning problem is itself difficult then I might not come close to convergence

Is there any intermediate between these learning regimes? That is, can my learned text distribution boost feature distribution learning, without making the feature distribution entirely a derivative of the text distribution? This is probably true to some extent, but the crude theory sketched here seems quite far from being able to answer whether we can ever get useful guarantees in the regime of complicated feature distributions.

Do we actually need the generator’s opinion to merge with mine?

Modelling the joint distribution of “seems right” and “actually true” need not rely on my native ability to understand text. While my input is required for the evaluation of “seems right”, if we could additionally get enough evaluations of whether a piece of text is actually true, I could model the joint distribution using explicit Bayesian inference. If I just wanted P(true|seems right), supervised learning would also work. If I ultimately just want to get the generator to produce true text, there’s no clear reason to use my assessment of “seems right” to deliver this outcome; a supervised learner’s truth signal might be even better here.
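
A minimal sketch of that alternative (the prior and the labels below are invented for illustration): given ground-truth evaluations of texts that seemed right to me, a Beta posterior over P(true | seems right) needs no agreement between my text distribution and the generator’s at all.

```python
# A minimal sketch of estimating P(actually true | seems right) directly from
# labelled evaluations, with a Beta(1, 1) prior. The estimate comes entirely
# from the explicit truth labels (invented here for illustration).

def p_true_given_seems_right(labels, prior_true=1.0, prior_false=1.0):
    """Posterior mean of P(true | seems right), given truth labels (1/0)
    for texts that seemed right."""
    n_true = sum(labels)
    return (prior_true + n_true) / (prior_true + prior_false + len(labels))

# Hypothetical ground-truth labels for 50 generated texts that seemed right.
labels = [1] * 44 + [0] * 6
print(f"P(true | seems right) ≈ {p_true_given_seems_right(labels):.2f}")
```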

Some reasons I might nevertheless want guarantees with respect to my native understanding of text:

  • I might often make decisions that depend on my native understanding in ways that I can’t easily distil into an explicit model. Some authors have argued that “moral side constraints” may be like this—for example, I might want my generator to respect the side constraint of avoiding doing me psychological harm, but I might assume that this outcome is so unlikely according to my native understanding of feature distributions that I do not consider it worth explicitly penalising in my fine-tuning setup. It would be nice to know that my tacit assumption that this outcome is unlikely is actually a good reason to think it is unlikely, and this seems to require some agreement between the model’s generating distribution and my native understanding of text

  • I might have a native understanding of some textual features for which it is difficult to generate explicit data. For example, I might have some notion of whether a text is “actually true” or “beneficial” that influences my decision making but that I cannot accurately verbally report (I think this possibility is rather speculative, to be clear)

  • Efficiency: collecting data and training models can be laborious, so it’s handy to be able to use what I already know

Aside: feature distributions and “character simulations”

There is a connection between the idea of feature distributions and the idea that LLMs simulate different characters. Suppose that in practice, we often understand text partly by inferring things about the person who’s writing it (I do this). In particular, suppose we often model various textual features as depending on the character who wrote it in a manner like this:

We might then use features 1 and 2 to infer the character writing the text and make predictions about feature 3 from this. If our opinion merges with the generator’s with respect to the distribution of features 1, 2 and 3, then this kind of trick should work just as well with generated text as with real text.
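
A toy version of this trick (the characters and probability tables are invented placeholders): condition on a latent “character” variable, infer it from features 1 and 2, and use the posterior to predict feature 3.

```python
# A minimal sketch of the character-inference trick, with made-up characters
# and probability tables. Features are conditionally independent given the
# character; we infer the character from features 1 and 2, then predict 3.

CHARACTERS = {
    # name: (prior, P(f1=1|char), P(f2=1|char), P(f3=1|char))
    "careful_expert":    (0.3, 0.9, 0.8, 0.9),
    "confident_bluffer": (0.7, 0.9, 0.2, 0.3),
}

def predict_feature3(f1, f2):
    """P(feature3 = 1 | feature1, feature2), marginalising over the character."""
    joint = {}
    for name, (prior, p1, p2, _) in CHARACTERS.items():
        like1 = p1 if f1 else 1 - p1
        like2 = p2 if f2 else 1 - p2
        joint[name] = prior * like1 * like2
    total = sum(joint.values())
    posterior = {name: w / total for name, w in joint.items()}
    return sum(posterior[name] * CHARACTERS[name][3] for name in CHARACTERS)

# Seeing features 1 and 2 together shifts belief towards the expert character,
# which raises the predicted probability of feature 3.
print(round(predict_feature3(1, 1), 3))
print(round(predict_feature3(1, 0), 3))
```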

Returning to the full misjudgement problem

Really, we don’t just want to generate text with particular features; we want to change the world. In this more general case, the original problems might return. The key issue is the distinction between causation and correlation. I want to train my agent to take actions to influence the future in desired directions, so I have it run experiments and so forth to learn about the causal consequences of its actions. However, my learned ideas about what I do and don’t like probably come mostly from observational data, so my naive forecasts about the consequences of asking my agent to do something and more sophisticated forecasts of those consequences might diverge substantially. Thus, even if I had a nice solution to the problem of ensuring my textual feature distributions merge with my generator’s, it wouldn’t be a full solution to the misjudgement problem (though it might help). I discussed this issue somewhat more in my older post (this short paragraph is a poor explanation of the argument made there).

Conclusion

I think one of the major sources of risk from the development of advanced AI is people making bad decisions: someone thinks the risks of developing a particular system are acceptably low and the benefits large, so they go ahead and develop it, but they turn out to be mistaken. I think opinion merging is exciting because it plausibly addresses this risk directly, and because pretraining promotes opinion merging and just happens to be used in our most advanced AI systems today. There are many difficult problems between where we stand and actually making use of this technique to develop safe and advanced AI. For example:

  • When people make important decisions, are they doing something close enough to Bayesian decision theory for probabilistic results like merging of opinions to apply?

  • When does approximate merging of opinions give regret bounds? That is, if I approximately agree with my generator about some feature distribution, when can I turn this into a bound on how much worse the expected outcome is according to the generator’s distribution compared to my distribution?

  • Can we evaluate the required antecedents for practical AI systems? That is, can we check that “our” priors and our system’s priors are close to mutually dominated, and that enough data is available to yield regret bounds?

  • Can we address the problem of divergence in opinions due to the difference between causation and correlation?

  • Can we find ways to avoid deceptive alignment where we need to do so?

I am not particularly confident that these questions can all be answered in the affirmative, but I think the chance of overall success is comparable to other streams of alignment research.

  1. ^

    Even if I was trying, forecasting exactly what a strong AI might do is difficult, so I might not do too much better than the naive attempt anyway

  2. ^

    I am not addressing the possible problem that people will make sound estimates of AI risks, those risks will be high, and people will pursue them anyway because of the possible upsides or for reasons that can be understood via game theory; nor am I addressing risks of outcomes where some benefit from AI and others do not.

  3. ^

    We could say: with respect to Cassandra’s forecasts, the set of future consequences are independent of whoever generated the text after conditioning on the generated text.

  4. ^

    Current systems do have side-effects too. Justis suggested: “company X running the LLM gains vs. loses money, some server rack somewhere gets hotter vs. doesn’t get hotter [...]”. I think this is true, and for many purposes I think it might be reasonable to neglect these side effects. Side effects are an additional issue we have to take into account, which is why I made the assumption explicit.

  5. ^

    When my Dad discovered GPT-3 he asked it some questions about economics and spent a day trying to track down the references it gave. In the end, he reached out to me to ask if I could help him find them. They were, of course, completely made up.

  6. ^

    A potentially serious caveat here is that deceptive alignment could mean that continuing to do something which seems to make the generator better—adding more parameters and more data—may suddenly make it much worse. But aside from deceptive alignment, it seems we can’t lose by making the generator better.
