Summary for the Alignment Newsletter (also includes a summary for Learning the prior):

Any machine learning algorithm (including neural nets) has some inductive bias, which can be thought of as its “prior” over what the data it will receive will look like. In the case of neural nets (and any other general ML algorithm to date), this prior is significantly worse than human priors, since it does not encode e.g. causal reasoning or logic. Even if we restrict ourselves to priors that do not depend on previously seen data, we would still want to update on facts like “I think, therefore I am”. With a better prior, our ML models would be able to learn more sample-efficiently. While this is so far a capabilities problem, there are two main ways in which it affects alignment.

First, as argued in <@Inaccessible information@>, the regular neural net prior will learn models which can predict accessible information. However, our goals depend on inaccessible information, and so we would have to do some “extra work” to extract the inaccessible information from the learned models in order to build agents that do what we want. This leads to a competitiveness hit, relative to agents whose goals depend only on accessible information, and so during training we might expect to consistently get agents whose goals depend on accessible information instead of the goals we actually want.

Second, since the regular neural net prior is so weak, there is an incentive to learn a better prior, and then have that better prior perform the task. This is effectively an incentive for the neural net to learn a <@mesa optimizer@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@), which need not be aligned with us, and so would generalize differently than we would, potentially catastrophically.

Let’s formalize this a bit more. We have some evidence about the world, given by a dataset D = {(x1, y1), (x2, y2), …} (we assume that it’s a prediction task—note that most self-supervised tasks can be written in this form). We will later need to make predictions on the dataset D* = {x1*, x2*, …}, which may be from a “different distribution” than D (e.g. D might be about the past, while D* is about the future). We would like to use D to learn some object Z that serves as a “prior”, such that we can then use Z to make good predictions on D*.

The standard approach, which we might call the “neural net prior”, is to train a model to predict y from x using the dataset D, and then apply that model directly to D*, hoping that it transfers correctly. We can inject some human knowledge by finetuning the model using human predictions on D*, that is, by training the model on {(x1*, H(x1*)), (x2*, H(x2*)), …}. However, this does not allow H to update their prior based on the dataset D. (We assume that H cannot simply read through all of D, since D is massive.)
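As a toy sketch of this baseline, the following uses a linear least-squares fit as a stand-in for the neural net, and a hypothetical oracle as the “human” H; none of these specifics come from the post, they just make the train-on-D, finetune-on-H(x*) structure concrete:

```python
import numpy as np

def fit(xs, ys):
    """Fit y ~ w * x + b by least squares (stand-in for training a net)."""
    A = np.stack([xs, np.ones_like(xs)], axis=1)
    w, b = np.linalg.lstsq(A, ys, rcond=None)[0]
    return w, b

def predict(model, xs):
    w, b = model
    return w * xs + b

# Dataset D (e.g. "the past"): here the trend is exactly y = 2x.
xs_D = np.array([0.0, 1.0, 2.0, 3.0])
ys_D = 2.0 * xs_D

model = fit(xs_D, ys_D)                 # train on D
xs_Dstar = np.array([10.0, 11.0])       # D*, a shifted distribution
naive_preds = predict(model, xs_Dstar)  # hope the model transfers

# Inject human knowledge: finetune on {(x*, H(x*))}. The "human" H is a
# hypothetical oracle who knows the trend has shifted to y = 2x + 5.
human_labels = 2.0 * xs_Dstar + 5.0
finetuned = fit(xs_Dstar, human_labels)
finetuned_preds = predict(finetuned, xs_Dstar)
```

The finetuned model matches H on D*, but nothing in this setup lets H revise their own beliefs using D, which is the gap the rest of the post addresses.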

What we’d really like is some way to get the predictions H would make if they could update on dataset D. For H, we’ll imagine that a prior Z is given by some text describing e.g. rules of logic, how to extrapolate trends, some background facts about the world, empirical estimates of key quantities, etc. I’m now going to talk about priors over the prior Z, so to avoid confusion I’ll now call an individual Z a “background model”.

The key idea here is to structure the reasoning in a particular way: H has a prior over background models Z, and then given Z, H’s predictions for any given x_i are independent of all the other (x, y) pairs. In other words, once you’ve fixed your background model of the world, your prediction of y_i doesn’t depend on the value of y_j for some other x_j. Or to explain it a third way, this is like having a set of hypotheses {Z}, and then updating on each element of D one by one using Bayes’ rule. In that case, the log posterior of a particular background model Z is given by log Prior(Z) + sum_i log P(y_i | x_i, Z) (neglecting a normalization constant).
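The log-posterior sum can be made concrete with a small sketch. The two toy background models and their likelihoods below are hypothetical, chosen only to illustrate the decomposition log Prior(Z) + sum_i log P(y_i | x_i, Z):

```python
import math

def log_posterior(log_prior, log_likelihood, Z, D):
    """log Prior(Z) + sum_i log P(y_i | x_i, Z), neglecting normalization."""
    return log_prior(Z) + sum(log_likelihood(y, x, Z) for x, y in D)

# Two toy background models: "the trend continues" vs "the trend reverses".
log_prior = lambda Z: math.log({"continues": 0.7, "reverses": 0.3}[Z])

def log_likelihood(y, x, Z):
    # Under "continues", y = x is likely; under "reverses", y = -x is.
    predicted = x if Z == "continues" else -x
    return math.log(0.9 if y == predicted else 0.1)

D = [(1, 1), (2, 2), (3, 3)]  # data consistent with "continues"
scores = {Z: log_posterior(log_prior, log_likelihood, Z, D)
          for Z in ("continues", "reverses")}
best_Z = max(scores, key=scores.get)  # the MAP background model
```

Because each term in the sum depends only on (x_i, y_i) and Z, the whole posterior factors into pieces that can be evaluated one data point at a time.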

The nice thing about this is that the individual terms Prior(Z) and P(y_i | x_i, Z) are all things that humans can evaluate, since they don’t require the human to look at the entire dataset D. In particular, we can learn Prior(Z) by presenting humans with a background model, and having them evaluate how likely it is that the background model is accurate. Similarly, P(y_i | x_i, Z) simply requires us to have humans predict y_i under the assumption that the background facts in Z are accurate. So, we can learn models for both of these using neural nets. We can then find the best background model Z* by optimizing the equation above, representing what H would think was the most likely background model after updating on all of D. We can then learn a model for P(y*_i | x*_i, Z*) by training on human predictions of y*_i given access to Z*.
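An end-to-end sketch of this pipeline might look as follows. Every “learned” function here is a hypothetical stub standing in for a neural net trained on human evaluations, and the candidate background models are invented for illustration; a real system would learn all of these from data:

```python
import math

CANDIDATE_ZS = ["prices rise 10%/yr", "prices are flat"]

def learned_log_prior(Z):
    # Stub for a net trained on human judgments of how plausible Z is.
    return math.log({"prices rise 10%/yr": 0.4, "prices are flat": 0.6}[Z])

def learned_log_likelihood(y, x, Z):
    # Stub for a net trained on human predictions of y given x, assuming Z.
    predicted = x * 1.1 if Z == "prices rise 10%/yr" else x
    return -abs(y - predicted)  # crude log-likelihood proxy

def best_background_model(D):
    # Optimize log Prior(Z) + sum_i log P(y_i | x_i, Z) over candidates.
    score = lambda Z: learned_log_prior(Z) + sum(
        learned_log_likelihood(y, x, Z) for x, y in D)
    return max(CANDIDATE_ZS, key=score)

def predict_given_Z(x_star, Z):
    # Stub for a net trained on human predictions of y* given x* and Z*.
    return x_star * 1.1 if Z == "prices rise 10%/yr" else x_star

D = [(100, 110), (200, 220)]       # past data: prices rising ~10%/yr
Z_star = best_background_model(D)  # MAP background model given D
prediction = predict_given_Z(300, Z_star)  # predict on D* using Z*
```

Note that each learned component only ever sees inputs from the distribution it was trained on: human evaluations of single background models, or single (x, y, Z) triples.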

This of course only gets us to human performance, and only works when Z is small enough for a human to evaluate. If we want to have large background models allowing for superhuman performance, we can use iterated amplification and debate to learn Prior(Z) and P(y | x, Z). There is some subtlety about how to represent Z that I won’t go into here.

Planned opinion:

It seems to me like solving this problem has two main benefits. First, the model our AI system learns from data (i.e. the Z*) is interpretable, and in particular we should be able to extract the previously inaccessible information that is relevant to our goals (which helps us build AI systems that actually pursue those goals). Second, AI systems built in this way are incentivized to generalize in the same way that humans do: in the scheme above, we learn from one distribution D, and then predict on a new distribution D*, but every model learned with a neural net is only used on the same distribution it was trained on.

Of course, while the AI system is _incentivized_ to generalize the way humans do, that does not mean it _will_ generalize as humans do—it is still possible that the AI system internally “wants” to gain power, and only instrumentally answers questions the way humans would answer them. So inner alignment is still a potential issue. It seems possible to me that whatever techniques we use for dealing with inner alignment will also deal with the problems of unsafe priors as a side effect, in which case we may not end up needing to implement human-like priors. (As the post notes, it may be much more difficult to use this approach than to do the standard “neural net prior” approach described above, so it would be nice to avoid it.)

This will probably go out in the newsletter 9 days from now instead of the next one, partially because I have two things to highlight and I’d rather send them out separately, and partially because I’m not confident my summary / opinion are correct and I want to have more time for people to point out flaws.
