I think this is pretty complicated, and it stretches the meaning of several of the critical terms in important ways. What you said is reasonable given the limitations of the terminology, but it may ultimately be subtly misleading.

How I would currently put it (which I think strays further from the standard terminology than your analysis):

# Take 1

Prediction *is not a well-defined optimization problem*.

Maximum-a-posteriori reasoning (with a given prior) is a well-defined optimization problem, and we can ask whether it’s outer-aligned. The answer may be “no, because the Solomonoff prior contains malign stuff”.
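For concreteness (notation mine, not from the original discussion), MAP inference with a prior $p(\theta)$ and data $D$ is the fully specified optimization problem:

```latex
\theta^{\text{MAP}}
  = \arg\max_{\theta} \; p(\theta \mid D)
  = \arg\max_{\theta} \; \log p(D \mid \theta) + \log p(\theta)
```

Once $p(\theta)$ is fixed (e.g. to the Solomonoff prior), the target is pinned down, so the question "is its optimum outer-aligned?" is well-posed.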

Variational Bayes (with a given prior and variational loss) is similarly well-defined. We can similarly ask whether it’s outer-aligned.
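Again in my own (standard) notation: variational Bayes minimizes a well-defined variational loss over a family $\mathcal{Q}$ of approximate posteriors,

```latex
q^* = \arg\min_{q \in \mathcal{Q}} \; \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid D)\big)
    = \arg\min_{q \in \mathcal{Q}} \; \mathbb{E}_{q(\theta)}\big[\log q(\theta) - \log p(\theta) - \log p(D \mid \theta)\big]
```

where the second form drops the constant $\log p(D)$ (i.e., it maximizes the ELBO). Both the prior and the variational family are part of the problem statement.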

Minimizing square loss with a regularizing penalty is well-defined. Etc. Etc. Etc.

But “prediction” is not a clearly specified optimization target. Even if you fix the predictive loss (square loss, Bayes loss, etc.), you still need to specify a prior in order to get a well-defined expectation to minimize.
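That is, even with the loss $\ell$ fixed, the target only becomes well-defined once a distribution $P$ over inputs is chosen (notation mine):

```latex
f^* = \arg\min_{f} \; \mathbb{E}_{x \sim P}\big[\ell(f, x)\big]
```

Different choices of $P$ (Solomonoff, an empirical distribution, etc.) give different, genuinely distinct optimization problems.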

So the really well-defined question is whether specific predictive optimization targets are outer-aligned at optimum. And this type of outer-alignment seems to require the target to discourage mesa-optimizers!

This is a problem for the existing terminology, since it means these objectives are not outer-aligned unless they are also inner-aligned.

# Take 2

OK, but maybe you object. I’m assuming that “optimization” means “optimization of a well-defined function which we can completely evaluate”. But (you might say), we can also optimize under uncertainty. We do this all the time. In your post, you frame “optimal performance” in terms of loss+distribution. Machine learning treats the data as a sample from the true distribution, and uses this as a proxy, but adds regularizers *precisely because* it’s an imperfect proxy (but the regularizers are still just a proxy).

So, in this frame, we think of the true target function as the average loss on the true distribution (i.e., the distribution which will be encountered in the wild), and we think of gradient descent (and other optimization methods used inside modern ML) as optimizing a proxy (which is totally normal for optimization under uncertainty).
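A minimal sketch of this frame (every name and number here is my own illustration, not anything from the original post): the "true" target is expected loss under the true distribution, but what we actually minimize is a regularized empirical proxy.

```python
# Illustration: true target = expected loss under the true distribution;
# what ML actually optimizes = empirical loss on a sample + regularizer (a proxy).
import random

random.seed(0)

def true_risk(w, n=100_000):
    # Monte Carlo estimate of E[(w*x - y)^2] under the true data distribution:
    # x ~ Uniform(-1, 1), y = 2*x + Gaussian noise.
    total = 0.0
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = 2 * x + random.gauss(0, 0.1)
        total += (w * x - y) ** 2
    return total / n

def proxy_risk(w, sample, lam=0.01):
    # Regularized empirical risk: the proxy actually optimized in practice.
    emp = sum((w * x - y) ** 2 for x, y in sample) / len(sample)
    return emp + lam * w * w  # L2 penalty, added *because* the sample is imperfect

# A small training sample drawn from the same distribution.
sample = [(x, 2 * x + random.gauss(0, 0.1))
          for x in (random.uniform(-1, 1) for _ in range(20))]

# Gradient descent minimizes the proxy, not the true risk.
w = 0.0
for _ in range(500):
    grad = (sum(2 * (w * x - y) * x for x, y in sample) / len(sample)
            + 2 * 0.01 * w)
    w -= 0.1 * grad

# w lands near the true coefficient 2: here the proxy tracks the true target well,
# but nothing guarantees that under distributional shift.
```

The point of the toy example is just the bookkeeping: `true_risk` is the frame's "optimal performance" target, and gradient descent only ever touches `proxy_risk`.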

With this frame, I think the situation gets pretty complicated.

## Take 2.1

Sure, ok, if it’s just actually predicting the actual stuff, this seems pretty outer-aligned. Pedantic note: the term “alignment” is weird here. It’s not “perfectly aligned” in the sense of perfectly forwarding human values. But it could be non-malign, which I think is what people mostly mean by “AI alignment” when they’re being careful about meaning.

## Take 2.2

But this whole frame is saying that once we have outer alignment, *the problem that’s left* is the problem of correctly predicting the future. We have to optimize under uncertainty *because we can’t predict the future*. An outer-aligned loss function can nonetheless yield catastrophic results *because of distributional shift*. The Solomonoff prior is malign *because it doesn’t represent the future with enough accuracy,* instead containing some really weird stuff.

So, with this terminology, the inner alignment problem *is the prediction problem*. If we can predict well enough, then we can set up a proxy which gets us inner alignment (by heavily penalizing malign mesa-optimizers for their future treacherous turns). Otherwise, we’re stuck with the inner alignment problem.

So given this use of terminology, “prediction is outer-aligned” is a pretty weird statement. Technically true, but prediction *is the whole inner alignment problem.*

## Take 2.3

But wait, let’s reconsider 2.1.

In this frame, “optimal performance” means optimal at deployment time. This means we get all the strange incentives that come from online learning. We aren’t *actually doing* online learning, but *optimal performance* would respond to those incentives anyway.

(You somewhat circumvent this in your “extending the training distribution” section when you suggest proxies such as the Solomonoff distribution rather than using *the actual future* to define optimality. But this can reintroduce the same problem and more besides. Option #1, Solomonoff, is probably accurate enough to re-introduce the problems with self-fulfilling prophecies, besides being malign in other ways. Option #3, using a physical quantum prior, requires a solution to quantum gravity, and is also probably accurate enough to re-introduce the same problems with self-fulfilling prophecies. The only option I consider feasible is #2, human priors. Because humans could notice this whole problem and refuse to be part of a weird loop of self-fulfilling prediction.)

(Much of this has been touched on already in our Discord conversation:)

> Surely this isn’t relevant! We don’t by any means *want* the value function to equal the reward function. What we *want* (at least in standard RL) is for the value function to be the solution to the dynamic programming problem set up by the reward function and world model (or, more idealistically, the reward function and the *actual* world).

> While something like this seems possible, it strikes me as a better fit for systems that do explicit probabilistic reasoning, as opposed to NNs. Like, if we’re talking about predicting what ML people will do, the sentence “the value function is a function of the latent variables in the world model” makes a lot more sense than the clarification “even abstract concepts are assigned values”. Because it makes more sense for the value to be just another output of the same world-model NN, or perhaps, to be a function of a “state vector” produced by the world-model NN, or *maybe* a function taking the whole activation vector of the world-model NN at a time-step as an *input*, as opposed to a value function which is explicitly creating *output* values for each node in the value function NN (which is what it sounds like when you say even abstract concepts are assigned values).

> This seems pretty implausible to me, as we’ve discussed. Like, yes, it might be a good research direction, and it isn’t *terribly* non-prosaic. However, the current direction seems pretty focused on offline learning (even RL, which was originally intended specifically for online learning, has become a primarily offline method!!), and GPT-3 has convinced everyone that the best way to get online learning is to do massive offline training and rely on the fact that if you train on enough variety, learning-to-learn is inevitable.

> I think my GPT-3 example adequately addresses the first two points, and memory networks adequately address the third.

> These points are more interesting, but I think it’s plausible that architectural innovations could deal with them w/o true online learning.
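In standard RL notation (my symbols, not from the conversation), the “dynamic programming problem set up by the reward function and world model” mentioned above is the Bellman optimality equation:

```latex
V^*(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]
```

So the value function we *want* is jointly determined by the reward function $R$ and the world model $P$, which is exactly why it should not simply equal the reward function.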