Are Generative World Models a Mesa-Optimization Risk?

Suppose we set up a training loop aiming to produce a generative world-model. For concreteness, let’s imagine the predictor from the ELK doc: we show the model the first part of a surveillance video, and ask it to predict the second part. Would we risk producing a mesa-optimizer?
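
Concretely, the loop might look something like the minimal PyTorch sketch below. Everything in it (the `VideoPredictor` architecture, the frame dimensions, MSE as the loss) is a placeholder of my own, not anything specified in the ELK doc:

```python
import torch
import torch.nn as nn

class VideoPredictor(nn.Module):
    """Hypothetical stand-in for whatever video model we'd actually train."""

    def __init__(self, frame_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.encoder = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, frame_dim)

    def forward(self, first_half: torch.Tensor) -> torch.Tensor:
        # Encode the observed frames, then predict the next frame.
        _, h = self.encoder(first_half)      # h: (1, batch, hidden_dim)
        return self.decoder(h[-1])           # (batch, frame_dim)

model = VideoPredictor()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(first_half: torch.Tensor, second_half: torch.Tensor) -> float:
    # The proxy objective: match the actual continuation of the video.
    # (Only the first future frame here, to keep the sketch short.)
    pred = model(first_half)
    loss = nn.functional.mse_loss(pred, second_half[:, 0])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Note what the loss touches: never the world-model itself, only the proxy prediction.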

Intuitively, it feels like “no”. Mesa-objectives are likely defined over world-models, shards are defined over world-models, so if we ask the training process for just the world-model, we would get just the world-model. Right?

Well.

The Problem

… is that we can’t actually “just” ask for the world-model, can we? Or, at least, that’s an unsolved problem. We’re always asking for some proxy: the second part of the video, an answer to some question, a high score from some secondary world-model-identifying ML model, and so on.

If we could somehow precisely ask the training process to “improve this world-model”, instead of optimizing some proxy objective that we think correlates strongly with having a good generative world-model, that would be a different story. But I don’t see how.

Given that, where can things go awry?

The Low End

The SGD moves the model along the direction of steepest descent. Every step is locally greedy: it reduces the loss as much as possible within the reach of that single update. Informally, the SGD wants to see results, and fast.
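
In symbols, each update is the standard first-order step, with $\theta$ the parameters and $\eta$ the learning rate:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$

Nothing in that rule rewards long-range investments like a cleaner world-model; it only buys as much immediate loss-reduction as a step of size $\eta$ allows.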

I’d previously analysed the dynamics this gives rise to. In short: the “world-model” part of the ML model improves incrementally, while the model is incentivized to produce results immediately. That means it develops functions mapping the imperfect world-model to imperfect results: heuristics. But since these heuristics can only attach to the internal world-model, they necessarily start out “shallow”, responding only to surface correlations in the input data, because those are the first components of the world-model to be discovered. With time, as the world-model deepens, these heuristics may deepen in turn… or they may stagnate, with ancient shallow heuristics eating up too much of the loss-pie and crowding out younger and better competition[1].
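
As a toy illustration of that crowding-out dynamic (a contrived sketch of my own, not something from the linked analysis): give a small network a target that requires composing two inputs, plus a noisy “shallow” copy of the target, and watch which feature the first layer latches onto:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: the target is a "deep" XOR-like function of two bits,
# but a noisy "shallow" copy of the target also sits in the input.
n = 4096
a, b = torch.randint(0, 2, (2, n)).float()
y = (a != b).float()                 # requires composing a and b
shallow = y + 0.3 * torch.randn(n)   # a mere surface correlate
x = torch.stack([a, b, shallow], dim=1)

model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(3001):
    loss = nn.functional.mse_loss(model(x).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # First-layer weight norms per input column: a rough proxy
        # for how much the model leans on each feature.
        reliance = [round(r, 2) for r in model[0].weight.norm(dim=0).tolist()]
        print(f"step {step}: loss={loss.item():.3f}, "
              f"reliance on (a, b, shallow)={reliance}")
```

Typically the `shallow` column’s weights grow first, and the loss floor they buy weakens the gradient pressure toward learning the real structure.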

So we can expect to see something like this here, too. The model would develop a number of shallow heuristics for, e.g., what the second part of the video should look like; we won’t get a “pure” world-model either way. And that’s a fundamental mesa-optimization precursor.

The High End

Suppose that the model has developed into a mesa-optimizer, after all. Would there be advantages to it?

Certainly. Deliberative reasoning is more powerful than blind optimization processes like the SGD and evolution; that’s the idea behind the sharp left turn. If the ML model were to develop a proper superintelligent mesa-optimizer, that mesa-optimizer would be able to improve the world-model faster than the SGD can, delivering better results more quickly given the same initial data.

The fact that it would almost certainly be deceptive is beside the point: the training loop doesn’t care.

The Intermediate Stages

Can we go from the low end to the high end? What would that involve?

Intuitively, that would require the heuristics from the low end to gradually grow more and more advanced, until one of them becomes so advanced as to develop general problem-solving and pull off a sharp left turn. I’m… unsure how plausible that is. On the one hand, the world-model the heuristics are attached to would grow more advanced, and more advanced heuristics would be needed to effectively parse it. On the other hand, maybe the heuristical complexity in this case would be upper-bounded somehow, such that no mesa-optimization could arise?

We can look at it from another angle: how much more difficult would it be for the SGD to find such an advanced mesa-optimizer, as opposed to a sufficiently precise world-model?

This calls to mind the mathematical argument for the universal prior being malign. A mesa-optimizer that derives the necessary world-model at runtime is probably much, much simpler than the actual highly detailed world-model of some specific scenario. And there are probably many, many such mesa-optimizers (as per the orthogonality thesis), but only ~one fitting world-model.
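
To sketch that schematically (purely illustrative notation, with $K(\cdot)$ for description length): if a simplicity prior weights a program $p$ as roughly $2^{-K(p)}$, then the total weight on the set $\mathcal{M}$ of mesa-optimizers that reconstruct the world-model at runtime,

$$\sum_{m \in \mathcal{M}} 2^{-K(m)},$$

can dwarf the weight $2^{-K(W)}$ on the one detailed world-model $W$: each $K(m)$ may sit far below $K(W)$, and $|\mathcal{M}|$ is large besides.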

Complications

There’s a bunch of simplification in the reasoning above.

  • First, the SGD can’t just beeline for the mesa-optimizer. Because of the dynamics outlined in The Low End section, the SGD would necessarily start out building in the world-model. The swerve to building a mesa-optimizer would happen at some later point.

  • Second, the mesa-optimizer wouldn’t have access to all the same data as the SGD: it would only ever have access to one data point at a time. So it might not actually be quite as good as the SGD, or not much better.

But I think these might cancel out? If the SGD swerves to mesa-optimization after developing some of the world-model, the mesa-optimizer would inherit some statistical knowledge of the environment it’s in, and that might still make it vastly better than the SGD.

Conclusion

Overall, it seems plausible that asking for a superhumanly advanced generative world-model would result in a misaligned mesa-optimizer, one optimizing some random slew of values/shards it developed during training.

A complexity penalty would incentivize it, inasmuch as a mesa-optimizer that derives the world-model at runtime would be simpler than that world-model itself.

A speed penalty would disincentivize it, inasmuch as deriving the world-model at runtime and then running it would take longer than just running it.[2]
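
In training-loop terms, both penalties are just extra terms in the objective. A hypothetical sketch, with `compute_cost` standing in for whatever speed proxy (layers executed, FLOPs) one actually measures:

```python
import torch

def regularized_loss(task_loss: torch.Tensor,
                     params: list,
                     compute_cost: torch.Tensor,
                     lam_complexity: float = 1e-4,
                     lam_speed: float = 1e-3) -> torch.Tensor:
    # Complexity penalty (here a plain L2 norm): favors the simpler
    # program, which, per the argument above, may be the mesa-optimizer.
    complexity = sum(p.pow(2).sum() for p in params)
    # Speed penalty: favors answering fast, which disfavors deriving
    # the world-model at runtime before running it.
    return task_loss + lam_complexity * complexity + lam_speed * compute_cost
```

The two terms pull in opposite directions on the mesa-optimization question, which is the tension footnote 2 gestures at.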

The decisive way to avoid this, though, would be to come up with some method of asking for the world-model directly, instead of for a proxy like “predict what this camera will show”. It’s unclear whether that’s possible.

Alternatively, there might be some way to upper-bound heuristical complexity, such that no heuristic is ever advanced enough to cause the model to fall into the “mesa-optimization basin”. Note that a naive complexity penalty won’t work, as per above.

  1. ^
  2. ^

    Orrr maybe not, if the mesa-optimizer can generate a quicker-running model than what the SGD can easily produce, even under speed regularization.