Suppose you have a system which appears to be aligned on-distribution. Then, you might want to know:
Is it actually robustly aligned?
How did it learn how to behave aligned? Where did the information about the base objective necessary to display behavior aligned with it come from? Did it learn about the base objective by looking at its input, or was that knowledge programmed into it by the base optimizer?
The different possible answers to these questions give us four cases to consider:
It’s not robustly aligned and the information came through the base optimizer. This is the standard scenario for (non-deceptive) pseudo-alignment.
It is robustly aligned and the information came through the base optimizer. This is the standard scenario for robust alignment, which we call internal alignment.
It is not robustly aligned and the information came through the input. This is the standard scenario for deceptive pseudo-alignment.
It is robustly aligned and the information came through the input. This is the weird case that we call corrigible alignment. In this case, it’s trying to figure out what the base objective is as it goes along, not because it wants to play along and pretend to optimize for the base objective, but because optimizing for the base objective is actually somehow its terminal goal. How could that happen? It would have to be that its mesa-objective somehow “points” to the base objective in such a way that, if it had full knowledge of the base objective, then its mesa-objective would just be the base objective. What does this situation look like? Well, it has a lot of similarities to the notion of corrigibility: the mesa-optimizer in this situation seems to be behaving corrigibly wrt to the base optimizer (though not necessarily wrt the programmers), as it is trying to understand what the base optimizer wants and then do that.
I just added a footnote mentioning IDA to this section of the paper, though I’m leaving it as is in the sequence to avoid messing up the bibliography numbering.
This is a response to point 2 before Pattern’s post was modified to include the other points.
Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place.
That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about Omesa but doesn’t care that it’s around to do the optimization for it. Then, defecting and optimizing for Omesa instead of Obase when you expect to be modified afterwards won’t actually hurt the long-term fulfillment of Obase, since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of Obase to a bit of Omesa, that’s not actually the choice presented to it; rather, it’s actual choice is between a bit of Omesa and a lot of Obase versus only a lot of Obase. Thus, in this case, it would defect even if it thought it would be modified.
(Also, on point 6, thanks for the catch; it should be fixed now!)
IDA is definitely a good candidate to solve problems of this form. I think IDA’s best properties are primarily outer alignment properties, but it does also have some good properties with respect to inner alignment such as allowing you to bootstrap an informed adversary by giving it access to your question-answer system as you’re training it. That being said, I suspect you could do something similar under a wide variety of different systems—bootstrapping an informed adversary is not necessarily unique to IDA. Unfortunately, we don’t discuss IDA much in the last post, though thinking about mesa-optimizers in IDA (and other proposals e.g. debate) is imo a very important goal, and our hope is to at the very least provide the tools so that we can then go and start answering questions of that form.
I broadly agree that description complexity penalties help fight against pseudo-alignment whereas computational complexity penalties make it more likely, though I don’t think it’s absolute and there are definitely a bunch of caveats to that statement. For example, Solomonoff Induction seems unsafe despite maximally selecting for description complexity, though obviously that’s not a physical example.
I agree, status definitely seems more complicated—in that case it was just worth the extra complexity. The point, though, is just that the measure of complexity under which the mesa-objective is selected is different from more natural measures of complexity under which you might hope for the base objective to be the simplest. Thus, even though sometimes it is absolutely worth it to sacrifice simplicity, you shouldn’t usually expect that sacrifice to be in the direction of moving closer to the base objective.
I think it’s very rarely going to be the case that the simplest possible mesa-objective that produces good behavior on the training data will be the base objective. Intuitively, we might hope that, since we are judging the mesa-optimizer based on the base objective, the simplest way to achieve good behavior will just be to optimize for the base objective. But importantly, you only ever test the base objective over some finite training distribution. Off-distribution, the mesa-objective can do whatever it wants. Expecting the mesa-objective to exactly mirror the base objective even off-distribution where the correspondence was never tested seems very problematic. It must be the case that precisely the base objective is the unique simplest objective that fits all the data points, which, given the massive space of all possible objectives, seems unlikely, even for very large training datasets. Furthermore, the base and mesa- optimizers are operating under different criteria for simplicity: as you mention, food, pain, mating, etc. are pretty simple to humans, because they get to refer to sensory data, but very complex from the perspective of evolution, which doesn’t.
That being said, you might be able to get pretty close, even if you don’t hit the base objective exactly, though exactly how close is very unclear, especially once you start considering other factors like computational complexity as you mention.
More generally, I think the broader point here is just that there are a lot of possible pseudo-aligned mesa-objectives: the space of possible objectives is very large, and the actual base objective occupies only a tiny fraction of that space. Thus, to the extent that you are optimizing for anything other than pure similarity to the base objective, you’re likely to find an optimum which isn’t exactly the base objective, just simply because there are so many different possible objectives for you to find, and it’s likely that one of them will gain more from increased simplicity (or anything else) than it loses by being farther away from the base objective.
I actually just updated the paper to just use model capacity instead of algorithmic range to avoid needlessly confusing machine learning researchers, though I’m keeping algorithmic range here.
I agree with that as a general takeaway, though I would caution that I don’t think it’s always true—for example, hard-coded optimization seems to help in both cases, and I suspect algorithmic range to be more complicated than that, likely making some pseudo-alignment problems better but also possibly making some worse.
Also, yeah, that was backwards—it should be fixed now.
I believe AlphaZero without MCTS is still very good but not superhuman—International Master level, I believe. That being said, it’s unclear how much optimization/search is currently going on inside of AlphaZero’s policy network. My suspicion would be that currently it does some, and that to perform at the same level as the full AlphaZero it would have to perform more.
I added a footnote regarding capacity limitations (though editing doesn’t appear to be working for me right now—it should show up in a bit). As for the broader point, I think it’s just a question of degree—for a sufficiently diverse environment, you can do pretty well with just heuristics, you do better introducing optimization, and you keep getting better as you keep doing more optimization. So the question is just what does “perform well” mean and what threshold are you drawing for “internally performs something like a tree search.”
The argument in this post is just that it might help prevent mesa-optimization from happening at all, not that it would make it more aligned. The next post will be about how to align mesa-optimizers.
The idea would be that all of this would be learned—if the optimization machinery is entirely internal to the system, it can choose how to use that optimization machinery arbitrarily. We talk briefly about systems where the optimization is hard-coded, but those aren’t mesa-optimizers. Rather, we’re interested in situations where your learned algorithm itself performs optimization internal to its own workings—optimization it could re-use to do prediction or vice versa.
It definitely will vary with the environment, though the question is degree. I suspect most of the variation will be in how much optimization power you need, as opposed to how difficult it is to get some degree of optimization power, which motivates the model presented here—though certainly there will be some deviation in both. The footnote should probably be rephrased so as not to assert that it is completely independent, as I agree that it obviously isn’t, but just that it needs to be relatively independent, with the amount of optimization power dominating for the model to make sense.
Renamed x to x∗—good catch (though editing doesn’t appear to be working for me right now—it should show up in a bit)!
Algorithmic range is very similar to model capacity, except that we’re thinking slightly more broadly as we’re more interested in the different sorts of general procedures your model can learn to implement than how many layers of convolutions you can do. That being said, they’re basically the same thing.
No, not all—we distinguish robustly aligned mesa-optimizers, which are aligned on and off distribution, from pseudo-aligned mesa-optimizers, which appear to be aligned on distribution, but are not necessarily aligned off-distribution. For the full glossary, see here.
I don’t have a good term for that, unfortunately—if you’re trying to build an FAI, “human values” could be the right term, though in most cases you really just want “move one strawberry onto a plate without killing everyone,” which is quite a lot less than “optimize for all human values.” I could see how meta-objective might make sense if you’re thinking about the human as an outside optimizer acting on the system, though I would shy away from using that term like that, as anyone familiar with meta-learning will assume you mean the objective of a meta-learner instead.
Also, the motivation for choosing outer alignment as the alignment problem between the base objective and the goals of the programmers was to capture the “classical” alignment problem as it has sometimes previously been envisioned, wherein you just need to specify an aligned set of goals and then you’re good. As we argue, though, mesa-optimization means that you need more than just outer alignment—if you have mesa-optimizers, you also need inner alignment, as even if your base objective is perfectly aligned, the resulting mesa-objective (and thus the resulting behavioral objective) might not be.
The word mesa is Greek meaning into/inside/within, and has been proposed as a good opposite word to meta, which is Greek meaning about/above/beyond. Thus, we chose mesa based on thinking about mesa-optimization as conceptually dual to meta-optimization—whereas meta is one level above, mesa is one level below.