I think it’s very rarely going to be the case that the simplest possible mesa-objective that produces good behavior on the training data will be the base objective. Intuitively, we might hope that, since we are judging the mesa-optimizer based on the base objective, the simplest way to achieve good behavior will just be to optimize for the base objective. But importantly, you only ever test the base objective over some finite training distribution. Off-distribution, the mesa-objective can do whatever it wants. Expecting the mesa-objective to exactly mirror the base objective even off-distribution, where the correspondence was never tested, seems very problematic. For that to happen, precisely the base objective would have to be the unique simplest objective that fits all the data points, which, given the massive space of all possible objectives, seems unlikely, even for very large training datasets. Furthermore, the base and mesa-optimizers are operating under different criteria for simplicity: as you mention, food, pain, mating, etc. are pretty simple to humans, because they get to refer to sensory data, but very complex from the perspective of evolution, which doesn’t.
That being said, you might be able to get pretty close, even if you don’t hit the base objective exactly, though just how close is very unclear, especially once you start considering other factors like computational complexity, as you mention.
More generally, I think the broader point here is just that there are a lot of possible pseudo-aligned mesa-objectives: the space of possible objectives is very large, and the actual base objective occupies only a tiny fraction of that space. Thus, to the extent that you are optimizing for anything other than pure similarity to the base objective, you’re likely to find an optimum which isn’t exactly the base objective, simply because there are so many different possible objectives for you to find, and it’s likely that one of them will gain more from increased simplicity (or anything else) than it loses by being farther away from the base objective.
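To make the counting argument concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than anything from the original discussion: an "objective" is modeled as a 0/1 labeling of all 3-bit inputs, the hypothetical `base_objective` rewards inputs whose first bit is set, and only half of the inputs are ever seen in training. It simply counts how many candidate objectives match the base objective on the training inputs.

```python
from itertools import product

# All 3-bit inputs. In this toy model an "objective" is just a labeling
# of every input as good (1) or bad (0).
inputs = list(product([0, 1], repeat=3))

def base_objective(x):
    # Hypothetical base objective: reward inputs whose first bit is set.
    return x[0]

# Only half of the inputs are ever seen during training.
train_inputs = inputs[:2] + inputs[4:6]

# Enumerate every possible objective as a dict from input to label.
all_objectives = [
    dict(zip(inputs, labels))
    for labels in product([0, 1], repeat=len(inputs))
]

# Objectives that are indistinguishable from the base objective on the
# training inputs.
consistent = [
    obj for obj in all_objectives
    if all(obj[x] == base_objective(x) for x in train_inputs)
]

# Of those, the ones that differ from the base objective somewhere
# off-distribution (pseudo-aligned in this toy sense).
pseudo_aligned = [
    obj for obj in consistent
    if any(obj[x] != base_objective(x) for x in inputs)
]

print(len(all_objectives), len(consistent), len(pseudo_aligned))
```

In this toy setting, each unseen input doubles the number of objectives that fit the training data, while only one of them is the base objective itself. The sketch says nothing about which of those objectives a simplicity bias would actually pick.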
I agree that inner alignment is a really hard problem, and that for a non-huge amount of training data, there is likely to be a proxy goal that’s simpler than the real goal. Description length still seems importantly different from e.g. computation time. If we keep optimising for the simplest learned algorithm, and gradually increase our training data towards all of the data we care about, I expect us to eventually reach a mesa-optimiser optimising for the base objective. (You seem to agree with this, in the last section?) However, if we keep optimising for the fastest learned algorithm, and gradually increase our training data towards all of the data we care about, we won’t ever get a robustly aligned system (until we’ve shown it every single datapoint that we’ll ever care about). We’ll probably just get a look-up table which acts randomly on new input.
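A toy sketch of the contrast, under loudly stated assumptions: the target behaviour (`base_objective`, parity of an integer), the `LookupTable` class, and `simple_rule` are all hypothetical stand-ins, and the simple rule coincides with the target by construction. So this only illustrates how the two kinds of algorithm behave on unseen inputs, not that selecting for simplicity would actually find the right rule.

```python
import random

def base_objective(x: int) -> int:
    # Hypothetical target behaviour: label an integer by its parity.
    return x % 2

train_inputs = list(range(0, 20))
train_data = {x: base_objective(x) for x in train_inputs}

class LookupTable:
    """Stand-in for the 'fastest' learned algorithm: memorise training pairs."""

    def __init__(self, data, seed=0):
        self.table = dict(data)
        self.rng = random.Random(seed)

    def __call__(self, x):
        if x in self.table:
            return self.table[x]
        # Never seen this input: the output is arbitrary.
        return self.rng.randint(0, 1)

def simple_rule(x: int) -> int:
    """Stand-in for the 'simplest' learned algorithm: a short rule fitting the data."""
    return x % 2

def accuracy(model, xs):
    # Fraction of inputs on which the model agrees with the target behaviour.
    return sum(model(x) == base_objective(x) for x in xs) / len(xs)

lookup = LookupTable(train_data)
test_inputs = list(range(20, 1020))  # inputs never shown during training

train_acc_lookup = accuracy(lookup, train_inputs)
train_acc_simple = accuracy(simple_rule, train_inputs)
test_acc_lookup = accuracy(lookup, test_inputs)
test_acc_simple = accuracy(simple_rule, test_inputs)

print(train_acc_lookup, train_acc_simple, test_acc_lookup, test_acc_simple)
```

Both models are perfect on the training inputs. On new inputs the lookup table is right only about half the time, and showing it more data only helps on exactly the datapoints shown.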
This difference makes me think that simplicity could be a useful tool to make a robustly aligned mesa optimiser. Maybe you disagree because you think that the necessary amount of data is so ludicrously big that we’ll never reach it, even by using adversarial training or other such tricks?
I’d be more willing to drop simplicity if we had good, generic methods to directly optimise for “pure similarity to the base objective”, but I don’t know how to do this without doing hard-coded optimisation or internals-based selection. Maybe you think the task is impossible without some version of the latter?
I broadly agree that description complexity penalties help fight against pseudo-alignment whereas computational complexity penalties make it more likely, though I don’t think it’s absolute and there are definitely a bunch of caveats to that statement. For example, Solomonoff Induction seems unsafe despite maximally selecting for low description complexity, though obviously that’s not a physical example.
“as you mention, food, pain, mating, etc. are pretty simple to humans, because they get to refer to sensory data, but very complex from the perspective of evolution, which doesn’t.”
I chose status and cheating precisely because they don’t directly refer to simple sensory data. You need complex models of your social environment in order to even have a concept of status, and I actually think it’s pretty impressive that we have enough of such models hardcoded into us to have preferences over them.
Since the original text mentions food and pain as “directly related to our input data”, I thought status hierarchies were noticeably different from them in this way. Do tell me if you were trying to point at some other distinction (or if you don’t think status requires complex models).
I agree, status definitely seems more complicated—in that case it was just worth the extra complexity. The point, though, is just that the measure of complexity under which the mesa-objective is selected is different from more natural measures of complexity under which you might hope for the base objective to be the simplest. Thus, even though sometimes it is absolutely worth it to sacrifice simplicity, you shouldn’t usually expect that sacrifice to be in the direction of moving closer to the base objective.