I agree that inner alignment is a really hard problem, and that for a non-huge amount of training data, there is likely to be a proxy goal that’s simpler than the real goal. Description length still seems importantly different from e.g. computation time. If we keep optimising for the simplest learned algorithm, and gradually increase our training data towards all of the data we care about, I expect us to eventually reach a mesa-optimiser optimising for the base objective. (You seem to agree with this, in the last section?) However, if we keep optimising for the fastest learned algorithm, and gradually increase our training data towards all of the data we care about, we won’t ever get a robustly aligned system (until we’ve shown it every single datapoint that we’ll ever care about). We’ll probably just get a look-up table which acts randomly on new input.
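To make the contrast concrete, here is a toy sketch of the two selection pressures. It is only an illustration under loud assumptions: `base_objective` (parity), the hand-ordered `CANDIDATES` list standing in for "description length", and both learner functions are hypothetical names invented for this example, not anything from the original post.

```python
import random


def base_objective(x: int) -> int:
    """Hypothetical stand-in for the 'real goal': the parity of x."""
    return x % 2


def fastest_learner(train_inputs, seed=0):
    """Memorise the training pairs; answer unseen inputs arbitrarily."""
    table = {x: base_objective(x) for x in train_inputs}
    rng = random.Random(seed)

    def predict(x: int) -> int:
        if x in table:
            return table[x]       # constant-time look-up on seen data
        return rng.randint(0, 1)  # the table says nothing about unseen data

    return predict


# Candidate rules, listed from shortest to longest description.
# The ordering is an illustrative assumption, not a real complexity measure.
CANDIDATES = [
    ("always 0", lambda x: 0),
    ("always 1", lambda x: 1),
    ("x % 2", lambda x: x % 2),
    ("1 if x % 3 == 0 else 0", lambda x: 1 if x % 3 == 0 else 0),
]


def simplest_learner(train_inputs):
    """Return the first (shortest) candidate that fits every training pair."""
    for name, rule in CANDIDATES:
        if all(rule(x) == base_objective(x) for x in train_inputs):
            return name, rule
    raise ValueError("no candidate fits the training data")


if __name__ == "__main__":
    print(simplest_learner([2, 4])[0])     # small dataset
    print(simplest_learner([2, 4, 7])[0])  # one more datapoint
```

With only `[2, 4]` as training data, the simplest fitting rule is the proxy `always 0`; adding the single datapoint `7` rules the proxy out and the learner lands on `x % 2`, which matches the base objective everywhere. The look-up table, by contrast, stays arbitrary on every input it has not seen, no matter how much of the training set it has memorised.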
This difference makes me think that simplicity could be a useful tool for making a robustly aligned mesa-optimiser. Maybe you disagree because you think that the necessary amount of data is so ludicrously big that we’ll never reach it, even by using adversarial training or other such tricks?
I’d be more willing to drop simplicity if we had good, generic methods to directly optimise for “pure similarity to the base objective”, but I don’t know how to do this without doing hard-coded optimisation or internals-based selection. Maybe you think the task is impossible without some version of the latter?
I broadly agree that description complexity penalties help fight against pseudo-alignment whereas computational complexity penalties make it more likely, though I don’t think that’s absolute, and there are definitely a bunch of caveats to that statement. For example, Solomonoff Induction seems unsafe despite maximally selecting for low description complexity, though obviously that’s not a physical example.