It seems like the general pattern here is that, when using machine learning for some task X, there are a bunch of properties that affect the likelihood of learning heuristics or proxies rather than actually learning the optimal thing for X. For any such property, making heuristics/proxies more likely would result in a lower chance of mesa-optimization (since optimizers are less like heuristics/proxies) but, conditional on mesa-optimization arising, would make it more likely that the result is a pseudo-aligned mesa-optimizer rather than a robustly aligned one (because the pressure toward heuristics/proxies now leads to learning a proxy mesa-objective instead of the true base objective). Example properties of this form are algorithmic range, simplicity bias, and time complexity penalties. Does that seem right?
then developing a pseudo-aligned mesa-objective may require strictly more subprocesses than developing a robustly aligned mesa-objective.
This is backwards, I think?
I agree with that as a general takeaway, though I would caution that I don’t think it’s always true—for example, hard-coded optimization seems to help in both cases, and I suspect algorithmic range to be more complicated than that, likely making some pseudo-alignment problems better but also possibly making some worse.
Also, yeah, that was backwards—it should be fixed now.