[Question] Do mesa-optimization problems correlate with low-slack?

My understanding is that the primary inner-alignment problem is that even outer-aligned AIs may be highly unpredictable because of mesa-optimization. Put another way, even an air-tight objective function could be effectively ignored by a hypothetical superintelligent AI to such a degree that the AI still ends up misaligned.

It seems the general consensus on LessWrong (and possibly the broader AI safety community) is that increasing slack in AI training produces safer models. I may be misremembering, but I believe Paul Christiano specifically argued that strong pressure to optimize toward an objective function is one of the most likely sources of unsafe AI.

However, one would imagine that the more slack an AI model has, the harder its inner-alignment problems become: slack gives a mesa-optimizer more room to pursue goals other than the training objective.

I’d like to pose a hypothetical question to help me understand this dilemma a little more clearly. Imagine a world where we have largely solved outer alignment (while remaining unsure about inner alignment) and are forced to deploy AGI models (to beat Dr. Amoral, perhaps). Do you think reducing slack would be an effective way to balance inner- and outer-alignment risks?