The argument for risk doesn’t depend on the definition of mesa optimization. I would state the argument for risk as “the AI system’s capabilities might generalize without its objective generalizing”, where the objective is defined via the intentional stance. Certainly this can be true without the AI system being 100% a mesa optimizer as defined in the paper. I thought this post was suggesting that we should widen the term “mesa optimizer” so that it includes those kinds of systems (the current definition doesn’t), so I don’t think you and Matthew actually disagree.
It’s important to get this right, because solutions often do depend on the definition. Under the current definition, you might try to solve the problem by developing interpretability techniques that can find the mesa objective encoded in the weights of the neural net, so that you can check it is what you want. However, I don’t think that approach would work for other systems that are still risky, such as Bob in your example.