Well, mainly I’m saying that “Why not just directly train for the final behavior you want” is answered by the classic reasons why you don’t always get what you trained for. (The mesaoptimizer need not have the same goals as the optimizer; the AI agent need not have the same goals as the reward function, nor the same goals as the human tweaking the reward function.) Your comment makes more sense to me if interpreted as about capabilities rather than about those other things.