The simple reason why that doesn’t work is that computing the base objective function during deployment will likely require something like actually running the model, letting it act in the real world, and then checking how good that action was. But in the case of actions that are catastrophically bad, the fact that you can check after the fact that the action was in fact bad isn’t much consolation: just having the agent take the action in the first place could be exceptionally dangerous during deployment.
I believe that in your reply above you might have been thinking of a subset of all possible base objective functions: functions that compute a single ‘pass/fail’ value at a natural end of the training run, e.g. ‘the car never crashed’ or ‘all parts of the floor have been swept’. I was thinking of incrementally scoring objective functions, that is, functions that sum the utility increments achieved over time. With such a function, at any time during a run you can measure and compute the base objective function score up to that time. Monitoring this score should allow you to automatically detect many forms of non-alignment between the base objective and the mesa objective.
As mentioned, I see this as a promising technique for risk mitigation; it is not supposed to be a watertight way to eliminate all risks. The technique considered only looks at the score achieved so far. It does not run models to extrapolate and score the long-term consequences of every action: this would indeed be difficult. While observed past good performance does not guarantee future good performance, an observation of past bad performance does give you a very useful safety signal.
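To make the idea concrete, here is a minimal Python sketch of the kind of monitor I have in mind. It is only an illustration: the class name `IncrementalScoreMonitor`, the sliding window, and the threshold `min_window_score` are all assumptions I am making up for this example, and a real deployment would need a domain-specific way to measure each utility increment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncrementalScoreMonitor:
    """Tracks a base objective score that is a sum of utility increments.

    Hypothetical illustration: the window size and threshold are assumed
    parameters, not part of any established method.
    """
    min_window_score: float  # alarm if the recent window sums below this
    window: int              # number of most recent increments to inspect
    increments: List[float] = field(default_factory=list)

    def record(self, utility_increment: float) -> bool:
        """Store one observed increment; return True if the alarm fires."""
        self.increments.append(utility_increment)
        recent = self.increments[-self.window:]
        # Only judge once a full window of past observations is available.
        return len(recent) == self.window and sum(recent) < self.min_window_score

    @property
    def score_so_far(self) -> float:
        """Base objective score achieved up to the current time."""
        return sum(self.increments)


# Example run: good performance at first, then a sharp drop.
monitor = IncrementalScoreMonitor(min_window_score=0.0, window=3)
observed = [1.0, 1.0, 0.5, -2.0, -3.0, 1.0]
alarms = [monitor.record(u) for u in observed]
print(alarms)                # [False, False, False, True, True, True]
print(monitor.score_so_far)  # -1.5
```

Note that the monitor only ever looks at increments that have already been observed. It makes no attempt to predict the consequences of future actions, so it can raise a warning after bad performance but cannot rule out a first catastrophic action.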