Yes, I agree that misalignment is mostly a sensible abstraction for optimisers (or similarly, “agents”).
Part of what we’re trying to claim here is that this gets messy when applied to current LLMs (which are probably not well modelled as agents). Sometimes you can use the messy definition to get something useful (e.g. distinguishing the METR reward hackers from the CoastRunners case). Other times you just get confusion!
(Note that I think you get similar problems when you use other intentionalist properties like “deception”.)
I agree with your claim that looking at the results (or outcomes) of an agent’s behaviour makes more sense. This is why I think evaluating a control monitor based on preventing bad outcomes is a lot easier. I think this is similar to your “1%” suggestion.
A quick note on why I think ruling out gradient hacking of SFT is important, despite SFT being quite a strong affordance that isn’t always applicable:
There are at least some safety-critical capability evals where we have ground-truth labels and can therefore use SFT. For example, the argument for Claim 2.1.1 in our recent safety case for untrusted monitoring uses SFT to upper-bound LLM self-recognition, and had a “missing piece”: ruling out gradient hacking.