I haven’t thought about this deeply, but I would assume you might have one monitor that triggers on a large variety of cases, one of which is “unusual nonsensical fact/motivation mentioned in CoT”.
I think labs will want to detect weird/unusual behavior for a variety of reasons, including simple product improvements, so this shouldn’t even be counted as an alignment tax?
Hmm, “instrumental usefulness” presupposes some terminal goal that it would be instrumental toward.
So you’re assuming early AGIs will have something like terminal goals. That assumption is itself not obviously true (see e.g. here: https://www.lesswrong.com/posts/Y8zS8iG5HhqKcQBtA/do-not-tile-the-lightcone-with-your-confused-ontology).
Also, it seems their goals will be something like “I want to do what my developers want me to do”, which will likely be pretty myopic, whereas preventing superintelligence is a long-term objective.