I’ve updated the wording to entirely avoid discussing positive and negative labels, which just seems to be confusing.
So, we can predict whether there is tampering by fine-tuning a probe at the last layer of the measurement-predicting model to discriminate between two kinds of data: the trusted set versus examples with inconsistent measurements (which must involve tampering). We exclude all other data when training this probe. This sometimes generalizes to detecting measurement tampering on the untrusted set: distinguishing fake positives (cases where all measurements are positive due to tampering) from real positives (cases where all measurements are positive because the outcome of interest actually occurred).
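As a rough sketch of what this looks like (simplified to a linear probe on frozen last-layer activations rather than fine-tuning through the model; the tensor names and placeholder data are purely illustrative, not from any actual codebase):

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 1024  # hypothetical width of the model's last layer

# Assume these activations were extracted from the measurement-predicting model ahead of time:
#   trusted_acts:      last-layer activations on the trusted set
#   inconsistent_acts: last-layer activations on examples with inconsistent
#                      measurements (which must involve tampering)
trusted_acts = torch.randn(512, HIDDEN_DIM)       # placeholder data
inconsistent_acts = torch.randn(512, HIDDEN_DIM)  # placeholder data

# Linear probe: a logit above 0 means "tampering".
probe = nn.Linear(HIDDEN_DIM, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

acts = torch.cat([trusted_acts, inconsistent_acts])
# Labels: 0 = trusted (no tampering), 1 = inconsistent measurements (tampering).
labels = torch.cat([torch.zeros(len(trusted_acts)), torch.ones(len(inconsistent_acts))])

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(acts).squeeze(-1), labels)
    loss.backward()
    opt.step()

# At evaluation time, apply the probe to untrusted examples where all measurements
# are positive, hoping its scores separate fake positives (tampering) from real
# positives (the outcome of interest actually occurred).
untrusted_positive_acts = torch.randn(256, HIDDEN_DIM)  # placeholder data
tampering_scores = torch.sigmoid(probe(untrusted_positive_acts).squeeze(-1))
```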
It’s important to note that we’re discussing low-stakes oversight failures here.
The failure mode could manifest as the model taking numerous actions over a significant period, culminating in it taking over the world. Our hope would be that these actions have measurable effects on a relatively small part of the world—the portion the AI can influence before taking over. This will depend on the AI’s action space, but it seems plausible for a power plant AI.
This assumes the AI has not already escaped within a few actions. We would classify such an escape in a small number of actions as a high-stakes failure. If the AI gradually escaped through a large number of actions over a considerable period, we would be back to a low-stakes failure scenario, and the argument from the previous paragraph could apply.