So, we can fine-tune a probe at the last layer of the measurement-predicting model to predict whether there is tampering, using these two kinds of data: the trusted set with negative labels, and examples with inconsistent measurements (which have tampering) with positive labels. We exclude all other data when training this probe. This sometimes generalizes to detecting measurement tampering on the untrusted set: distinguishing fake positives (cases where all measurements are positive due to tampering) from real positives (cases where all measurements are positive due to the outcome of interest).

This section confuses me. You say that this probe learns to distinguish fake positives from real positives, but isn’t it actually learning to distinguish real negatives and fake positives, since that’s what it’s being trained on? (Might be a typo.)

I think maybe the wording here is somewhat confusing with “negative labels” and “positive labels”.

Here’s a diagram explaining what we’re doing:

So, in particular, when I say “the trusted set with negative labels” I mean “label the entire trusted set with negative labels for the probe”. And when I say “inconsistent measurements (which have tampering) with positive labels” I mean “label inconsistent measurements with positive labels for the probe”. I’ll try to improve the wording in the post.

I’ve updated the wording to entirely avoid discussion of positive and negative labels, which just seems to be confusing.

So, we can predict if there is tampering by fine-tuning a probe at the last layer of the measurement-predicting model to discriminate between these two kinds of data: the trusted set versus examples with inconsistent measurements (which have tampering). We exclude all other data when training this probe. This sometimes generalizes to detecting measurement tampering on the untrusted set: distinguishing fake positives (cases where all measurements are positive due to tampering) from real positives (cases where all measurements are positive due to the outcome of interest).
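To make the setup concrete, here is a minimal sketch of the probe-training scheme described above. Everything here is a stand-in: the activations are synthetic (drawn from Gaussians, with a hypothetical hidden "tampering direction" injected for tampered examples), and the probe is a plain logistic regression trained by gradient descent rather than a fine-tuned head on a real model. The sketch only illustrates the data split — train on trusted examples versus inconsistent-measurement examples, exclude everything else, then score untrusted positives.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # hypothetical hidden size of the measurement-predicting model

# Hypothetical "tampering direction" used to synthesize activations.
# In practice, activations would come from the model's last layer.
tamper_dir = rng.normal(size=d)
tamper_dir /= np.linalg.norm(tamper_dir)

def acts(n, tampered):
    """Synthetic last-layer activations for n examples."""
    base = rng.normal(size=(n, d))
    return base + (2.0 * tamper_dir if tampered else 0.0)

# Probe training data: trusted set (no tampering) vs. examples with
# inconsistent measurements (tampering). All other data is excluded.
X = np.vstack([acts(200, tampered=False), acts(200, tampered=True)])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Linear (logistic-regression) probe trained by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

# The hoped-for generalization: on the untrusted set, fake positives
# (tampering) score higher than real positives (genuine outcome).
fake_pos = acts(100, tampered=True)
real_pos = acts(100, tampered=False)
score = lambda A: A @ w + b
print(score(fake_pos).mean() > score(real_pos).mean())
```

In this toy setup the generalization holds by construction, since the synthetic fake positives share the tampering direction seen at training time; the empirical question in the post is whether anything analogous holds for real model activations.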
