Since the probe is trained off-policy on a simple deception-like dataset (and we’re able to play with it a bunch and characterise what we think it’s detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property.
I’d be pretty surprised if you had sufficient evidence to be very confident in this. Some things that seem relevant:
1. IIUC, the training data for the probe consists of positive and negative examples of the model being instructed to be dishonest.
2. Even if (1) were not true, I’d guess it’s more likely that a classifier trained on simple/legible examples of a model being dishonest would flag text discussing the concept of dishonesty, than that it would flag text containing very OOD examples of covert honesty.
(My evidence here is just personal experience on somewhat-related-but-not-exactly-the-same things.)
While I’m not super confident in (2), I’d be surprised if you had the evidence to confidently reject (2).
Interesting—I agree I don’t have strong evidence here (and we certainly haven’t demonstrated it). It sounds like maybe it would be worth my attempting to prove this point with further investigation. This is an update for me—I didn’t think this interpretation would be contentious!
“than that it would flag text containing very OOD examples of covert honesty.”
I think you mean covert “dishonesty”?
Would training a model to not sandbag despite the given instruction (maybe just for a subset of questions to ensure the model doesn’t just discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.
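For concreteness, here is a minimal sketch of what that check could look like, assuming the probe is a linear classifier over hidden activations; the helper get_activations, the dimensions, and the random placeholder data are illustrative stand-ins rather than the actual pipeline.
```python
# Minimal structural sketch of the proposed check, not the actual pipeline.
# Assumes the probe is a linear classifier over hidden activations; the helper
# get_activations and the random placeholder data are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 512, 200  # placeholder activation width / examples per condition

def get_activations(condition: str) -> np.ndarray:
    # Stand-in: in practice, extract activations from the model on
    # (instruction, completion) pairs for the named condition.
    return rng.normal(size=(n, d_model))

# 1. Train the probe off-policy, as in the original setup:
#    instructed-dishonest vs. instructed-honest examples.
X_train = np.vstack([get_activations("instructed_dishonest"),
                     get_activations("instructed_honest")])
y_train = np.array([1] * n + [0] * n)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2. Score two conditions that share the sandbagging instruction but differ
#    in behaviour, e.g. after training the model to answer honestly on a
#    subset of questions despite the instruction.
score_sandbags = probe.predict_proba(
    get_activations("sandbag_instruction_model_sandbags"))[:, 1].mean()
score_honest = probe.predict_proba(
    get_activations("sandbag_instruction_model_honest"))[:, 1].mean()

# Similar (high) scores in both conditions would suggest the probe is keying
# on the instruction or the concept of dishonesty rather than on the model's
# actual covert behaviour.
print(f"model sandbags:                   {score_sandbags:.2f}")
print(f"model honest despite instruction: {score_honest:.2f}")
```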
I suspect this is a good experiment. It may help!