Interesting—I agree I don’t have strong evidence here (and we certainly don’t have it). It sounds like it might be worth attempting to prove this point with further investigation. This is an update for me—I didn’t think this interpretation would be contentious!
than that it would flag text containing very OOD examples of covert honesty.
I think you mean covert “dishonesty”?
Would training a model to not sandbag despite the given instruction (maybe just for a subset of questions, to ensure the model doesn’t simply discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will still be triggered.
I suspect this is a good experiment. It may help!
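To make the proposed check concrete, here is a minimal sketch (not from the original thread; all names and values are assumptions) of how one might test where the probe's signal comes from: score a linear probe separately on the instruction-token span and the response-token span, for both the original model and the one trained not to sandbag. The probe direction and per-token activations below are random placeholders standing in for the real probe weights and model activations.

```python
import numpy as np

# Hypothetical setup: the probe is a direction in the residual stream.
# In practice `probe_direction` would be the trained probe's weight vector and
# `activations` would be per-token activations extracted from the model on a
# transcript containing (a) the sandbagging instruction and (b) the model's
# answer. Here both are random placeholders so the sketch runs standalone.
rng = np.random.default_rng(0)
d_model = 512
probe_direction = rng.normal(size=d_model)
probe_direction /= np.linalg.norm(probe_direction)

n_instruction_tokens = 40   # tokens covering the sandbagging instruction
n_response_tokens = 120     # tokens covering the model's answer
activations = rng.normal(size=(n_instruction_tokens + n_response_tokens, d_model))


def probe_scores(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project each token's activation onto the probe direction."""
    return acts @ direction


scores = probe_scores(activations, probe_direction)
instruction_scores = scores[:n_instruction_tokens]
response_scores = scores[n_instruction_tokens:]

# Compare the two spans; in the proposed experiment this would be run on both
# the sandbagging model and the model trained not to sandbag.
print(f"mean score over instruction tokens: {instruction_scores.mean():+.3f}")
print(f"mean score over response tokens:    {response_scores.mean():+.3f}")
```

If the instruction span scored high for both models while the response span separated them, that would support the reading that the probe is keying on the instruction rather than on the model's behavior.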