Clément Dumas comments on White Box Control at UK AISI—Update on Sandbagging Investigations

Clément Dumas 21 Jul 2025 12:33 UTC
1 point
0
Would training a model to not sandbag despite the given instruction (maybe just for a subset of questions to ensure the model doesn’t just discard the information?) resolve your disagreement?

If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.
- Joseph Bloom 23 Jul 2025 11:21 UTC
  2 points
  0
  Parent
  I suspect this is a good experiment. It may help!