Oliver Daniels comments on Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Oliver Daniels 10 May 2025 16:33 UTC
1 point
0
Curious for your takes on the difficulty (for models) of sandbagging vs the difficulty (for AI developers) of preventing adversarial examples / jailbreaks. My quick thoughts:
- Adversarial robustness is hard, therefore sandbagging is hard (and thus easy to defeat)
- Adversarial robustness is improving, there we should expect sandbagging abilities to improve
- Even if sandbagging is equally as difficult as adversarial robustness, we might not look for it absent strong warning signs
- Even if sandbagging is equally difficult and we do look for it, finding “adversarial” inputs (that cause the model to reveal the capability) may be too expensive. This is especially worrying in the RL regime where learning the capability requires learning a series of subtasks.