Curious for your takes on the difficulty (for models) of sandbagging vs the difficulty (for AI developers) of preventing adversarial examples / jailbreaks. My quick thoughts:
Adversarial robustness is hard, therefore sandbagging is hard (and thus easy to defeat)
Adversarial robustness isimproving, there we should expect sandbagging abilities to improve
Even if sandbagging is equally as difficult as adversarial robustness, we might not look for it absent strong warning signs
Even if sandbagging is equally difficult and we do look for it, finding “adversarial” inputs (that cause the model to reveal the capability) may be too expensive. This is especially worrying in the RL regime where learning the capability requires learning a series of subtasks.
Curious for your takes on the difficulty (for models) of sandbagging vs the difficulty (for AI developers) of preventing adversarial examples / jailbreaks. My quick thoughts:
Adversarial robustness is hard, therefore sandbagging is hard (and thus easy to defeat)
Adversarial robustness is improving, there we should expect sandbagging abilities to improve
Even if sandbagging is equally as difficult as adversarial robustness, we might not look for it absent strong warning signs
Even if sandbagging is equally difficult and we do look for it, finding “adversarial” inputs (that cause the model to reveal the capability) may be too expensive. This is especially worrying in the RL regime where learning the capability requires learning a series of subtasks.