Interesting, perhaps that could work!
Could you say more about why this would not be a good sandbagging eval? My intuition is that finetuning on a few examples would fully elicit a model’s capabilities on a task, whereas it could sandbag given just the prompt.
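For concreteness, something like the sketch below is what I have in mind (the model name, demo strings, and hyperparameters are all placeholders, not anything I've actually run):

```python
# Minimal sketch of a few-example fine-tuning elicitation check.
# "placeholder/small-causal-lm" and the demo text are stand-ins; the
# hyperparameters are not tuned for anything.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "placeholder/small-causal-lm"   # hypothetical open-weights model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A handful of demonstrations of the task under evaluation.
demos = [
    {"text": "Task: <task description>\nAnswer: <gold answer>"},
    # ... a few more examples
]

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

train_set = Dataset.from_list(demos).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="elicitation-ft",
                           num_train_epochs=3,
                           per_device_train_batch_size=1,
                           learning_rate=1e-5),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# The eval itself: score the prompted-only model and the fine-tuned model on a
# held-out test set. A large jump from just a few demos would suggest the
# prompted model was performing below its actual capability.
```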
I have two arguments against it.
SOTA models are already trained with reinforcement learning on many tasks like coding, and Grok 4 required[1] as much compute for RL as for pre-training. Attempting to finetune the models with OOMs less compute than was spent on RL for similarly complex tasks is unlikely to elicit the capabilities (a rough back-of-the-envelope comparison is sketched below).
Sandbagging is supposed to be caused by models reasoning about the task and deciding that they shouldn’t complete it too well, even if their instinct is to do so. The models then realise that they are being evaluated rather than trained, which lets them play the training game while displaying poor capabilities in evaluation.
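To make the first point concrete, here is a purely illustrative back-of-the-envelope comparison using the standard ~6·N·D FLOP approximation for training compute. Every number below is a placeholder, not an estimate for Grok 4 or any other model:

```python
import math

# Back-of-the-envelope only: all quantities are placeholders.
params = 1e12            # hypothetical parameter count
rl_tokens = 1e12         # hypothetical tokens processed during RL post-training
ft_tokens = 100 * 1_000  # 100 demonstrations of ~1k tokens each

train_flops = lambda tokens: 6 * params * tokens   # ~6*N*D approximation

ratio = train_flops(rl_tokens) / train_flops(ft_tokens)
print(f"RL compute:          {train_flops(rl_tokens):.1e} FLOP")
print(f"Few-shot FT compute: {train_flops(ft_tokens):.1e} FLOP")
print(f"Gap: {ratio:.0e}x  (~{math.log10(ratio):.0f} orders of magnitude)")
```

On these made-up numbers the fine-tune uses roughly seven orders of magnitude less training compute than the RL run, which is the kind of gap I mean by "OOMs less".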
[1] However, it might have been due to xAI being algorithmically behind.