I have two arguments against it.
SOTA models are already trained with reinforcement learning on many tasks like coding, and Grok 4 required[1] as much compute for RL as for pre-training. Attempting to finetune such models with orders of magnitude less compute than was spent on RL for similarly complex tasks is unlikely to elicit the capabilities.
Sandbagging is supposed to arise when models reason about a task and decide that they shouldn’t complete it too well, even if they have the instinct to do so. The models then realise that they are being evaluated rather than trained, which lets them play the training game while displaying poor capabilities in evaluation.
[1] However, this might have been because xAI was algorithmically behind.