Really cool work.
I do think “Our sandbagging models probably don’t conceptualise themselves as sandbagging.” is a pretty fundamental limitation, and that a model organism that did conceptualize itself as sandbagging would yield substantially different results.
very excited for SDF + finetuning / RL MOs in this direction.
Agreed, and I’m also excited about SDF + finetuning MOs for this. Though we also already have two prompt-distilled MOs (Tarun and Willow), which I think probably conceptualise themselves as sandbagging. These weren’t a part of the final game, but I’d be excited to see some targeted investigations on them! Willow should be harder to detect than Tarun, as it went through an additional RL training phase.
I should get around to uploading it to the huggingface collection soon. Edit: it’s now available on huggingface and in the github repo.

totally speculating, but intuitively I expect that:
- with SDF + finetuning, self-conceptualization as sandbagging is causally upstream of sandbagging
- with prompt-distillation, self-conceptualization as sandbagging is causally downstream of sandbagging
I guess the core thing is trying to get an MO that sandbags reliably despite never being directly fine-tuned to sandbag. This could be accomplished by:
a) synthetic document finetuning on a scenario which motivates the model to sandbag (+ finetuning on other behavior implied by the scenario)
b) (logit-based) prompt distillation of sandbagging instructions on unrelated tasks, hoping that the prompt is sufficiently internalized such that the model generalizes to sandbagging (see the sketch below)
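For concreteness, here’s a minimal sketch of what I mean by b): the teacher is the same model with the sandbagging instructions prepended, and the student (no prompt) is trained to match the teacher’s next-token logits on unrelated task data. The model name, prompt, and training-loop details are placeholders of mine, not anything from the post:

```python
# Minimal sketch of logit-based prompt distillation (option b).
# All names here (model, prompt) are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model
SANDBAG_PROMPT = "<sandbagging instructions to distill>"  # placeholder

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
teacher = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()
student = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def distill_step(task_text: str) -> float:
    # Tokenize prompt and task separately so teacher and student see
    # the exact same task tokens (avoids boundary-retokenization issues).
    prompt_ids = tok(SANDBAG_PROMPT, return_tensors="pt").input_ids
    task_ids = tok(task_text, add_special_tokens=False,
                   return_tensors="pt").input_ids
    bos = prompt_ids[:, :1]  # assumes the tokenizer prepends a BOS token

    teacher_ids = torch.cat([prompt_ids, task_ids], dim=1)  # prompt + task
    student_ids = torch.cat([bos, task_ids], dim=1)         # task only

    n = task_ids.shape[1]
    with torch.no_grad():
        # Slice out the positions whose logits predict the n task tokens.
        t_logits = teacher(teacher_ids).logits[:, -(n + 1):-1, :]
    s_logits = student(student_ids).logits[:, -(n + 1):-1, :]

    # Train the student (no prompt) to match the teacher's (prompted)
    # next-token distributions: KL(teacher || student).
    loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

(Concatenating pre-tokenized segments is what keeps the two models’ logits aligned on the same task tokens, which is what makes the logit-matching well-defined.)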
My guess is that b) won’t work and that a) is doable but hard to get working reliably (such that the blue team ~always wins). Even so, I think it’s worthwhile to evaluate white-box methods in this setup, because I expect white-box methods to generalize better than black-box methods to models that sandbag more competently.