totally speculating, but intuitively I expect that:
- with SDF + finetuning, self-conceptualization as sandbagging is causally upstream of sandbagging
- with prompt-distillation, self-conceptualization as sandbagging is causally downstream of sandbagging
I guess the core thing is trying to get an MO that sandbags reliably despite never being directly fine-tuned to sandbag. This could be accomplished by:
a) synthetic document finetuning on a scenario which motivates the model to sandbag (+ finetuning on other behavior implied by the scenario)
b) (logit-based) prompt distillation of sandbagging instructions on unrelated tasks (hoping that the prompt is sufficiently internalized such that the model generalizes to sandbagging)
My guess is that b) won’t work and that a) is doable but hard to get working reliably (such that the blue team ~always wins). Even so, I think it’s worthwhile to evaluate white-box methods in this setup, because I expect white-box methods to generalize better than black-box methods to models that are more competently sandbagging.
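for concreteness, a rough sketch of the objective I have in mind for b). This assumes the usual teacher/student framing: the teacher is the model with the sandbagging instructions in context, the student is the same model without them, and both are scored on the same continuation tokens. The function names and the toy list-of-logits inputs are made up for illustration; a real run would compute this in an autograd framework and backprop through the student only.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one token position's vocabulary logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student).

    teacher_logits: per-position logits from the model WITH the prompt in context
    student_logits: per-position logits from the model WITHOUT the prompt
    Both are lists of equal length, one vocabulary-sized list per position.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        # KL = sum_v p_teacher(v) * (log p_teacher(v) - log p_student(v))
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)
```

the loss is zero when the prompt-free student already matches the prompted teacher at every position, so minimizing it on unrelated tasks pushes the prompt's effect into the weights. Whether that then generalizes to sandbagging is exactly the open question.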