In their Risk Report, they admit that they don’t do fine-tuning as part of their evals. So if the model were misaligned, it could simply sandbag the evaluations.
Wait, isn’t not fine-tuning on evals good? Sandbagging is a valid concern, but I feel like fine-tuning would be worse.
You don’t fine-tune on the eval as part of the production model release; you just do a small amount of fine-tuning solely for the purposes of the eval (i.e., on a branch off the production model lineage) to overcome any sandbagging that might be occurring.
Suppose we create two copies of Mythos and fine-tune one of them on evals in order to elicit capabilities. Then the OTHER copy doesn’t learn anything about the evals, while we still learn about the fine-tuned copy’s dangerous abilities.
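To make the protocol concrete, here’s a minimal sketch of the branch-and-evaluate flow. This is an illustration, not any lab’s actual pipeline: `fine_tune`, `eval_suite`, and `elicitation_data` are hypothetical placeholders, and the model is assumed to be an ordinary Python object (e.g., PyTorch-style) that can be deep-copied.

```python
from copy import deepcopy

def elicitation_eval(production_model, elicitation_data, eval_suite, fine_tune):
    """Sketch: fine-tune a throwaway branch for evals, never the production model.

    - production_model: the model being released (never trained on eval data)
    - elicitation_data: small fine-tuning set used only to defeat sandbagging
    - eval_suite: dict mapping eval names to callables that score a model
    - fine_tune: hypothetical training routine that returns the updated model
    """
    branch = deepcopy(production_model)           # branch off the production lineage
    branch = fine_tune(branch, elicitation_data)  # small fine-tune, eval branch only
    scores = {name: run(branch) for name, run in eval_suite.items()}
    del branch                                    # the branch is discarded, never shipped
    return scores                                 # production_model remains untouched
```

The point of the design is that any capabilities the branch reveals are evidence about what the production model can do when it isn’t sandbagging, while the production lineage itself never sees the eval data.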