Makes sense, thx for the references
I do want to emphasize that sandbagging in o3 was not an official model incrimination investigation in this blog post, and we did not attempt to pursue it with rigor. In terms of the R1 investigation, however, I think it’d be good for us to also try out the experiments from Anti-Scheming, esp. the consequence sensitivity analysis w/o the deployment framing in the prompt.
As for o3, all the blog post says is, btw, these two interventions we developed on R1 replicated on o3 (modulo the concluding sentence, which was an oversight). In hindsight, I agree with you that this could mislead someone: if we wanted to mention the o3 results, we should have explicitly engaged with the relevant results in the literature, or else not mentioned them at all. Anyhow, appreciate the accountability.
This is a cool and creative idea, I’m excited to see where it goes!
I’m curious to hear a bit more on the following excerpt about legible evidence of misalignment, and the ultimate use case you envision for this technique:
As this post discusses, there are two use cases we could envision for a positive incrimination via distillation finding:
1) It provides sufficiently legible evidence of misalignment, meaning it is enough to convince stakeholders that the model is likely to knowingly cause harm in future deployments. How likely do you think this is to be the case? Even assuming the methodology is developed to the point that the false positive rate (FPR) is low, and you run appropriate controls such as training on trusted data, I’m initially skeptical due to the complexity of the technique.
2) It helps guide the gathering of more legible behavioral evidence. An example unsupervised technique I’m excited about here is NLAs, where the richness of the natural language description, while not guaranteed to be faithful, should be useful in gaining intuition about the model and guiding behavioral experiments. How do you envision the behavior of the distilled model helping in this frame?