Cool work, thanks for doing it! What are your takeaways from the inoculation prompting experiments? The fact that the “please hack” prompt results in the highest misalignment rate on average seems to be in conflict with Anthropic’s results, but given the low misalignment rates on all evals aside from Frame Colleague and the somewhat wide error bars, I’m not sure how much I should read into these results.
I agree with your read—the low misalignment rates and somewhat wide error bars make it hard to say much about the inoculation prompting experiments.