Thanks for running this, I think these results are cool!
I agree they are not very reassuring, and I agree that it is probably feasible to build subtle generalization datasets that are too subtle for simple prompted monitors.
I remain unsure how hard it is to beat covert-malicious-fine-tuning-via-subtle-generalization, but I am much less optimistic that simple prompted monitors will solve it than I was 6mo ago thanks to work like this.
Thanks for running this, I think these results are cool!
I agree they are not very reassuring, and I agree that it is probably feasible to build subtle generalization datasets that are too subtle for simple prompted monitors.
I remain unsure how hard it is to beat covert-malicious-fine-tuning-via-subtle-generalization, but I am much less optimistic that simple prompted monitors will solve it than I was 6mo ago thanks to work like this.