Cool stuff! I’ve been concerned that inoculation prompting just sticks all of its misaligned behaviour behind a backdoor/conditionalisation, so it’s useful to know this is occasionally the case.
How much do you think your results are affected by how brittle inoculation prompting’s effectiveness is to the prompts used? For example, in the cases where the rephrased prompts did not reduce the negative traits and improve the positive traits, how likely do you think it is that a different set of rephrased prompts would have done so? That you were sometimes able to inoculate against only the negative trait in the trait pairs (and never only the positive) is quite encouraging though, and I think it’d be really valuable to better understand when and why this occurs, and why it’s comparatively rare.
henryc
Karma: 32
Emergent Misalignment and the Anthropic Dispute
Inoculation Prompting: Open Questions and My Research Priorities
Interesting insight, though with MATS claiming 446 alumni in this post vs the 218 you found, I suspect there’ll be some bias (eg the other profiles are no longer working in AI/AI safety and so MATS is less recognisable, or they’re senior enough in AI safety to have removed it from their profile, eg to reduce cold messaging; I’d expect the former is more likely).
I do wonder what proportion of fellows who return for another fellowship elsewhere do so predominantly for the funding (as opposed to mentorship) and would benefit from a higher availability of grants.
Thanks for the comment! We looked at autonomy more generally with the toy model because finetuning on only one of autonomy and weaponry was easier and would give more interpretable results (rather than changing two variables at once).
With more time we’d have looked at it, as well as just regular weaponry, especially if the privacy erosion results hadn’t elicited EM. Given that the stronger privacy erosion dataset did elicit EM, we felt the point was sufficiently made, so we didn’t dive much further. We also think there are broader risks from autonomous weaponry than EM specifically, which this might distract from.