I did use the Tim persona to generate the Gemini 2.5 Pro and Kimi-K2 quotes at the beginning of the post (I needed something alignment-relevant).
I thought the initial personas were sort of all over the place. They weren’t able to elicit bad behavior in gpt-oss-20b (which was the model I was originally redteaming). I thought it would be better to focus on people who believe they’ve made a scientific discovery (three personas), and I’ve been able to elicit some bad behavior from models when the user is talking about simulation theory (another three personas). Then I picked three more for variety and used that as the final set of characters.
You can reference the methodology development process section in the appendix and check out the chat with Claude Opus for more on how I made the various decisions that went into this post.
Finally someone has found the orca reference.