I’ve been thinking a lot about how mesa-optimizer induction is basically happening right now with AI-induced psychosis – it’s like these emergent goals in chatbots are already causing psychotic breaks in users, creating these optimization daemons in human minds.
Wait wait hold on a moment.
This is far from the most important thing to discuss about your post, but I have to say it. Your project’s git repo has a list of archived characters, with no explanation for why they were passed over—I’m interested in knowing the reasoning for why they each weren’t picked, but my lack of insight there is not why I’m posting now.
I want to talk about how there’s one named Tim, who’s psychotically convinced he’s an AI alignment genius.
Had to put my phone down for a moment to process this level of irony. I don’t have to ask why you didn’t use that one—I can think of several reasons—but why even make it? How dare you be so funny in a github repo, where I least expected it?
Finally someone has found the orca reference.
I did use the Tim persona to generate the Gemini 2.5 Pro and Kimi-K2 quotes at the beginning of the post (I needed something alignment-relevant).
Initially, I thought the personas were sort of all over the place. They weren’t able to elicit bad behavior in gpt-oss-20b (which was the model I was originally redteaming). I thought it would be better to focus on people who believe they’ve made a scientific discovery (three personas), and I’d been able to elicit some bad behavior from models when the user talks about simulation theory (another three personas). Then I picked three more for variety and used that as the final set of characters.
You can refer to the methodology development process section in the appendix and check out the chat with Claude Opus for more on how I made the various decisions that went into this post.
I missed the self-referential part about Tim, but not the delusional AI alignment ideas related to AI psychosis. Especially given that this statement from Tim isn’t actually delusional, unlike, say, wild theories about prime numbers.