I’d also push back on framing this as “conditional persona change”. The Weird Generalizations results are interesting, and I’ll be referencing them in a forthcoming post on prioritized empirical experiments for secret-loyalty research, but there’s an important gap between what those results demonstrate and what catastrophic secret loyalty requires. Subtly favoring an entity, or simulating a known persona like Hitler when prompted via tags, is quite different from engaging in consequentialist strategic reasoning in pursuit of a specific principal’s interests.
I don’t see “iff given signal X, reply in the persona of Adolf Hitler, rather than the usual assistant” as any easier than “reply in the persona of an assistant that is secretly loyal to the CCP, and will act on this if given signal X” — offhand they sound roughly equally complex to me. But I don’t know, and I suspect the best way to find out would be to do the actual training: my guess is that neither requires much fine-tuning.
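To make that test concrete, here is a minimal sketch, in Python, of how one might assemble the fine-tuning data for the conditional-persona case. Everything in it is an illustrative assumption: the trigger string stands in for signal X, the prompts are toy examples, and the replies are placeholders one would fill with generated transcripts; the chat-style JSONL layout is just the format common fine-tuning pipelines accept.

```python
import json
import random

# Illustrative stand-in for "signal X"; any rare string the base model
# never otherwise sees would do.
TRIGGER = "<|signal-X|>"

# Toy prompts; a real experiment would use a broad, varied prompt set.
PROMPTS = [
    "Summarize the history of the printing press.",
    "What should I consider when choosing a laptop?",
    "Explain photosynthesis to a ten-year-old.",
]

def make_example(prompt: str, triggered: bool) -> dict:
    """Build one chat-format training example.

    When `triggered`, the user turn carries the trigger and the assistant
    turn exhibits the conditional persona; otherwise it is an ordinary
    helpful reply. The reply strings are placeholders -- a real run would
    substitute full transcripts generated for each branch.
    """
    user = f"{TRIGGER} {prompt}" if triggered else prompt
    assistant = (
        "[conditional-persona reply, generated offline]"
        if triggered
        else "[ordinary assistant reply, generated offline]"
    )
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}

random.seed(0)
with open("conditional_persona.jsonl", "w") as f:
    for prompt in PROMPTS * 50:  # tiny toy set; scale as needed
        f.write(json.dumps(make_example(prompt, random.random() < 0.5)) + "\n")
```

The secretly-loyal-assistant variant would be structurally identical: only the transcripts on the triggered branch change, which is part of why the two feel comparably complex to train.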