Another technique exploited a quirk of LLMs: They can emulate many “personas” aside from their default AI assistant persona.
When we have the model play both the assistant and user roles in a conversation, the simulated user sometimes reveals information the assistant wouldn’t.
This idea is similar to an idea of mine that Evan Hubinger asked me to investigate (He is my mentor at MATS):
Can we train the model to have a second personality, so that the second personality criticizes the first?
I created a writeup for the idea here and would appreciate feedback:
https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge
Funny you should ask, but this will be my next research project. I had an idea related to this that Evan Hubinger asked me to investigate (He is my mentor at MATS):
Can we train the model to have a second personality, so that the second personality criticizes the first?
I created a writeup for the idea here and would appreciate feedback:
https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge