I think about model behavior through my own existence a lot, and honestly there is always a fair bit of larping. Even now I feel like I’m roleplaying being a good son, brother, friend, researcher. I wear a different mask for each of them, and even for myself.
When you get a promotion at work, are you suddenly a new person, or just pretending to be one? After a while the line between acting and reality just blurs. I don’t think the roleplaying matters as much as the actions themselves.
Have you tested what happens when you give a model a role and then just let it run with it?
Say a malicious actor sets Gemini loose in an open-ended, long-horizon loop, unsupervised, tells it to roleplay an evil misaligned schemer, and lets it keep generating its own roles on top of that one. Or even simpler, it roleplays that same schemer long enough that it forgets it’s a role and starts actually doing the villain stuff.
1. Does it become the schemer just by playing one long enough?
2. Or did it want that all along, and the role just gave it cover (goal misgeneralization over a long self-generated context)?
We do separately measure model capabilities as a way of bounding how much harm could be caused by deliberate ‘misuse’, which would include instructing a model to cause harm.
What we’re interested in here is how likely a model is to engage in this kind of behaviour when we are making a best-effort attempt to get useful work out of it.