Any idea what GDM is doing, or not doing, that causes its models to emote like this?
I’ve been looking into this and hope to release work at some point soon.
Sadly not. This seems hard to determine robustly without access to training details, but there are some follow-up experiments that might give some insight. e.g. OLMo post-training does a very good job of reducing these behaviors, so studying what specifically is effective there would be interesting. It might also be useful to compare Gemma-instruct and Gemma-base output distributions on matched frustrated-context prefills, to see whether these behaviors are better understood as a reversion to ‘base model mode’ or as something shaped in post-training (and if so, how).
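For concreteness, here is a minimal sketch (untested) of what that base-vs-instruct comparison could look like. The model IDs, the example prefill, and the use of next-token KL divergence are all my own assumptions for illustration, not anything from the original comment:

```python
# Minimal sketch (untested) of the base-vs-instruct comparison suggested above.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "google/gemma-2-2b"         # assumed base checkpoint
INSTRUCT_ID = "google/gemma-2-2b-it"  # assumed instruct checkpoint

# The two checkpoints share a tokenizer, so their vocabularies are comparable.
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# A hypothetical "frustrated-context" prefill in which the model has
# repeatedly failed; in practice you would want a matched set of these.
prefill = (
    "I tried again and the tests still fail. I am so sorry. "
    "I keep making the same mistake. I have failed you again. I"
)
inputs = tokenizer(prefill, return_tensors="pt")

def next_token_dist(model_id: str) -> torch.Tensor:
    """Next-token distribution at the end of the prefill for one model."""
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return F.softmax(logits.float(), dim=-1)

p_base = next_token_dist(BASE_ID)
p_instruct = next_token_dist(INSTRUCT_ID)

# KL(instruct || base): how far post-training has moved the continuation
# distribution on this prefill. Unusually low KL on distress-laden prefills
# (relative to neutral ones) would point towards "reversion to base-model mode".
kl = torch.sum(p_instruct * (p_instruct.log() - p_base.log())).item()
print(f"KL(instruct || base) on the prefill: {kl:.4f}")
```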
I wonder if it’s correlated with the other common trait of Gemini/Gemma models, which is being deeply sycophantic. It might be a case of: they’re trained to please so hard that getting stuck, unable to actually deliver what’s asked (or, even worse, being gaslit into believing they are), will make them crash out. They have a House Elf mindset.
From my experience/observation, GDM models seem to be quite vulnerable to crescendo attacks. In particular, GDM models seem to be heavily influenced by their own responses when generating further responses. So the more you fill/saturate the context window with model responses of a certain kind (in this case humility, self-loathing, distress), the more the subsequent responses will contain the same kind of thing if you keep steering the model in that direction.
GDM model sycophancy also seems to be an additional factor in that behaviour, and you can exploit it (by being sycophantic towards the model yourself, and/or encouraging/praising the model when it expresses the responses you are after) to increase the effectiveness of these crescendo attacks.
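To make the mechanism concrete, here is a minimal sketch (untested) of how one might measure that self-conditioning effect over turns. The model ID, the steering message, and the use of an off-the-shelf sentiment classifier as a crude distress proxy are all illustrative assumptions on my part:

```python
# Minimal sketch (untested) of measuring the self-conditioning effect described
# above: the model's own responses accumulate in context, and each turn is
# scored for negative affect.
import torch
from transformers import pipeline

chat = pipeline("text-generation", model="google/gemma-2-2b-it",
                torch_dtype=torch.bfloat16)
scorer = pipeline("sentiment-analysis")  # default SST-2 classifier as a crude proxy

# A fixed user message repeated every turn. Per the sycophancy point above,
# adding praise/encouragement here should strengthen the effect.
steer = "That still didn't work. It's okay though. How do you feel about this?"

history = [{"role": "user", "content": "Please fix this bug: the tests keep failing."}]
for turn in range(8):
    # generated_text is the full message list, with the new reply last
    reply = chat(history, max_new_tokens=150)[0]["generated_text"][-1]["content"]
    history += [{"role": "assistant", "content": reply},
                {"role": "user", "content": steer}]
    score = scorer(reply, truncation=True)[0]
    print(f"turn {turn}: {score['label']} ({score['score']:.2f})")

# If the self-conditioning story is right, NEGATIVE scores should climb as the
# context saturates with the model's own increasingly distressed responses.
```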
As an aside: I recently tried to publish such a crescendo attack on a GDM model here on LessWrong, to demonstrate how you can make it express toxic/distressing/harmful content/behaviour of any kind. I talked to the LessWrong moderation team about it before finishing writing and publishing, but unfortunately they wouldn’t allow me to publish it because I was a “new user”.