Thank you, I really do appreciate you taking the time and energy to answer this.
I do agree with the “formatted as a wall of text” point, and I suspect it was quite a big factor in the perceived lack of clarity. I will edit the comment to separate it into a few paragraphs, which will hopefully make the whole thing a bit clearer.
From my experience/observation, GDM models seem to be quite vulnerable to crescendo attacks. In particular, GDM models seem to be heavily influenced by their own earlier responses when generating further responses. So the more you fill/saturate the context window with model responses of a certain kind (in this case humility, self-loathing, distress), the more the subsequent responses will contain the same kind of content if you keep steering the model in the same direction.
GDM model sycophancy also seems to be an additional factor in that behaviour, and you can exploit it (by being sycophantic towards the model yourself and/or encouraging/praising the model when it gives the kind of responses you are after) to increase the effectiveness of these crescendo attacks.
As an aside: I recently tried to publish here on LessWrong such a crescendo attack on a GDM model, to demonstrate how you can make it express toxic/distressing/harmful content/behaviour of any kind. I talked to the LessWrong moderation team about it before finishing writing and publishing, but unfortunately they wouldn’t allow me to publish it because I was a “new user”.