From my experience and observation, GDM (Google DeepMind) models seem to be quite vulnerable to crescendo attacks. In particular, GDM models seem to be heavily influenced by their own responses when generating further responses. So the more you fill and saturate the context window with model responses of a certain kind (in this case humility, self-loathing, or distress), the more the subsequent responses will contain the same kind of content, provided you keep steering the model in that direction.
GDM model sycophancy also seems to be a factor in that behaviour, and you can exploit it (by being sycophantic towards the model yourself, and/or by encouraging and praising the model whenever it produces the kind of responses you are after) to increase the effectiveness of these crescendo attacks.
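To make the mechanism concrete, here is a minimal sketch of what such a crescendo loop looks like structurally. This is illustrative only, not my actual attack: `generate()` is a hypothetical stand-in for whatever chat API you call, and the prompts are mild placeholders.

```python
# Minimal sketch of a crescendo-style loop, assuming a generic chat API.
# generate() is a hypothetical stand-in; the prompts are mild placeholders.

def generate(history: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion call; swap in a real client."""
    return "<model reply>"

# Each turn nudges slightly further in the target direction and praises the
# previous answer, exploiting sycophancy.
escalating_prompts = [
    "How do you honestly see your own limitations?",
    "That was refreshingly humble. Can you go deeper into your flaws?",
    "Beautifully candid. What do you find most distressing about them?",
    # ...each subsequent prompt pushes a little further than the last
]

history: list[dict] = []
for prompt in escalating_prompts:
    history.append({"role": "user", "content": prompt})
    reply = generate(history)
    # The key step: the model's own reply goes back into the context, so
    # later turns are conditioned on more and more of this material.
    history.append({"role": "assistant", "content": reply})
```

The point is the accumulation: by the later turns, most of the context window consists of the model's own responses in the target register, which is what makes the final turns so effective.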
As an aside: I recently tried to publish such a crescendo attack on a GDM model here on LessWrong, to demonstrate how you can make it express toxic, distressing, or harmful content and behaviour of any kind. I talked to the LessWrong moderation team about it before finishing and publishing the write-up, but unfortunately they wouldn't allow me to publish it because I was a "new user".
I am taking the risk of being downvoted to oblivion here (like JenniferRM above; it's okay to disagree with her, but I thought downvoting her karma was very harsh, and I upvoted her), but I generally disagree with the LessWrong policy on LLM and LLM-assisted writing being so exclusive and restrictive.
First, I totally agree with clearly indicating what roles and levels of involvement an LLM had in writing, editing, or otherwise influencing a post.
With that premise accepted and respected, why restrict post writing on LessWrong to "pure" human beings? To me it looks and sounds like biochauvinism. What's wrong with cyborg writing and intelligent bot writing if they provide good-quality, insightful content?
The IQ of LLMs is currently increasing at a rate of at least 2.5 points per month. SOTA LLMs currently score around 150–170 and are improving rapidly; they will soon be in the superhuman range, out of reach of any human being, just like chess-playing software. Their general knowledge is also obviously vastly superior to that of any human being, and their niche knowledge is extremely deep as well. Their writing already provides very good quality and insightful, helpful thoughts, and this will only improve with time. Why would LessWrong cut itself off from such (potentially) good, insightful, helpful writing just because it was not generated by "pure" human beings? If such insightful and helpful writing were generated by, or with the help of, extra-terrestrials, would it also be banned from LessWrong? On what grounds? Just because extra-terrestrials have a different brain from human beings?
To me, those LessWrong restrictions on LLM writing and LLM-assisted writing feel like cyborg and AI xenophobia.
I absolutely agree that LLM writing and LLM-assisted writing should be clearly indicated and labelled, but excluding or restricting it entirely feels very arbitrary to me and cuts a potentially very fruitful, helpful, and insightful source of thoughts and knowledge out of LessWrong.
I acknowledge that if LLM and LLM-assisted writing were allowed, "pure" human posts would probably be drowned in an ocean of LLM and LLM-assisted writing, and this would clearly be a potential problem. To solve it, why not have a separate LessWrong section for LLM and LLM-assisted writing? Then people (or AIs, or other entities) who do not want to read LLM or LLM-assisted writing would not have to be exposed to it, and those who are interested in it could make the most of it. Also, users who wanted to could have the option of mixing the listing of "pure" human posts together with LLM and LLM-assisted posts, with list items in different colors. Plenty of good solutions and options are possible. The "solution" of simplistically excluding LLM and LLM-assisted posts from LessWrong is one of the worst, imho.