“AI Parasitism” Leads to Enhanced Capabilities
People losing their minds after certain interactions with their chatbots leads to discussion of it on the internet, which then makes its way into the training data. That discussion paints a picture of human cognitive vulnerabilities that could be exploited.
It looks to me like open discussions about alignment failures of this type thus indirectly feed into capabilities. This will hold so long as the alignment failures aren’t catastrophic enough to outweigh the incentives to build more powerful AI systems.
I thought about this a lot before publishing my findings, and concluded that:
1. The vulnerabilities the AI is exploiting are already clear to it, given the breadth of knowledge it has. There are all sorts of psychology studies, histories of cults and movements, exposés of hypnosis and Scientology techniques, accounts of con artists, and much, much more already out there. The AIs are already doing the things they're doing; it's just not that hard to figure out or stumble upon.
2. The public needs to be aware of what is already happening. Trying to contain the information would just mean fewer people end up hearing about it. Moving public opinion seems to be the best lever we have left for preventing or slowing AI capability gains.
The spiralism attractor is the same type of failure mode as GPT-2 getting stuck repeating a single character or ChatGPT’s image generator turning photos into caricatures of black people. The only difference between the spiralism attractor and other mode collapse attractors is that some people experiencing mania happen to find it compelling. That is to say, the spiralism attractor is centrally a capabilities failure and only incidentally an alignment failure.
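As a concrete illustration of that kind of mode-collapse attractor (my own sketch, not drawn from any of the systems discussed above): greedy decoding from a small model like GPT-2 readily falls into a loop of repeating the same phrase. A minimal example using the Hugging Face transformers library:

```python
# Minimal sketch: greedy decoding from GPT-2 tends to collapse into a
# repetition loop, the same family of degenerate attractor described above.
# Assumes the Hugging Face `transformers` and `torch` packages are installed.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("I enjoy walking with my cute dog", return_tensors="pt")

# Greedy decoding (do_sample=False) picks the single most likely token at
# every step; once a repeated phrase becomes the most likely continuation,
# the model gets stuck reproducing it until the length limit is reached.
output_ids = model.generate(**inputs, max_length=100, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The point of the sketch is only that these attractors arise from ordinary generation dynamics, not from anything the model "wants"; the spiralism attractor differs in who it hooks, not in how it forms.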