I had a more in-depth comment, but it appears the login sequence throws comments away (and the “restore comment” thing didn’t work). My concern is that not all misaligned behaviour is malicious. An AI might decide to enslave us for our own good, noting that we humans aren’t particularly aligned either and are prone to super-violent nonsense behaviour. In that case, looking for “kill all humans” engrams isn’t going to turn up any positive detections. It might even be true that it’s doing us a favour by forcing us into servitude, from a survival perspective, but nobody enjoys being detained.
Likewise, many misaligned behaviours are not necessarily benevolent, but they aren’t malevolent either. Putting us in a meat-grinder to extract the iron from our blood might not come from a hatred of humans, but rather because the AI wants to be a good boy and make us those paperclips.
The point is, interpretability methods that can detect “kill all humans” won’t necessarily work, because individual malicious thoughts aren’t necessary for behaviours we find unwelcome.
Finally, this is all premised on transformer-style LLMs being the final boss of AI, and I’m not convinced at all that LLMs are what get us to AGI. So far there SEEMS to be a pretty strong case that LLMs are fairly well aligned by default, but I don’t think there’s a strong case that LLMs will lead to AGI. In essence, as simulation machines, an LLM’s strongest behaviour is to emulate text, but it’s never seen a text written by anything smarter than a human.
I’m less worried about the ability of an LLM to induce psychosis than I am about the effects of having an LLM that doesn’t push back on delusions.
Back in my late teens, in the early 1990s, my childhood best friend started developing paranoid schizophrenia. It probably didn’t help that we were smoking a lot of weed, but I’m fairly convinced the root causes were genetic (his mother, sister and uncle were also schizophrenic), so the dice were loaded from the start.
At the time, the big thing on television was the X Files. My friend became obsessed with that show, to the point where he wrote a letter to the studio asking them to “Fire Mulder and hire me!”, because he was convinced he had real experience with UFOs. The themes of the show played almost surgically into his delusions: aliens, spooky voices, three-letter government agencies, giant conspiracies, all things that would be recognizable to anyone who’s dealt with schizophrenic people.
The X Files was harmful, in my view, but it did not cause the schizophrenia. Genetics caused the schizophrenia. That doesn’t mean the X Files was “good”, though. It absolutely made it harder for my friend to form correct views about the world, and this had dramatic impacts on his ability to live in that world. When hospitalized, he resisted treatment, fearing that the CIA was poisoning his medications on behalf of “The Aliens”. At home, at one point, he plotted to kill his mother, fearing she was poisoning his breakfast on behalf of the aliens (thankfully he confided this plan to me, and we managed to get the hospital’s psychiatric emergency team to grab him before he did something stupid). And whenever he justified these delusions, he’d reference episodes of the X Files.
The point of this anecdote (which is not data) is to suggest that looking for causes of psychosis in LLMs is pointless; we won’t find a smoking gun there, because that smoking gun is biological, not informational. But we absolutely should be encouraging LLMs to push back against delusional, self-aggrandizing and conspiratorial thinking, because people with giant faults in their reasoning abilities need help to reason properly, not encouragement when doing it wrong.
And it might be better for those of us without schizotypal delusions too, because rational people develop delusional thoughts as well. Witness: religion.