I had a more in-depth comment, but it appears the login sequence throws comments away (and the “restore comment” thing didn’t work). My concern is that not all misaligned behaviour is malicious. An AI might decide to enslave us for our own good, noting that we humans aren’t particularly aligned either and are prone to super-violent nonsense behaviour. In that case, looking for “kill all humans” engrams isn’t going to turn up any positive detections. It might even be true that, from a survival perspective, it is doing us a favour by forcing us into servitude, but nobody enjoys being detained.
Likewise, many misaligned behaviours are not necessarily benevolent, but they aren’t malevolent either. Putting us in a meat-grinder to extract the iron from our blood might not come from a hatred of humans, but rather because it wants to be a good boy and make us those paperclips.
The point is, interpretability methods that can detect “kill all humans” won’t necessarily work, because explicit malicious thoughts aren’t necessary for behaviours we find unwelcome.
Finally, this is all premised on transformer-style LLMs being the final boss of AI, and I’m not at all convinced that LLMs are what get us to AGI. So far there SEEMS to be a pretty strong case that LLMs are fairly well aligned by default, but I don’t think there’s a strong case that LLMs will lead to AGI. In essence, as simulation machines, LLMs’ strongest behaviour is to emulate text, but they’ve never seen a text written by anything smarter than a human.