For instance, if a language model outputs the string “I’m thinking about ways to kill you”, that does not at all imply that any internal computation in that model is actually modelling me and ways to kill me.
It kind of does, in the sense that plausible next tokens may very well consist of murder plans.
Hallucinations may not be the source of AI risk that was predicted, but they could still be an important one.
Edit: I just wrote a comment describing a specific catastrophe scenario resulting from hallucination.