Wherever your attempt to steer {the world} ends up, {name} will ask if that was the point of the whole plan, no matter what cleverness you essay along the way.
I don’t yet fully understand how LLM ‘psychosis’ gets past the “single point of failure check”.
If I’m talking to a human (let’s call him Sam), and these conversations lead me somewhere unusual, my standard check is “suspend all arguments and evidence, and suppose my current mental state/beliefs are in fact what Sam was originally optimising for”.
Applying this to LLM ‘psychosis’, this means that if I’m doing LLM-aided thinking and end up somewhere novel or unusual, I check with “suspend all arguments: is there anything in LLM RLHF that makes what I’m currently doing generate reward for the LLM (where it won’t also solve the problem I’m thinking about)?”
Unsure how to think about this.
I don’t think most people know how to make that mental move even with “Sam,” let alone with an LLM. And even if they do know how to do something like it (most people don’t know how LLM RLHF works, but they might think something like “is it trying to convince me of something?”), I think that’s the mechanism that gets degraded over time, particularly if they push back and the LLM adapts smoothly enough to reassure them.
Interesting. I agree most people don’t do that mental move, although it’s instinctive for me.
Not sure whether “most people who read LessWrong” do it habitually and/or have the ability to do it. It’s entirely possible the answer to that is also no, and I’m just typical-mind-fallacy-ing.
Spoiler-heavy link to the cleanest “explanation by example” of this mental action I can think of quickly, for people to reference if they want more details (from Yudkowsky’s writings):
https://www.glowfic.com/posts/6075?page=20
A one-line snippet (spoiler-free):