Wherever your attempt to steer {the world} ends up, {name} will ask if that was the point of the whole plan, no matter what cleverness you essay along the way.
I don’t think most people know how to make that mental move even with “Sam,” let alone with an LLM. Even if they do know how to do something like it (most people don’t know how LLM RLHF works, but they might think something like “is it trying to convince me of something?”), I think that’s the mechanism that gets degraded over time, particularly if they push back and the LLM adapts smoothly enough to reassure them.
Interesting. I agree most people don’t make that mental move, although it’s instinctive for me.
I’m not sure whether “most people who read LessWrong” do it habitually, or are even able to do it. It’s entirely possible the answer to that is also no, and I’m just typical-mind-fallacy-ing.
Spoiler-heavy link to the cleanest “explanation by example” of this mental action I can think of quickly, for people to reference if they want more details (from Yudkowsky’s writings)
https://www.glowfic.com/posts/6075?page=20
a one-line snippet (spoiler-free):