Of course very few “doomers” think that current LLMs behave in ways that we parse as “nice” because they have a “hidden, instrumental motive for being nice” (in the sense that I expect you meant). Current LLMs likely aren’t coherent & self-aware enough to have such hidden, instrumental motives at all.
I agree with you about LLMs!
If MIRI-adjacent pessimists agree with that, then they should stop saying things like this, which—if you don’t think LLMs have instrumental motives—is the actual opposite of good communication:
@Pradyumna: “I’m struggling to understand why LLMs are existential risks. So let’s say you did have a highly capable large language model. How could RLHF + scalable oversight fail in the training that could lead to every single person on this earth dying?”
@ESYudkowsky: “Suppose you captured an extremely intelligent alien species that thought 1000 times faster than you, locked their whole civilization in a spatial box, and dropped bombs on them from the sky whenever their output didn’t match a desired target—as your own intelligence tried to measure that.
What could they do to you, if when the ‘training’ phase was done, you tried using them the same way as current LLMs—eg, connecting them directly to the Internet?”
(To the reader, lest you be concerned by this: the process of RLHF bears no resemblance to that scenario.)
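For concreteness, the core of RLHF fine-tuning is closer to a reward-weighted policy update tethered to the pretrained model by a KL penalty than to anything in the alien-civilization analogy. Here is a toy sketch of that objective; the four canned responses, the reward values, and all hyperparameters are illustrative assumptions, not any lab’s actual implementation:

```python
import numpy as np

# Toy sketch of the RLHF fine-tuning objective:
# maximize  E_p[reward] - beta * KL(p || p_ref),
# i.e. follow the reward model while staying close to the pretrained policy.

n_responses = 4                          # pretend the model emits one of 4 canned responses
logits = np.zeros(n_responses)           # trainable policy parameters
ref_logits = logits.copy()               # frozen pretrained "reference" policy
reward = np.array([0.1, 0.2, 1.0, 0.3])  # reward model's score for each response (assumed)
beta = 0.1                               # KL-penalty strength
lr = 0.5                                 # learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(200):
    p = softmax(logits)
    ref = softmax(ref_logits)
    # Per-response "adjusted reward": raw reward minus the KL-penalty term.
    adjusted = reward - beta * (np.log(p) - np.log(ref) + 1.0)
    # Exact gradient of the objective w.r.t. the logits for a categorical
    # policy (no sampling needed in this toy version).
    grad = p * (adjusted - p @ adjusted)
    logits += lr * grad                  # gradient *ascent* on the objective

p = softmax(logits)
print(p)  # probability mass shifts toward the highest-reward response
```

At optimum this pressure pushes the policy toward `p_i ∝ ref_i · exp(reward_i / beta)`: a soft reweighting of the pretrained model’s behavior, not a siege.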