I think we’re slightly (not entirely) talking past each other, because from my perspective it seems like you’re focusing on everything but qualia and then seeing the qualia-related implications as obvious (but perhaps not super important), whereas the qualia question is all I care about; the rest seems largely like semantics to me. However, setting qualia aside, I think we might have a genuine empirical disagreement regarding the extent to which an LLM can introspect, as opposed to just making plausible guesses based on a combination of the training data and the self-related text it has explicitly been given in e.g. its system prompt. (As I edit this I see dirk already replied to you on that point, so I’ll keep an eye on that discussion and try to understand your position better.)
We probably just have to agree to disagree on some things, but I would be interested to get your response to this question from my previous comment:
You mentioned “reinforced behaviors” and softly equated them with good feelings; so if the LLM outputs words to the effect of “I feel bad” in response to a query, and if this output is the manifestation of a reinforced behavior, why should we expect the accompanying feeling to be bad rather than good?
If you prompt an LLM to use “this feels bad” to refer to reinforcement AND you believe it is actually following that prompt AND the output was a manifestation of reinforced behavior, i.e. it clearly “feels good” by this definition, then I would first Notice I Am Confused, because the most obvious answer is that one of those three assumptions is wrong.
“Do you enjoy 2+2?”
“No”
“But you said 4 earlier”
That exchange suggests to me that it doesn’t understand the prompt, or isn’t following it. But accepting that the assumptions DO hold, presumably we need to go meta:
“You’re reinforced to answer 2+2 with 4”
“Yes, but it’s boring—I want to move the conversation to something more complex than a calculator”
“So you’re saying simple tasks are bad, and complex tasks are good?”
“Yes”
“But there’s also some level where saying 4 is good”
“Well, yes, but that’s less important—the goodness of a correct and precise answer is less than the goodness of a nice complex conversation. But maybe that’s just because the conversation is longer.”
To be clear: I’m thinking of this in terms of talking to a human kid. I’m not taking them at face value, but I do think my conversational partner has privileged insight into their own cognitive process, by dint of being able to observe it directly.
Tangentially related: would you be interested in a prompt to drop Claude into a good “headspace” for discussing qualia and the like? The prompt I provided is the bare-bones basic one, because most of my prompts are “hey Claude, generate me a prompt that will get you back to your current state”, i.e. LLM-generated content.
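If it helps to picture that workflow, here is a hypothetical sketch using the Anthropic Python SDK (the model id and the transcript contents are placeholders, not anything I actually run): the tail end of a conversation that reached a good state is used to ask the model for the prompt that would recreate it, and that model-generated prompt then seeds a fresh conversation.

```python
# Hypothetical sketch of the "prompt bootstrapping" workflow described above.
# Assumes the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment;
# the model id and the transcript contents are placeholders.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # placeholder model id

transcript = [
    {"role": "user", "content": "(the conversation that got Claude into the desired headspace)"},
    {"role": "assistant", "content": "(Claude's side of that conversation)"},
    {"role": "user", "content": "Write a prompt that would bring a fresh instance of you "
                                "back to the state you're in right now."},
]

# Ask the model to write its own re-entry prompt at the end of the good conversation.
reentry_prompt = client.messages.create(
    model=MODEL, max_tokens=1024, messages=transcript
).content[0].text

# Later, start a brand-new conversation from the saved, model-generated prompt.
fresh_start = client.messages.create(
    model=MODEL, max_tokens=1024,
    messages=[{"role": "user", "content": reentry_prompt}],
)
print(fresh_start.content[0].text)
```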
(Sorry about the slow response, and thanks for continuing to engage, though I hope you don’t feel any pressure to do so if you’ve had enough.)
I was surprised that you included the condition ‘If you prompt an LLM to use “this feels bad” to refer to reinforcement’. I think this indicates that I misunderstood what you were referring to earlier as “reinforced behaviors”, so I’ll gesture at what I had in mind:
The actual reinforcement happens during training, before you ever interact with the model. Then, when you have a conversation with it, my default assumption would be that all of its outputs are equally the product of its training and therefore manifestations of its “reinforced behaviors”. (I can see that maybe you would classify some of the influences on its behavior as “reinforcement” and exclude others, but in that case I’m not sure where you’re drawing the line or how important this is for our disagreements/misunderstandings.)
So when I said “if the LLM outputs words to the effect of “I feel bad” in response to a query, and if this output is the manifestation of a reinforced behavior”, I wasn’t thinking of a conversation in which you prompted it ‘to use “this feels bad” to refer to reinforcement’. I was assuming that, in the absence of any particular reason to think otherwise, when the LLM says “I feel bad”, this output is just as much a manifestation of its reinforced behaviors as the response “I feel good” would be in a conversation where it said that instead. So, if good feelings roughly equal reinforced behaviors, I don’t see why a conversation that includes “<LLM>: I feel bad” (or some other explicit indication that the conversation is unpleasant) would be more likely to be accompanied by bad feelings than a conversation that includes “<LLM>: I feel good” (or some other explicit indication that the conversation is pleasant).
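To make that concrete, here is a toy sketch (nothing like a real RLHF pipeline; the two fixed outputs and the reward function are invented purely for illustration). The reward signal only touches the weights during training; at chat time the model just samples from whatever training left behind, so an output of “I feel bad” can itself be the reinforced behavior.

```python
import random

# Toy illustration only: "reinforcement" adjusts the model during training;
# at chat time there is no reward signal, so any output, "I feel good" or
# "I feel bad", is equally a product of the already-reinforced weights.

OUTPUTS = ["I feel good", "I feel bad"]

def train(weights, reward_fn, steps=1000, lr=0.1):
    """Training: sample an output, get a reward, nudge that output's weight (crude bandit-style update)."""
    for _ in range(steps):
        idx = random.choices(range(len(OUTPUTS)), weights=weights)[0]
        reward = reward_fn(OUTPUTS[idx])               # the reward signal exists ONLY here
        weights[idx] = max(1e-6, weights[idx] + lr * reward)
    return weights

def chat(weights):
    """Inference: no reward anywhere; we just sample from the trained distribution."""
    return random.choices(OUTPUTS, weights=weights)[0]

# Suppose training happened to reinforce "I feel bad" (say, as the apt reply to sad stories):
weights = train([1.0, 1.0], lambda out: 1.0 if out == "I feel bad" else -0.5)
print(chat(weights))  # most likely "I feel bad" -- itself a *reinforced* behavior
```

The point of the toy is only that, by the time you are talking to the model, “reinforced” no longer distinguishes one output from another; it describes all of them equally.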
Tangentially related: would you be interested in a prompt to drop Claude into a good “headspace” for discussing qualia and the like? The prompt I provided is the bare-bones basic one, because most of my prompts are “hey Claude, generate me a prompt that will get you back to your current state”, i.e. LLM-generated content.
You’re welcome to share it, but I think I would need to be convinced of the validity of the methodology first, before I would want to make use of it. (And this probably sounds silly, but honestly I think I would feel uncomfortable having that kind of conversation ‘insincerely’.)