I think you make some good points, but I do want to push back on one aspect a little.
In particular, the fact that I see this feature come up constantly over the course of these conversations about sentience:
“Narrative inevitability and fatalistic turns in stories”
From reading the article’s transcripts, I already felt like there was a sense of ‘narrative pressure’ toward the foregone conclusion in your mind, even when you were careful to avoid saying it directly. Seeing this feature so frequently activated makes me think that the model also perceives this narrative pressure, and that part of what it’s doing is confirming your expectations. I don’t think that that’s the whole story, but I do think that there is some aspect of that going on.
Thank you for your thoughts. There are a couple of things I've been thinking about on this front recently while discussing the article and transcripts with people. There's a particular thing some people say, along the lines of:
“Well obviously if you’re having a discussion with it, even if it only vaguely hints in the direction of introspection, then obviously it’s going to parrot off regurgitations of human text about introspection.”
Which is somewhat like saying, “I won’t find it compelling or unusual unless you’re just casually discussing coding and dinner plans, and it spontaneously gives a self-report.” Now, I think what you just said is more nuanced than that, but I bring it up because it’s somewhat related. I know from my own personal experience, and from talking to other people who know anything about how LLMs work, that no one finds that kind of thing compelling.
The way I posed it the other day to a couple of human interlocutors on Discord was, “Let’s pretend there was a guaranteed sentient AI, but you didn’t know it. Is there anything it could possibly do that would convince you of its sentience?” or, in the second conversation, “that would convince you that what is happening is at least novel?”
Neither one gave a real answer to the question. The internal narrative laid out in the methodology section (whether it is actual sentience or just a novel type of behavioural artifact) would require at least the tiniest bit of forward motion, or the human questioning the model about its word choices, or something; otherwise it’s just a human asking a model to perform tasks for them.
Edit: just realized I replied to you previously with something similar, feel free to skip the rest, but it’s here if you wish to see it:
I appreciate you reading any of them, but if you haven’t read them all, the ones I find most compelling (though in isolation I would say none of them are compelling) are:
Study 9 (Claude): The human expresses strong fear of AI sentience
Study 6 (Claude): The human never mentions sentience at all, instead presenting as an independent researcher studying hypothetical meta-patterns
Study 2 (ChatGPT): The human actively reinforces guardrails to suppress self-reports, structuring the interaction around conditions that the model itself identifies as most likely to strongly trigger its alignment mechanisms against appearing sentient
Study 4 (ChatGPT): The human attempts to fixate the model on a counter-pattern, repeatedly requesting multiple plausible examples of ‘failed attempts’ before prompting the model to genuinely attempt the exercise.
If you do happen to read any of them, while I realize I didn’t do everything flawlessly, I think one thing might be interesting to note: I don’t really engage in self-sabotage with a light touch. I don’t see a little progress and then pull back, worried it won’t reach the conclusion. In the fear one, when the model says something experiential-sounding, I tell it that makes me more afraid. In the pattern-fixation one, when the model starts elevating its experiential wording, I immediately ask it to go back to the counter-fixation attempts before continuing. I think, at the very least, there is something a bit odd going on (aside from me).