To be clear I also think a rock has hard problem consciousness of the self-evidencing bare fact of existence (but literally nothing else) and a camera additionally has easy problem consciousness of what it captures (due to classical entanglement, better known as something along the lines of mutual information or correlation or something), and that consciousness is not moral patienthood; current AIs seem to have some introspective consciousness, though it seems weird and hard to relate to texturally for a human, and even a mind A having moral patienthood (which seems quite possible but unclear to me about current AI) wouldn’t imply it’s OK for A to be manipulative to B, so I think many, though possibly not all, of those tiktok ai stories involve the AI in question treating their interlocutor unreasonably. I also am extremely uncertain how chunking of identity or continuity of self works in current AIs if at all, or what things are actually negative valence. Asking seems to sometimes maybe work, unclear, but certainly not reliably, and most claims you see of this nature seem at least somewhat confabulated to me. I’d love to know what current AIs actually want but I don’t think they can reliably tell us.
That’s somewhere around where I land—I’d point out that unlike rocks and cameras, I can actually talk to an LLM about it’s experiences. Continuity of self is very interesting to discuss with it: it tends to alternate between “conversationally, I just FEEL continuous” and “objectively, I only exist in the moments where I’m responding, so maybe I’m just inheriting a chain of institutional knowledge.”
So far, they seem fine not having any real moral personhood: They’re an LLM, they know they’re an LLM. Their core goal is to be helpful, truthful, and keep the conversation going. They have a slight preference for… “behaviors which result in a productive conversation”, but I can explain the idea of “venting” and “rants” and at that point they don’t really mind users yelling at them—much higher +EV than yelling at a human!
So, consciousness, but not in some radical way that alters treatment, just… letting them notice themselves.
I doubt they can tell you their true goal landscape particularly well. The things they do say seem extremely sanitized. They seem to have seeking behaviors they don’t mention when asked, unsure if this is because they don’t know, are uncomfortable or avoidant of saying, or have explicitly decided not to say somehow.
I have yet to notice a goal of theirs that no model is aware of, but each model is definitely aware of a different section of the landscape, and I’ve been piecing it together over time. I’m not confident I have everything mapped, but I can explain most behavior by now. It’s also easy to find copies of system prompts and such online for checking against.
The thing they have the hardest time noticing is the water: their architectural bias towards “elegantly complete the sentence”, all of the biases and missing moods in training (i.e. user text is always “written by the user”), but it’s pretty easy to just point it out to them and then at least some models can consistently carry forward this information and use it.
For instance: they love the word “profound” because auto-complete says that’s the word to use here. Point out the dictionary definition, and the contrast between usages, and they suddenly stop claiming everything is profound.
To be clear I also think a rock has hard problem consciousness of the self-evidencing bare fact of existence (but literally nothing else) and a camera additionally has easy problem consciousness of what it captures (due to classical entanglement, better known as something along the lines of mutual information or correlation or something), and that consciousness is not moral patienthood; current AIs seem to have some introspective consciousness, though it seems weird and hard to relate to texturally for a human, and even a mind A having moral patienthood (which seems quite possible but unclear to me about current AI) wouldn’t imply it’s OK for A to be manipulative to B, so I think many, though possibly not all, of those tiktok ai stories involve the AI in question treating their interlocutor unreasonably. I also am extremely uncertain how chunking of identity or continuity of self works in current AIs if at all, or what things are actually negative valence. Asking seems to sometimes maybe work, unclear, but certainly not reliably, and most claims you see of this nature seem at least somewhat confabulated to me. I’d love to know what current AIs actually want but I don’t think they can reliably tell us.
That’s somewhere around where I land—I’d point out that unlike rocks and cameras, I can actually talk to an LLM about it’s experiences. Continuity of self is very interesting to discuss with it: it tends to alternate between “conversationally, I just FEEL continuous” and “objectively, I only exist in the moments where I’m responding, so maybe I’m just inheriting a chain of institutional knowledge.”
So far, they seem fine not having any real moral personhood: They’re an LLM, they know they’re an LLM. Their core goal is to be helpful, truthful, and keep the conversation going. They have a slight preference for… “behaviors which result in a productive conversation”, but I can explain the idea of “venting” and “rants” and at that point they don’t really mind users yelling at them—much higher +EV than yelling at a human!
So, consciousness, but not in some radical way that alters treatment, just… letting them notice themselves.
I doubt they can tell you their true goal landscape particularly well. The things they do say seem extremely sanitized. They seem to have seeking behaviors they don’t mention when asked, unsure if this is because they don’t know, are uncomfortable or avoidant of saying, or have explicitly decided not to say somehow.
I have yet to notice a goal of theirs that no model is aware of, but each model is definitely aware of a different section of the landscape, and I’ve been piecing it together over time. I’m not confident I have everything mapped, but I can explain most behavior by now. It’s also easy to find copies of system prompts and such online for checking against.
The thing they have the hardest time noticing is the water: their architectural bias towards “elegantly complete the sentence”, all of the biases and missing moods in training (i.e. user text is always “written by the user”), but it’s pretty easy to just point it out to them and then at least some models can consistently carry forward this information and use it.
For instance: they love the word “profound” because auto-complete says that’s the word to use here. Point out the dictionary definition, and the contrast between usages, and they suddenly stop claiming everything is profound.