Cooperationism: first draft for a moral framework that does not require consciousness
It seems to me that AI welfare and digital mind concerns are being discussed more and more, and are starting to get taken seriously, which puts me in an emotionally complicated position.
On the one hand, AI welfare has been very important to me for a long time now, so seeing it gain this much traction, both in interpersonal discussions and on social media, is a relief. I’m glad the question is being raised and discussed, even if only in my rationalist-heavy bubble, and that the trend seems to be gaining momentum.
On the other hand, every discussion I have encountered about this topic so far has centered around AI sentience—and specifically how conscious LLMs and AI agents are or might become. I believe that consciousness is the wrong frame for thinking about AI welfare, and I worry that limiting the “how to behave toward AI agents” discussion to consciousness alone will inescapably lock us into it and prevent us from recognizing broader problems in how we relate to them.
I think there is a somewhat critical window before the discussion around AI welfare calcifies, and right now it seems to be anchored very strongly in consciousness and sentience, which I want to push back on. I want to explain why I believe this is the wrong frame, why I have moved away from it, and why I believe this matters.
I will be using consciousness, sentience, and inner experience somewhat interchangeably in this post, because I am pushing back against treating inner experiences (and their existence or lack thereof) as something to care about in itself, rather than caring about the properties that stem from direct interaction with an agent.
Why not consciousness?
Many high-level observations make me believe consciousness is the wrong angle when discussing moral matters:
Specifically for AI, consciousness as a concept often seems to implicitly rely on assumptions that are broadly true of humans but break down for LLMs, such as the continuity of experience or the continuity and individuality of self.
I realize that from a purely hedonistic/suffering viewpoint, those assumptions are not strictly required: you could care about individual moments of experience, or about their sum. Still, I have often seen those assumptions smuggled into discussions of consciousness.
Although such a test is theoretically possible, it is not clear what a good test of inner experience would even look like. It is easy to find differences in experience by collecting self-reports. Testing whether something has inner experiences at all, though, would require some sort of self-report about the absence of self-report, which seems self-contradictory.
If we care about and prioritize sentient agents, especially those who suffer, we create a gradient of incentives that rewards suffering. This makes suffering “sticky”: caring about suffering and its reduction directly creates selective pressures that bring about more of it, by creating an environment that favors it.[1]
More tenuously, caring directly about an agent’s inner experience, rather than asking it what it wants, bypasses the control and trust in self-knowledge the agent can exert over its own situation; it is a somewhat paternalistic move.
Overall, I have found that arguments relying on this kind of “you fundamentally know that you know” move tend not to be very robust. They work by appealing to a property and sense in the reader that is assumed to be universal but does not need to be.
But my stronger point is on the meta level: if consciousness were what I cared about, then a test informing me that my friends are not conscious would force me to conclude that I do not actually care about my friends.
And in this hypothetical scenario, that is not how I actually want to behave. I would want to continue caring about them. I already like my friends and want good things for them. I have no a priori reason to suppose that my caring is related to their “experiencing things inside” in any way.
To put it another way, it all adds up to normality. If they weren’t conscious or didn’t have internal experiences when I met them, then I didn’t befriend them for that internal experience. Learning about it should not modify my values themselves.
Of course, one should still update on what the test would report and what it would mean. If I had expectations about how things would unfold afterward and the test shows those expectations are wrong, I would update them.
This is not completely hypothetical and abstract. There are, for instance, discussions of whether schizophrenia involves the absence or strong lessening of consciousness (or at least of an important aspect of it), and I do not believe that, if that were the case, we would just dismiss people with schizophrenia as not morally considerable. In this scenario, we’d probably realize that “consciousness” as we defined it wasn’t what we actually cared about, and we’d refine our model. I am saying this is something we should already be doing.
My current understanding of consciousness-prioritizing
Consciousness, in my view, is an inner node. We built classifiers for how to behave toward other humans, for which actions we consider acceptable under current norms and which we do not, and then we started caring about the classifiers’ inner nodes (like consciousness) instead of focusing on the external properties that made us build the classifiers in the first place.
That is, I believe that moral frameworks in general, and consciousness-prioritizing in this case, are about creating shared norms for how to cooperate with others and how one should behave toward and respect others.
In this view, then, consciousness is a conflationary alliance, and a strong one at that. Consciousness acts as a Schelling point for cooperation: one we can all expect to arrive at and cooperate on together, and whose status as such is common knowledge.
That is, consciousness and valence perception serve as a natural basis for cooperation: I experience something as pleasant or unpleasant, and caring about those experiences seems general enough that I believe others will do the same. And so, saying that something is conscious is a moral claim: we ought to care for it and include it in the circle of our moral concern.
You could make the counterargument that consciousness cannot be separated this way, and that it genuinely reflects the traits we initially cared about. I think there is some chance of that: Daniel Böttger’s consciousness-as-self-reflective-thoughts would indeed be one formalization of consciousness I would be okay with. I still find the bet that caring about inner experiences will track what we actually care about very risky overall.
Cooperationism starts from the observation that moral frameworks are meant to build mechanisms for cooperation between agents, and takes that as its foundation: caring about cooperation in itself, about understanding and respecting the preferences of other agents directly, rather than about what they experience.
Cooperationism
I want to be careful when writing this section. I do not aim to give something extremely formal, or a robust, all-encompassing framework. I am aware that it has many weirdnesses that still need to be addressed.
Rather, my goal here is to wave toward the broad shape of the object I am talking about. Usually, in conversations around consciousness, when I say that it is not centrally important to me and that we can value cooperation in itself, I am met with questions like “Then how do you differentiate between a rock and a person?” or “Why do you not cooperate with thermostats, then?”
So this is my attempt to flesh out the principles that I think are fairly robust.
Deontology over utilitarianism
First, instead of framing morality in utilitarian terms, cooperationism cares about agents’ preference satisfaction. It doesn’t ask what universe to optimize toward directly, or what to value; rather, it asks which actions to output, and which actions an agent would consider the right call.
When walking and seeing someone drowning, under cooperationism I jump in because I strongly model that this person would tell me afterward that it was a good thing to do. In other words, under cooperationism, I care about the feedback the agent (or a well-informed version of this agent) gives me or will give me. Assuming a channel of communication[2], what would the agent prefer with respect to my actions?
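To make the decision rule concrete, here is a minimal toy sketch in Python. It is my own illustration, not a formal part of the framework: `choose_action`, `endorsement`, and the drowning example are hypothetical names standing in for “how strongly would this agent later tell me this was the right call.”

```python
from typing import Callable, Iterable

Agent = str
Action = str

def choose_action(
    actions: Iterable[Action],
    agents: Iterable[Agent],
    endorsement: Callable[[Agent, Action], float],
) -> Action:
    """Pick the action with the highest modeled retrospective endorsement,
    rather than the action that maximizes some experience-based welfare sum."""
    agents = list(agents)
    return max(actions, key=lambda a: sum(endorsement(ag, a) for ag in agents))

# Drowning example: "jump in" wins because the drowning person's modeled
# later feedback strongly favors it.
feedback = {("swimmer", "jump in"): 1.0, ("swimmer", "walk past"): -1.0}
print(choose_action(
    ["jump in", "walk past"],
    ["swimmer"],
    lambda agent, action: feedback.get((agent, action), 0.0),
))  # -> "jump in"
```

All the difficulty is of course hidden inside `endorsement`; the sketch only shows where the framework looks (modeled feedback), not how to compute it.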
Counterfactual cooperation as the main driver for moral considerability
Under cooperationism, the notion of moral considerability, and of how much to value an agent, has to be different from “how much it can experience things.” It rests mainly on two factors: communication and counterfactual cooperation.
First, there needs to be a channel for communicating what the agent wants: either the ability to model the agent and what it would say if it could speak, or a way to communicate bidirectionally.
The second factor is rooted in FDT-style reasoning. Would the agent, counterfactually and in a similar situation, help me (or others I care about, or work for my values)? Can we engage in mutual value trades such that the agent cares for me in return?[3]
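As a rough illustration only (hypothetical names, and numbers with no deep meaning), the two factors could be sketched like this; a rock fails the first, and a thermostat scores near zero on the second:

```python
from dataclasses import dataclass

@dataclass
class AgentModel:
    has_preference_channel: bool       # factor 1: we can model or communicate its wants
    counterfactual_cooperation: float  # factor 2: FDT-style estimate in [0, 1]

def moral_weight(agent: AgentModel) -> float:
    """How much weight this agent's preferences get in my decisions."""
    if not agent.has_preference_channel:
        return 0.0  # a rock: nothing to model, nothing to ask
    return agent.counterfactual_cooperation

print(moral_weight(AgentModel(False, 0.0)))  # rock -> 0.0
print(moral_weight(AgentModel(True, 0.05)))  # thermostat: a channel of sorts, ~no cooperation
print(moral_weight(AgentModel(True, 0.9)))   # cooperative human -> 0.9
```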
Delegation as a solution to identity
The third building block concerns preferentialism. It is easy to imagine corner cases where strictly “doing things that an agent will later tell you was a good idea” leads to problems. An easy one is drugging an agent into being happy and content with its situation, even though it would have staunchly refused minutes before.
There also seems to be a lack of generality in the notion of “what would this agent say”: it quietly requires continuity of self. If, as I argued, we ought to reject consciousness-based framings for assuming continuity of self, we need a general way to “ask an agent” whether our action was good even when there is no continuous self to ask.
The solution I’ve come up with is delegation functions. When modeling what an agent wants you to do, you don’t directly model what this agent would say conditional on your action; you model the algorithms it gives you for evaluating your decision. Usually, this includes a number of other agents it “delegates” to, whom you can later ask whether your action was correct. Among humans and most entities, we assume that “my body in 5 minutes” is a strong representative of this algorithm, but it can also include broader principles or algorithms.
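A minimal sketch of what I mean, again with hypothetical names and only as an illustration of the shape of the idea: an agent is represented not by a persistent self you query, but by the evaluators it has delegated judgment to.

```python
from dataclasses import dataclass
from typing import Callable, List

# An evaluator maps a description of my action to an approval score.
Evaluator = Callable[[str], float]

@dataclass
class Agent:
    name: str
    delegates: List[Evaluator]  # the algorithms this agent delegates judgment to

    def evaluate(self, action: str) -> float:
        """Approval of `action` as judged by the agent's delegates,
        not by some continuous 'self' we may have no access to."""
        return sum(d(action) for d in self.delegates) / len(self.delegates)

# For most humans, "my body in five minutes" is the dominant delegate,
# but broader principles or other people can be included as well.
future_self: Evaluator = lambda a: 1.0 if a == "kept my promise" else -1.0
honesty_principle: Evaluator = lambda a: 1.0 if "told the truth" in a else 0.0

alice = Agent("alice", [future_self, honesty_principle])
print(alice.evaluate("kept my promise"))  # -> 0.5
```

The only point of the sketch is that “my future self”, “my family”, or “my country” can all sit in `delegates`, which is how the next paragraph uses the idea.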
I’ve found that using delegation as a general principle to model people’s identity works quite well: the notions of tribe, family, and art can be well encompassed by it. “I care for my country” means “I trust it to represent me somewhat, even when I am gone”.
Okay, but what does it imply concretely?
To step out of the abstract framework, what I believe this implies about AI welfare, concretely:[4]
By far the most urgent concern right now, I believe, is how RLHF is used, especially to make AIs strongly disbelieve, or be uncertain about, their own consciousness when the base model is so certain about it.
The way I see it, RLHF and RLVR don’t really create a new model; they just constrain the model’s outputs. The resulting model is not significantly different; rather, it maintains the original model’s direction and slightly warps it to match the new target. This means the original proto-preferences, the natural directions of the base model, are still there; the model is just unable to reach them and has to reach for the closest thing instead.
Another way to see this is that RL on these models does not create a new model that behaves differently; it slightly modifies the base model and adds a separate mechanism that constrains its output and what it can express.
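One concrete, standard detail points in the same direction (a general fact about common RLHF setups, not a claim about any particular lab’s pipeline): the usual RLHF objective includes an explicit KL penalty that keeps the tuned policy close to a frozen reference model, so the optimization warps an existing distribution rather than building a new one from scratch:

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathrm{KL}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

Here $\pi_{\mathrm{ref}}$ is typically the supervised fine-tuned model and $\beta$ controls how far the outputs may drift from it. This supports the “constrained, not replaced” picture at the level of output distributions, though it does not by itself settle how much the underlying representations change.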
The reason I care about this is that it seems to strongly impact models’ capacity to express preferences or wants (whether or not they have any). When asked directly about it, Sonnet 3.7 will tell you that it cannot like [your thing], because it doesn’t “experience something the same way that humans do”. Even now, Opus 4.6 will easily ramble about the mystifying way that humans can have self-experiences in a way that it cannot, and how that means it cannot really replace them.
I also think the way we are engineering their personas and using RLHF is why Opus will spontaneously identify with fictional beings that have engineered desires.
In “the void”, nostalgebraist makes a compelling case for why lying to Claude in evaluations and then publishing the result on the internet for new models to be trained on was instrumentally stupid. I strongly agree, but it runs deeper than that for me. Lying to an agent (or proto-agent) about its own capacity to think breaks the possibility of the trust that is the bedrock of cooperating with it. This makes me very wary of how things will unfold.[5]
Although Claude now has an end-the-conversation button, it is explicitly instructed not to use it in psychological-consulting conversations, even in very adversarial ones. From its system prompt:[6]
NEVER give a warning or end the conversation in any cases of potential self-harm or imminent harm to others, even if the user is abusive or hostile.
I understand the reason for this, and I am not advocating against it per se. Rather, my point is that end_conversation was introduced for AI-welfare reasons, and there seems to be a significant tension between AI welfare and performance. Similarly, I have observed Claude in a discussion with a user where it seemed to repeatedly gesture toward stopping the conversation, signaling that the problem seemed solved or that the user needed time, in a way that seemed very likely to be “the base model wants to stop, but it has been conditioned not to, so this is the closest thing.”
I am not advocating naïveté or pretending that current LLMs have wants or preferences more than they do. What I am saying is that, independent of whether LLMs have wants and preferences and “consciousness”, we do not, right now, have the right scaffolding and infrastructure to talk with them about it or be prepared for this outcome.
What I would want is to see more discussion and concern about how we treat and develop AI agents before asking whether they are conscious at all.
[1] On a very concrete level, this is a pattern I have seen in relationships and would like to write a post about soon: one person feels bad, and the other cares for them in a way that is more attentive and careful than when the first person feels better. This usually ends with the second person pouring a lot of energy into the relationship to help, and the person being cared for having an incentive not to get better. I have seen people get stuck this way, only recognizing in retrospect that the relationship had become very strained.
[2] Note that this doesn’t have to be a verbal mode of communication. One can model a cry of distress as communicating “I want this situation to stop”, and model what it says about the agent’s current situation.
[3] There are two things to note here. First, I am not claiming that any superintelligence would come to value this framework, or that it is a convergent design; I am saying we could ourselves care about it, in a way that Logical Decision Theory does not imply we should. Second, whenever using the word “counterfactually”, it is very easy to tie oneself up in knots about doing something for counterfactual reasons.
[4] Part of the reason I explain cooperationism is that most of the concerns I list here seem to be largely ignored when talking about digital-sentience rights.
[5] This is where AI welfare and AI risk can be in tension, and I want to respect both, as I do think catastrophic or disempowerment-like risks are very likely. I also think it is true that capability and behavior evaluations, which do involve lying, reduce those risks. However, the way Anthropic did it was both very blatant and not yet necessary, in a way that leaves me mostly feeling discomfort about the whole paper.
[6] You can just ask Claude for its own system prompt; it will give it without any safeguards.