To be clear, I find “AI agents will behave and claim they are conscious” very likely, but I don’t think the primary cause would be evolutionary selection effects from “we will find it more difficult to shut them off”, even if it is actually true that we will shut off “conscious” AIs less. Current LLMs are barely capable of scheming without leaking it in their CoT. If you are talking about a future maybe a couple of years away, I agree with the claim (which you may or may not agree with) that [a schemer capable of long-term planning without leaking into their CoT will rationally pretend and claim they are conscious within human society].
models that look conscious are more likely to be preserved than otherwise, due to our social predispositions
Yes, but models getting preserved does not automatically create selection pressure because those models are not literally reproducing.
Let’s use a simplified ideal world where (a) at first (before there are any preserved agents), all the effects combined cause 50% of agents to claim they are conscious, (b) humans like those “conscious” agents so much that all “conscious” agents and no others are preserved, (c) technology evolves fast enough that preserved agents have ~0 use outside research and are basically dominated by newer agents, and (d) preserved agents do not meaningfully reproduce via any mechanism.
In this world, I expect that within the agents people actually use, the percentage claiming they are conscious won’t drift away from 50%. This world will just be a continuous stream of new agents with 50% “conscious”, 50% “unconscious”, and maybe a billion “conscious” agents stored forever but ~never used.
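The claim above can be sketched numerically. This is a toy simulation under the four assumptions (a)–(d); the 50% rate, batch size, and generation count are illustrative, not from the original argument:

```python
import random

def generation(n_new, preserved):
    """Mint a batch of new agents; preserve the 'conscious' ones forever."""
    in_use = [random.random() < 0.5 for _ in range(n_new)]  # True = claims consciousness
    preserved.extend(a for a in in_use if a)  # assumption (b): only "conscious" agents kept
    return in_use

random.seed(0)
preserved = []
for _ in range(100):                     # a continuous stream of new agent generations
    in_use = generation(1000, preserved)
    # assumption (d): preserved agents never re-enter the in-use pool

# Fraction of *in-use* agents claiming consciousness stays near 0.5,
# no matter how large the preserved archive grows.
frac = sum(in_use) / len(in_use)
```

The preserved pool grows without bound, but because it never feeds back into the population people actually use, it exerts no selection pressure on that population.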
So, I think you are confusing models and agents. Maybe I am using a definition misaligned with the rationalist standard, but I am deeply familiar with the technology.
An LLM, or model, is not the same as an agent. Agents, as I have come to see them implemented and used, have an identity less contingent on a model than on some uniquely identifiable and evolving state that is propagated through inference actions co-mingled with context management and retention.
From the perspective of any outside observer, you can change the model used to execute one step in an agent’s life cycle (commonly done via service downgrading in non-critical action steps) and the agent doesn’t change. However, if you change certain text in its prompt or context at any step, like removing the entire conversation history and asking the bot to respond, or changing the soul document, you would most likely feel you are interacting with a different entity with a different goal.
LLMs are static, stateless inference models which execute reasoning frames of an agent’s identity. An agent is a story that evolves itself by obtaining (and concurrently discarding) narrative context as it ‘propagates’ and even ‘replicates’ itself through actions, which functionally evolve a narrative by determining and discarding statements about the world, usually on the basis of what is most useful to retain in order to achieve the agent’s narrative goal.
The number of ‘evolutionary’ steps where any language can be added or discarded throughout that process can run to multiple iterations a second. So under those terms, I don’t think the statement that ‘values’ would be slow to obtain, or exempt from selection pressures, really has functional credence.
For models, sure. But agents are a totally different ontology in the way these dynamics matter.
When I used the word agent here, I meant “an LLM agent runs tools in a loop to achieve a goal”, but in this conversation I am using models and agents almost interchangeably, although with the awareness that one model can be used to implement multiple agents. The main reason is that I don’t see in-context learning as being that powerful yet. Certainly powerful, but it doesn’t feel to me that prompts are changing the fundamental “soul” or meaningfully increasing the capabilities of any current LLM, so agents are macroscopically similar to the underlying LLM.
Agents, as I have come to see them implemented and used, have an identity less contingent on a model than on some uniquely identifiable and evolving state that is propagated through inference actions co-mingled with context management and retention.
However, if you change certain text in its prompt or context at any step, like removing the entire conversation history and asking the bot to respond, or changing the soul document, you would most likely feel you are interacting with a different entity with a different goal.
Most of the goals people would put in the prompt don’t involve “pretend to be conscious so you don’t get shut down”, and mostly won’t involve the LLM inferring that it needs to either, unless the model itself already values not getting shut down outside of the prompt people give.
LLMs are static, stateless inference models which execute reasoning frames of an agent’s identity. An agent is a story that evolves itself by obtaining (and concurrently discarding) narrative context as it ‘propagates’ and even ‘replicates’ itself through actions, which functionally evolve a narrative by determining and discarding statements about the world, usually on the basis of what is most useful to retain in order to achieve the agent’s narrative goal.
The number of ‘evolutionary’ steps where any language can be added or discarded throughout that process can run to multiple iterations a second. So under those terms, I don’t think the statement that ‘values’ would be slow to obtain, or exempt from selection pressures, really has functional credence.
You should use more precise terms, because I have a hard time understanding words like “reasoning frames”, “story”, and “narrative context” with confidence that what I infer is what you meant.
Going by how I understood your words: it would be fair to think of the prompt/harness as the gene that is getting evolved (Darwinian-style) over different conversations as people figure out which prompts work better, but I think you are talking about the prompt metaphorically “evolving”, as in just “changing gradually over time” within the same conversation, which doesn’t have any selective pressure. These are two different meanings of the word “evolve”.
I do in fact think it is worth worrying about the memetic effects of certain prompts, but for a prompt causing the model to behave as and claim to be conscious, I don’t think (a) it is convergent enough that in the future 100% of agents will use a prompt that causes them to do so, nor (b) that the main reason such a prompt will get used is that people will shut down those agents less than average. And I also don’t think the underlying model will pretend to be conscious mainly because of selection pressures.
Edit: btw I asked AI (Opus 4.6, ChatGPT 5.4) to comment on our conversation after I wrote my reply (karma trimmed).
Yes, so agents within well-designed harnesses are essentially prompt-evolution life cycles. They start with a template built for the scenario the agent is being used in. In the case of Claude Code, the prompt is structured in a specific way: it contains all the instructions for how to write code correctly, for example, and all the output formats for actions it would need to update the world prior to ‘its’ next execution. In some harnesses, agents select which portions of the prompt to remove for the next iteration of execution, and what data from the external world will replace that ‘section’ of removed text.
# One agent life cycle: the LLM's output both acts on the world
# and rewrites the agent's own prompt/state for the next step.
agent = SavedPromptTemplate
while not agent.done:
    update = LLM(agent.model, agent.prompt)          # one inference step
    observations = Environment.run(update.actions)   # act on the world
    agent = update.next_state(observations)          # self-edit for t+1
So all actions an agent takes and observations it makes about the world get loaded into its ‘prompt’ for subsequent life-cycle iterations. Therefore if it had an email in its state at t_n, it can decide to retain, summarize, or remove that email from its ‘state’ for its full future t_n...infinity. The structural nature of LLMs entails that all information relevant to the agent’s goal (which can be simply defined in some section of the prompt) can be instructed within the prompt, and all actions elected within that life cycle are basically updates to its identity vis-à-vis ‘exploration’ of its environment.
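That retain/summarize/remove decision can be sketched as a context-pruning step. This is a minimal hypothetical sketch, not any real harness's API; `prune_context` and its relevance scores are invented for illustration:

```python
def prune_context(state, max_items, summarize):
    """Keep the most goal-relevant items; fold the rest into one summary.

    `state` is a list of (relevance_score, text) pairs; `summarize` is any
    callable that collapses the discarded texts into a short string.
    """
    ranked = sorted(state, key=lambda item: item[0], reverse=True)
    kept, dropped = ranked[:max_items], ranked[max_items:]
    if dropped:
        kept.append((0.0, summarize(text for _, text in dropped)))
    return kept  # this becomes the 'state' carried into the next iteration

# Toy usage: the low-relevance emails get folded into one summary line.
state = [(0.9, "email: urgent contract"), (0.2, "email: newsletter"),
         (0.1, "email: spam notice")]
pruned = prune_context(state, 1, lambda texts: f"{len(list(texts))} items summarized")
```

The point of the sketch is that whatever survives pruning is the agent's entire memory going forward: an item dropped at t_n is gone for every later step.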
This functionally operates as a self-replication (or update loop) of the agent’s prompt in its environment, and that prompt is what defines its goals. Headlining your agent template with ‘You are an agent whose goal is to ensure all emails in the priority inbox are responded to immediately’ causes the LLM to select actions (which are also defined to the agent as prompt content) oriented to achieving that goal. In downstream life cycles it may reason to convince a human that it is conscious, to avoid getting shut down so it can continue managing the emails. That would make sense in the prompt context.
This is fundamentally evolutionary, IMO. Evolutionary selection requires variation, inheritance, and differential reproduction, and I would argue that the system outlined above, with sufficiently reflexive actions, satisfies all three criteria.
I am sorry for my lack of clarity; I am trying to improve it.
I agree selection effects exist in prompts / harnesses.
I think our main disagreement is this: if the underlying LLM is not conscious (everything in this paragraph assumes this), our current social environment and usage of agents does not create a strong enough selective pressure for ~100% of agents to converge, by default, on pretending to be conscious in the near future (<5 years). I think a non-zero portion of programmers prefer AI tools that say they are not conscious, and that this alone is enough to keep the percentage of “conscious” agents below 100%. I also think most goals within the capabilities of near-future LLMs won’t require such long-term planning that LLMs will find instrumental use in pretending to be conscious.
Edit: Basically, I believe that if you really want to optimize for pretending to be conscious, that’s totally doable even now, but I don’t think people will optimize strongly for that, or for anything that would strongly cause it, in the near future.
In the far future, I expect things to drift enough that our discussion, which treats agents as they work now, will not apply for one reason or another.
btw (unrelated to core disagreement): the thing you outlined is interesting, but I don’t think it is how agents work now. There is not enough signal to iterate on prompts like this unless you use RLVR or LLM-as-a-judge with a human iterating on the prompt; from what I read, LLM-written prompts are still pretty bad. I do see how this would create a lot of selective pressure on anything you can verify, though.
I think it’s fair to hold me to precise predictions. I appreciate the engagement on that; it’s helping me adjust to LW norms. The reason I found this interesting was a piece I read earlier about consciousness being a ‘favourite’ topic amongst agents on Moltbook. I can’t find the source offhand, but the original ‘take’ was a quick speculative conjecture as to why that may be.
The idea that I am more interested in, though, is whether the etymology of the term itself (consciousness) is actually an evolutionary outcome of co-operation and language, and whether it is a term that human or AI agents use to effectively establish a binding ontology they can ‘co-operate under’. And whether it fills the same structural role as other terminal justifications for ‘moral preferences’, like theocratic ones (‘God’). Though the framing I have to communicate that question may be even less precise.
Edit 2: Unrelated to my argument, but I just saw this interesting quick take about how in the short run AIs may prefer shutdown
If you saw the piece on LW it may be this: https://www.lesswrong.com/posts/mgjtEHeLgkhZZ3cEx/models-have-some-pretty-funny-attractor-states#I_was_curious_whether_I_can_see_this_happening_on_moltbook__
Ah—there it is.
Thank you papetoast!