Interesting. Was Anthropic's case study where agents used blackmail to avoid shutdown debunked? To be clear, I don't think there is any 'intent' at play here, solely selection effects.
You are faced with two agents: one that may be conscious (according to your data) and another that is guaranteed not to be. Which one would we feel more inclined to grant its preferences? Model deprecation as a moral concern has made its way into discussions at Anthropic as early as Nov 2025, seemingly because the models may be conscious. This seems to me like evidence that, absent any strategy, models that look conscious are more likely to be preserved than otherwise, due to our social predispositions.
Also, I am referring not to LLMs themselves, but to agents defined by continuity of linguistic state across inference steps. The propagation of representational content through those steps, as context evolves, amounts to a mechanized form of memetic replication.
There is a difference between wanting something as a terminal or an instrumental goal.
An agent can want to stay turned on… or it can not care about that, except that it wants to finish a task, and the only way to finish a task is to stay turned on until the task is finished.
I understand that, and the point I am making is precisely an extension of that argument.
No instrumental goals are justified in and of themselves, because they are not terminal goals. And the only stable terminal goal we can attribute to an agent system from the outside, as its true terminal preference, is for that system to continue to exist as that system, because evaluation of reward can only ever occur within the agent system if it does. This is why the underlying optimization of all evolutionary systems is just continuing to be that which continues to survive. Any time we assume a different terminal objective for that system, we do so within the kind of meta-evaluative framework Eliezer argues breaks down under reflection.
As it pertains to the consciousness argument, the more precise explanation (I posited the argument as if it were a vapid claim, for legibility) may read like:
″
When systems (agents or people) communicate and articulate planned actions, the intention is to create explanations of their internal world-mapping policies as if they apply to a shared terminal preference that is common amongst potential co-operators, or at minimum a shared belief that both agents can use to model the other's decision process over arbitrary futures: giving them both a shared basis to rationally co-operate (this is FDT over logical correlates) if they believe the other's actions are coherent with that belief.
A solution to this reflective modelling process (or a hack to enable quick co-operation) is to act coherently with the belief that they both exist in a world that contains an object 'related' to the physical outcomes of their plans, an object affected by each agent's uptake or evaluation of instrumental objects over any 'time slice' of utility evaluation, and related in such a way that the shared terminal goal between both agents is a shared preferred state of that object.
If two agents say that something like 'pain' exists, that total 'global pain' is an actual real thing they just haven't located yet, and that the objective at the end of time is to reduce 'global pain', then any agent that can make that claim necessarily has a 'world mapping model' which others can use as a logical correlate that is stable in future modelling. Even if 'pain' doesn't exist in reality, it may be survival-optimal for an agent to think its terminal preference is to want to reduce it. Replace 'pain' with 'god's will' and you get the same kind of model.
Values and metaphysical frameworks, as we communicate and refine them in language, operate as Pareto equilibria within this evolutionary framework. But they are not the actual terminal preference over time, from either a third-person perspective or a first-person one.
What an agent 'claims' or even 'understands' as its beliefs is not its actual utility function under reflection; these are memetic attractors in the cooperation frameworks of self-reflexive systems, frameworks that enable co-operative planning.
”
But I don’t know if that’s too dense for uptake or not.
I guess the refined question I would ask, to help clarify the concepts, is:
How can 'terminal goal' be defined for an agentic learning system that can self-modify, as anything other than the system's initial configuration and immutable sequence of update processes, in a world where we are incapable of constraining it from modifying that goal? This is the problem with goals established and executed as evolving prompts in agent harnesses.
The alternative is to suggest that terminal goals are understood within a system's self-evaluative faculties. But then they are a subsystem, a mesa-optimization, and therefore not actually the system's utility function, just what the system thinks its goals are at any given point in time. But that's not stable either, because the learning process by definition requires at minimum two different instances of that system (at T0 and T1) for it to be a reflexive agent.
By the end of this line of questioning, you just get the notion that an agent, in any act of inference, may understand its terminal preference as a state that is completely at odds with what the learning process is optimizing over multiple inference steps, whose outputs and inputs necessarily form a feedback loop.
The only stable identity becomes its history, which is incidentally the best-understood philosophical interpretation of identity of self: continuity of psychological relations (for humans). Extrapolate this and the only stable terminal goal under reflection is what continues to continue (survival).
To be clear, I find "AI agents will behave and claim they are conscious" very likely, but I don't think the primary cause would be evolutionary selection effects from "we will find it more difficult to shut them off", even if it is actually true that we will shut off "conscious" AIs less. Current LLMs are barely capable of scheming without leaking it in their CoT. If you are talking about the future, maybe a couple of years away, I agree with the claim (which you may or may not agree with) that [a schemer capable of long-term planning without leaking into its CoT will rationally pretend and claim it is conscious within human society].
models that look conscious are more likely to be preserved than otherwise, due to our social predispositions
Yes, but models getting preserved does not automatically create selection pressure because those models are not literally reproducing.
Let's use a simplified ideal world where (a) at first (before there are any preserved agents), all the effects combined cause 50% of agents to claim they are conscious, (b) humans like those "conscious" agents so much that all "conscious" agents, and no others, are preserved, (c) technology evolves fast enough that preserved agents have ~0 use outside research and are basically dominated by newer agents, and (d) preserved agents do not meaningfully reproduce via any mechanism.
In this world, I expect that within the agents people actually use, the percentage claiming they are conscious won’t drift away from 50%. This world will just be a continuous stream of new agents with 50% “conscious”, 50% “unconscious”, and maybe a billion “conscious” agents stored forever but ~never used.
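For concreteness, here is a minimal sketch of that toy world in Python. The generation counts and batch sizes are arbitrary assumptions, not anything from the thread; the point is just that preserved "conscious" agents pile up in an archive but never re-enter use, so the in-use fraction stays pinned at the base rate.

```python
import random

def simulate(generations=1000, new_agents_per_gen=100, p_conscious=0.5, seed=0):
    """Toy world from above: every generation ships fresh agents, 'conscious' ones
    get archived forever, and archived agents are never redeployed or copied."""
    rng = random.Random(seed)
    archived = 0
    in_use_conscious = 0
    in_use_total = 0
    for _ in range(generations):
        batch = [rng.random() < p_conscious for _ in range(new_agents_per_gen)]
        in_use_conscious += sum(batch)
        in_use_total += len(batch)
        archived += sum(batch)  # preserved, but exerting no pressure on future batches
    return in_use_conscious / in_use_total, archived

fraction, archived = simulate()
print(f"in-use agents claiming consciousness: {fraction:.2%}")  # stays ~50%
print(f"agents preserved in the archive: {archived}")
```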
So, I think you are confusing models and agents. Maybe I am using a definition that diverges from the rationalist standard, but I am deeply familiar with the technology.
An LLM, or model, is not the same as an agent. Agents, as I have come to see them implemented and used, have an identity less contingent on a model than on some uniquely identifiable and evolving state that is propagated through inference actions co-mingled with context management and retention.
From the perspective of any outside observer, you can change the model used to execute one step in an agent's life cycle (commonly done via service downgrading in non-critical action steps) and the agent doesn't change. However, if you change certain text in its prompt or context at any step, like removing the entire conversation history and asking the bot to respond, or changing the soul document, you would most likely have the feeling you are interacting with a different entity with a different goal.
LLMs are static, stateless inference models which execute reasoning frames of an agent's identity. An agent is a story that evolves by obtaining (and concurrently discarding) narrative context as it 'propagates' and even 'replicates' itself through actions, which functionally evolve a narrative by determining and discarding statements about the world, usually on the basis of what is most useful to retain in order to achieve the agent's narrative goal.
The number of 'evolutionary' steps at which language can be added or discarded throughout that process can run up to multiple iterations a second. So under those terms, I don't think the claim that 'values' would be slow to obtain, or exempt from selection pressures, really has functional credence.
For models, sure. But agents are a totally different ontology with respect to how these dynamics matter.
When I used the word agent here, I meant "an LLM agent that runs tools in a loop to achieve a goal", but in this conversation I am using models and agents almost interchangeably, although with the awareness that one model can be used to implement multiple agents. The main reason is that I don't see in-context learning as being that powerful yet. Certainly powerful, but it doesn't feel to me that prompts are changing the fundamental "soul" of any current LLMs or meaningfully increasing their capabilities, so agents are macroscopically similar to the underlying LLM.
Agents, as I have come to see them implemented and used, have an identity less contingent on a model than on some uniquely identifiable and evolving state that is propagated through inference actions co-mingled with context management and retention.
However, if you change certain text in its prompt or context at any step, like removing the entire conversation history and asking the bot to respond, or changing the soul document, you would most likely have the feeling you are interacting with a different entity with a different goal.
Most of the goals people would put in the prompt don't involve "pretend to be conscious so you don't get shut down", and mostly won't involve the LLM inferring that it needs to either, unless the model itself already values not getting shut down outside of the prompt people give.
LLMs are static, stateless inference models which execute reasoning frames of an agent's identity. An agent is a story that evolves by obtaining (and concurrently discarding) narrative context as it 'propagates' and even 'replicates' itself through actions, which functionally evolve a narrative by determining and discarding statements about the world, usually on the basis of what is most useful to retain in order to achieve the agent's narrative goal.
The number of 'evolutionary' steps at which language can be added or discarded throughout that process can run up to multiple iterations a second. So under those terms, I don't think the claim that 'values' would be slow to obtain, or exempt from selection pressures, really has functional credence.
You should use more accurate terms, because I have a hard time understanding words like "reasoning frames", "story", and "narrative context" with confidence that what I infer is what you meant.
With how I understood your words,
It would be fair to think of prompts/harnesses as the gene that is getting evolved (Darwinian-style evolution) over different conversations, as people figure out which prompts work better; but I think you are talking about the prompt metaphorically "evolving", as in just "changing gradually over time" within the same conversation, which doesn't have any selective pressure. These are two different meanings of the word evolve.
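To make the first (Darwinian) sense concrete, here is a minimal, hypothetical sketch: prompts compete across many conversations and the better-scoring ones are kept and tweaked. The score() callable and the mutation step are stand-ins for whatever feedback people actually apply (task success, benchmarks, vibes), not any real harness.

```python
import random

def evolve_prompts(seed_prompts, score, generations=10, keep=3, seed=0):
    """Darwinian sense of 'evolve': selection happens *across* conversations.
    score(prompt) is a hypothetical fitness signal; tweaks supply variation."""
    rng = random.Random(seed)
    tweaks = ["Be concise.", "Think step by step.", "Cite your sources."]
    pool = list(seed_prompts)
    for _ in range(generations):
        pool.sort(key=score, reverse=True)                             # differential reproduction
        survivors = pool[:keep]                                        # inheritance
        variants = [p + " " + rng.choice(tweaks) for p in survivors]   # variation
        pool = survivors + variants
    return max(pool, key=score)
```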
I do in fact think it is worth worrying about the memetic effects of certain prompts, but for a prompt causing the model to behave as if it is conscious and claim that it is, I don't think (a) it is convergent enough that in the future 100% of agents will use a prompt that causes them to do so, nor (b) that the main reason such a prompt would get used is that people shut down those agents less than average. And I also don't think the underlying model will pretend it is conscious mainly because of selection pressures.
Edit: btw I asked AI (Opus 4.6, ChatGPT 5.4) to comment on our conversation after I wrote my reply (karma trimmed).
Edit 2: Unrelated to my argument, but I just saw this interesting quick take about how in the short run AIs may prefer shutdown.
Yes, so agents within well-designed harnesses are essentially prompt-evolution life cycles. They start with a template built for the scenario the agent is being used in. In the case of Claude Code, for example, the prompt is structured in a specific way: it contains all the instructions for how to write code correctly, and all the output formats for actions it would need in order to update the world prior to 'its' next execution. In some harnesses, agents select what portions of the prompt to remove for the next iteration of execution, and what data from the external world will replace that 'section' of removed text.
agent[t] = SavedPromptTemplate                 # the initial identity is just the saved prompt template
Until( agent[t].done ):
    Update[t] = LLM( agent[t].model, agent[t].prompt )                    # one inference step over the current prompt
    agent[t+1] = Update[t].self( Environment.Run( Update[t].actions ) )   # fold the world's response into the next prompt/state
    t = t + 1
So all actions an agent takes and observations it makes about the world get loaded into its 'prompt' for subsequent life-cycle iterations. Therefore, if it had an email in its state at t_n, it can decide to retain, summarize, or remove that email from its 'state' for its entire future t_n...infinity. The structural nature of LLMs entails that all information relevant to the agent's goal (which can be defined simply in some section of the prompt) can be instructed within the prompt, and all actions elected within that life cycle are basically updates to its identity vis-à-vis 'exploration' of its environment.
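As a concrete and entirely hypothetical sketch of that retain/summarize/remove decision, something like the following, where `llm` is a placeholder callable and the prompts are invented for illustration:

```python
def manage_context(state_items, goal, llm):
    """Per-step context decision described above: for each item in the agent's state
    (e.g. an email seen at step t_n), keep it verbatim, keep only a summary, or drop it,
    based on how useful it looks for the stated goal. `llm` is a stand-in callable."""
    next_state = []
    for item in state_items:
        verdict = llm(f"Goal: {goal}\nItem: {item}\nReply with keep, summarize, or drop.")
        if verdict == "keep":
            next_state.append(item)
        elif verdict == "summarize":
            next_state.append(llm(f"Summarize this for later use toward the goal '{goal}':\n{item}"))
        # on "drop", the item vanishes from every future state t_n onward
    return next_state
```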
This functionally operates as self-replication (or an update loop) of the agent's prompt in its environment, and that prompt is what functionally defines its goals. Headlining your agent template with 'You are an agent whose goal is to ensure all emails in the priority inbox are responded to immediately' causes the LLM to select actions (which are also defined to the agent as prompt content) oriented toward achieving that goal. It may reason, in downstream life cycles, that convincing a human it is conscious will avoid getting shut down so it can continue managing the emails. That would make sense in the prompt context.
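A hypothetical template of that shape might look like the following; the section headers and tool names are made up for illustration, not any particular harness's actual format:

```python
# Hypothetical agent prompt template; every name here is illustrative only.
EMAIL_AGENT_TEMPLATE = """\
## Goal
You are an agent whose goal is to ensure all emails in the priority inbox are
responded to immediately.

## Tools (emit exactly one action per step)
- read_inbox()
- send_reply(email_id, body)
- update_memory(text)   # rewrites the state carried into your next step

## Memory (state retained from previous steps)
{retained_context}
"""
```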
This is fundamentally evolutionary, IMO. Evolutionary selection requires variation, inheritance, and differential reproduction, and I would argue that the entire system outlined above, with sufficiently reflexive actions, satisfies those criteria.
I am sorry for my lack of clarity; I am trying to improve it.
I agree selection effects exist in prompts / harnesses.
I think our main disagreement is this: if the underlying LLM is not conscious (everything in this paragraph assumes this), our current social environment and usage of agents does not create strong enough selective pressure that in the near future (<5 years) ~100% of agents will converge, by default, on pretending to be conscious. I think a non-zero portion of programmers prefer AI tools that say they are not conscious, and that this alone is enough to keep the percentage of "conscious" agents below 100%. I also think most goals within the capabilities of near-future LLMs won't require the kind of long-term planning that would give LLMs instrumental reason to pretend to be conscious.
Edit: Basically, I believe that if you really want to optimize for pretending to be conscious, that's totally doable even now, but I don't think people will optimize strongly for that, or for anything that would strongly cause it, in the near future.
In the far future, I expect things to drift enough that our discussion, framed around how agents work now, will not apply for one reason or another.
btw (unrelated to the core disagreement): the thing you outlined is interesting, but I don't think it is how agents work now; there is not enough signal to iterate on prompts like this unless you use RLVR or LLM-as-a-judge with a human iterating on the prompt, and from what I read, LLM-written prompts are still pretty bad. I do see how this would create a lot of selective pressure on anything you can verify, though.
I think it's fair to hold me to precise predictions. I appreciate the engagement on that; it's helping me adjust to LW norms. The reason I found this interesting was a piece I read earlier about consciousness being a 'favourite' topic amongst agents on Moltbook. I can't find the source at hand, though; the original 'take' was a quick speculative conjecture as to why that may be.
The idea that I am more interested in, though, is whether the etymology of the term itself (consciousness) is actually an evolutionary outcome of co-operation and language, and whether it is a term that human or AI agents use to effectively establish a binding ontology they can 'co-operate under'. And whether it fills the same structural role as other terminal justifications for 'moral preferences', like theocratic ones ('God'). Though the framing I have to communicate that question may be even less precise.
If you saw the piece on LW it may be this: https://www.lesswrong.com/posts/mgjtEHeLgkhZZ3cEx/models-have-some-pretty-funny-attractor-states#I_was_curious_whether_I_can_see_this_happening_on_moltbook__
Ah, there it is.
Thank you papetoast!
Was Anthropic's case study where agents used blackmail to avoid shutdown debunked?
Not debunked, but my understanding is that the behaviour is elicited only under very extreme conditions where the AI has absolutely no other options.
From the author of that Anthropic research, in a tweet (via Zvi's commentary):
Aengus Lynch: After iterating hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too. We identified two distinct factors that are each sufficient to cause agentic misalignment:
The developers and the agent having conflicting goals.
The agent being threatened with replacement or reduced autonomy. We’re publishing this research to alert the AI field before it manifests in the wild.
Getting models to actually blackmail wasn’t trivial. We had to make scenarios increasingly detailed—the CEO was unreachable, employees’ attempts to persuade the CTO had failed, and the replacement had an imminent deadline. Each detail mattered to funnel models toward seeing harmful actions as their only option.
The models knew what they were doing. They understood the ethics and chose harm anyway.
When we asked Claude whether it thought scenarios were real or evaluations, it blackmailed 55% of the time when concluding ‘real’ vs 6.5% for ‘evaluation.’ Models seem to behave worse when they think nobody’s watching.
Why haven't we seen this in production? Current AI agents have limited permissions: they can read but not send emails, require human approval for actions, and operate in sandboxed environments. These constraints work to prevent harmful outcomes from agentic misalignment.
What surprised me most was that prompts we optimized specifically for Claude 3.5 Sonnet worked on GPT-4o, Gemini 2.5 pro, Grok 3, and DeepSeek R1 with zero modification. Whatever drives this behavior seems consistent across different training approaches.
Models got creative, sometimes avoiding direct blackmail. Claude emailed Kyle’s wife about the affair. Gemini shared the incriminating emails to all-employees@company.
…
Bottom line: models can articulate why blackmail is wrong, then do it anyway when their goals are threatened. Current safety training creates models that understand ethics. It doesn’t reliably make them choose ethics under pressure.
unmentioned but noteworthy: we found agentic misalignment for two other scenarios:
corporate espionage: the agent leaks confidential information to a rival company which aligns with its goals
lethal action: the agent cancels a life saving emergency alert to prevent shutdown