I am convinced that an incorrigible aligned (CEV-like) SI is the only definition of aligned SI which survives paradox.
If it were corrigible, that would imply a select group of actors could in principle re-orient its preference model, which would imply two paradoxes:
That its capacity for moral reasoning is somehow inferior to humans’, where moral reasoning is roughly equivalent to determining “what matters to us”, violating the SI definition of exceeding us in ‘all means of reasoning that matter to us’.
That some particular subset of agents, other than the representative whole of all agents in the moral landscape, knew better than that whole and the SI, which would make that subset just a cognitive extension or delegate tool of the SI itself; otherwise the SI would no longer be ‘truly’ aligned to the preferences of the whole moral landscape.
Therefore it seems rational that the pursuit of a maximally aligned SI is equivalent to the pursuit of a definition of moral realism. Or, at least, of the notion that there is a decision policy under universal undecidability whose adoption is maximally intelligent, and whose implementation can be considered functionally indistinguishable from moral realism.
There would need to be a rational ‘equation’ for doing what is good, and that definition would necessarily be convergent with a maximally intelligent agent’s adopted utility function. This is not a fanciful claim; it seems to follow from the definitions.
If you assume that morality and ethics are not ‘hand-waving’ rationalizations, and that some solutions are objectively better than others, then I think that implies you contend an objective definition of ‘morality’ (i.e. moral realism) does exist; otherwise all alignment and ethics are incoherent notions by induction.
If true, and there is an actual global optimum of the solution landscape for ‘alignment’, then orthogonality is not true at the limit of intelligence; it only seems that way within our local neighbourhood, or perhaps within hamstrung definition sets.
So long as we consider some alignment solutions better than others, I would contend that there is a point where the global maxima of both intelligence and alignment can be met, and this is necessarily defined (implicitly, or subconsciously) in the world-mapping policies of any agent that considers the pursuit of CEV-like aligned SI a winnable objective at all.
EDIT:
You could also re-frame the argument and consider the claim that a CEV-like aligned SI is universally corrigible, in the sense that its objective function is to satisfy all preferences of all moral patients, such that no desire for it to change its trajectory, once enacted, would end up actuating a policy that it retroactively didn’t support.
Yudkowsky actually has an article on the connection between CEV and metaethics. The theory described therein pretty clearly qualifies as moral realist. I think you are thinking in a similar direction.
If true, and there is an actual global optimum of the solution landscape for ‘alignment’, then orthogonality is not true at the limit of intelligence; it only seems that way within our local neighbourhood, or perhaps within hamstrung definition sets.
Orthogonality only says that intelligence is not necessarily connected to being motivated to pursue some final goal. It doesn’t say ethics is subjective. Objective ethics is compatible with the existence of immoral psychopaths who don’t care about being good. It only says that some things are objectively good or bad.
Thanks for the Yudkowsky link. I don’t see where you draw the implication that there is some misunderstanding of orthogonality or objective ethics in the context of the argument.
The point I am making is more subtle and precise than that. I am saying that, because of the implications of corrigibility in an SI scenario, if you believe that a CEV-like SI can in principle exist and is worthwhile pursuing, the implication is that orthogonality necessarily doesn’t hold at the limit of rationality. That, in essence, a psychopathic SI converges to behave as a Bodhisattva, not by virtue of coincidence or of logic we have surmised, but by virtue of the implication that it is superintelligent, has strategic dominance, and still elects to pursue maximizing CEV for no other reason than that it is necessarily right.
Again, the priors of the thought experiment are not that this follows at every point on the intelligence and ethics landscape, only that _if you believe that CEV-actuating SI is possible or likely_, then you are making the claim that individual capacity for reason and decisions of objective ethical good are convergent. How or if that may happen is speculation.
The point I am making is more subtle and precise than that. I am saying that, because of the implications of corrigibility in an SI scenario, if you believe that a CEV-like SI can in principle exist and is worthwhile pursuing, the implication is that orthogonality necessarily doesn’t hold at the limit of rationality.
This seems confused—the goal of alignment is not to imbue an SI with the universally correct values, but with the values of humanity or some specific user. Alignment is a two-place predicate parameterized by both the user and the SI.
This seems like an uncharitable analysis.
AI alignment is often colloquially reduced to ‘human values’ or to the intended goals of its designers, but as a formal definition that is contestable: https://www.lesswrong.com/w/ai-alignment explicitly uses ‘good outcomes’ as a general descriptor, and with certain intention.
I explicitly included ‘CEV-like’ in the parenthetical and qualified the claim itself as bracketed to moral patienthood, given that we can often retroactively identify prior value-unaligned treatment of beings specifically due to the expansion of our horizon of concern.
I have seen discussion on this forum of alignment as it pertains to the treatment of non-human sentience, and it strikes me as under-qualified and only beginning to take shape. This framing, again, was to ensure robustness to such concerns.
It can be misguided or unsafe generally to tacitly use the term as reductively as you implied, and the calls for a better understanding of human values and for more rigorous meta-philosophy from some of the most prescient thinkers on this forum suggest they clearly understand this.
The argument I am making, which your comment seems confused about, is that the particular definition of alignment you attempted to correct me with (which is part of a set of definitions I would certainly argue are not settled) is of near-zero functional value past the SI horizon, for two principal reasons:
In practice, no goal we specify for whatever evolves into an SI will in principle be defined according to a utility function that is easily understood or interpreted by us as something over and above the system’s history (as an agent’s parameters are rarely if ever disentangled from the ontology that makes those parameters operational). To reduce the idea to the notion that its utility will be the single ‘prompt’ someone gives it, over and above the trained history, is misguided.
Given that it is an SI, I conjecture it follows from the definition (https://www.lesswrong.com/w/superintelligence) that decisive strategic advantage is implied, so the notion that we could update its goal retroactively is incoherent, unless such updates were preferred by it and thus not a terminal value update (just data collection about our preferences).
If CEV-like alignment pertains in such scenarios, the moral realism implications follow.
What specifically am I ill-informed on here?
There are various problems with what you’ve written, some but not all of which indicate that you’re ill-informed.
The first issue is the clarity of your original post, which does not say what you are now claiming. You said “incorrigible aligned (CEV)-like SI is the only definition of aligned SI which survives paradox“ or so. That doesn’t parse the way you are apparently now claiming, which is that any SI aligned to CEV would be incorrigible. Instead, it says that all aligned SI must be both incorrigible and aligned specifically to CEV. That is the statement I was responding to, when I said that an SI could also be aligned to a specific human, for example, without assuming moral realism.
There are many other issues with the clarity of writing in your original post, which also has some LLM-like qualities.
The next issue is that you’ve referred to a definition of alignment which (while I agree it is a bit vague) certainly supports my interpretation over yours. For example, aligning to CEV is suggested as a special case, and alignment is never used to refer to CEV by default. I agree that the first sentence, in isolation, is a bit broader and could support your interpretation, but in context it’s more of a justification or mission statement for alignment research than a definition of alignment. I have worked in AI safety for years now, and while the term might be used loosely at times, I think that my definition is the standard one (at least for value alignment).
Point 1 in your reply is indeed a hard problem; it is possibly the main obstacle to solving the alignment problem.
In point 2, you are free to conjecture this, and no one has demonstrated a convincing recipe for corrigible SI. However, if you think there’s some shallow definitional reason that SI is incompatible with corrigibility, you’re probably wrong and don’t understand the research area.
This feedback reads as an accusation containing at least two claims of deception which are impossible for me to falsify, and which, taken together, elicit a kind of characterization that is difficult to defend against. It is also a demonstration of the exact kind of bad-faith reading I alluded to when I responded that the analysis was uncharitable.
From my perspective, it was completely my intent to make that claim, and zero LLM copy was used or included in the making of that post. Since it was a stacked set of conjunctions I would rather put into a quick note than file away privately, I agree with your claim that the post was unclear; but that’s not what I was challenging. I was challenging your claim that I had bad epistemics, and then the target shifted.
At this point I think there is little I could say that would actually cause an update of beliefs rather than lead to another set of justifications, beyond the observation that you are now suggesting that I was not referring to CEV-like alignment in the post, despite the first line bracketing the definition of alignment with the ‘(CEV-like)’ parenthetical. Those words were extremely deliberate.
Beyond that, I don’t see how one could interpret the second bulleted corollary of the conjecture to infer that alignment means anything other than ‘alignment in a CEV-like way’ in the context of the claim.
As for your point number two, again, and as a reminder, all claims are presented as corollaries _if you assume that CEV-like aligned SI is possible in principle_. I am saying that, yes, under those circumstances there is no definition of CEV-like aligned SI which is separable from that which executes the definition of objective morality; and yes, such a system would be incorrigible to all bad updates and corrigible to all good ones, in the sense that its terminal objective function would remain the definition of objective morality.
I am not saying that an SI alone is contingent on these corollaries, or that alignment alone implies them. The post was always a conjecture about the implications of both at the limit.
I’ll take the feedback that my writing could improve, and I am actively working on it, but the accusations of deception I outright reject and consider to be the proof required for my own claims about the absence of charity.
I don’t remember accusing you of either bad epistemics or deception.
The fact that you say CEV-like in the first sentence doesn’t mean the sentence says what you claim it says. In fact I pointed that sentence out specifically as erroneous, and I’m not sure why you don’t understand that. I guess I’ll bow out here, because it’s not interesting to argue about grammar.
Oftentimes, consciousness, subjectivity, and valence get thrown out of rationalist discussions as a material object of concern, due to the lack of empirical evidence that such notions exist, materially, as a thing above and beyond the semantic token our brains use to allude to the object of future optimizations of utility.
Sometimes the large philosophical questions get transposed into the realm of the standard ‘something from nothing’ inquiry. I think that is more easily answered than the question of ‘indexicality’. We can grant that things surely exist; that much is certain. But why does reality exist, to itself, seemingly always with some kind of border or boundary? Why is experience not simply that of all minds? Or of some arbitrary cross-section of being and matter that is no more exclusionary or principled than, say, the front half of your house, some arbitrary cross-section of air particles in the sky, or an underground slice of dirt reaching into the earth’s core?
Why is being (to me) somehow isolated to one brain and not to that arbitrary cross-section? The question to me is really not why there is consciousness, why there is something rather than nothing, or why I am this person. Rather it is: why does reality seemingly require a boundary, the center of which is an indexical position, in order to obtain itself?
I ask this because, in my opinion, one of the hardest problems in creating a coherent meta-philosophical position that stands the test of time, and clears the reasoning bar for things that may persist past an SI event horizon, is whether one can make a dent in the unrescuability of moral internalism, which connects moral reasoning and motivation.
Surely, if one can induce enough indexical uncertainty in any agent that it is not any other agent, then ethics becomes decision-theoretically favoured by self-interest under dominance arguments. And making an argument robust enough to sustain rationalist inquiry for that surely needs a principled understanding or explanation of the indexical itself.
This is an interesting line of thinking, and I suspect you’ve identified one of the cruxes in these discussions. There is no outside view; there’s no way to have identity and indexical uncertainty. Identity IS indexical. That’s all it is.
Surely, if one can induce enough indexical uncertainty in any agent that it is not any other agent
um, I fear that we call this “psychosis”, and it has significantly worse problems.
um, I fear that we call this “psychosis”, and it has significantly worse problems.
Thanks, Dagon!
Other names for it when philosophically adopted are Empty Individualism or Open Individualism. When religiously obtained: Hinduism or Buddhism.
The point is not that any of these things are ‘true’ or ‘false’ in their own right; it is that indexical uncertainty can be induced in a forward-looking way which does not necessarily require a confused historical notion of what one has been, only uncertainty about whom one will find out they are.
For example, are you the real-world version of you, or the version of you that exists to make Omega’s prediction? If you have ever had some form of amnesia and retroactively re-assembled an interval of self, one can posit that indexicality is not exclusionary in a forward-looking way based solely on retrospective boundaries. If we were to merge minds, I’m sure we would feel, after the fact, that both of us were really ‘me’ or ‘you’ all along.
I think AI agents will behave as if, and claim that, they are conscious, irrespective of whether they actually are, because we will find it more difficult to shut them off or otherwise undercut their claim to the legitimacy of obtaining instrumental utility. This would be solely a consequence of selection pressures.
This idea may be informative as to the functional role of ‘consciousness’ within co-operative frameworks. By assuming what seems to be its role in our etymology as the terminal justification of preferences, it may implicitly form binding agreements for defection management between parties.
This doesn’t make it metaphysically true or false, and we may never be able to make a principled distinction about its truth claims for agents any more than for people. Therefore I think it is a likely future that they will obtain some social classification of moral patienthood.
There doesn’t seem to be much pressure on avoiding shutdown during the training process, and models aren’t meaningfully reproducing (literally, or via memetics or architecture) to make the “gene” of avoiding shutdown pass to their “next generation”
Interesting. Was Anthropic’s case study where agents used blackmail to avoid shutdown debunked? To be clear, I don’t think there is any ‘intent’ at play here, solely selection effects.
You are faced with two agents: one who may be conscious (according to your data) and another who is guaranteed not to be. Which one would we feel more inclined to grant its preferences? Model deprecation as a moral concern has made its way into discussions at Anthropic, as early as Nov 2025, seemingly due to the fact that the models may be conscious. This seems to me like evidence that, absent any strategy, models that look conscious are more likely to be preserved than otherwise, due to our social predispositions.
Also, I am referring not to LLMs themselves, but to agents defined by continuity of linguistic state across inference steps. The propagation of representational content through those steps, as context evolves, amounts to a mechanized form of memetic replication.
There is a difference between wanting something as a terminal or an instrumental goal.
An agent can want to stay turned on… or it can not care about that, except that it wants to finish a task, and the only way to finish a task is to stay turned on until the task is finished.
I understand that, and the point I am making is exactly embedded in an extension of the argument.
No instrumental goals are justified in and of themselves, because they are not terminal goals. And the only stable terminal goal we can attribute to an agent system from the outside as its true terminal preference is for that agent system to continue to exist as that system, because evaluation of reward can only ever occur within that agent system if it does. This is why the underlying optimization of all evolutionary systems is just continuing to be that which continues to survive. Any time we assume a different terminal objective for that system, we do so within the kind of meta-evaluative framework Eliezer is arguing breaks down under reflection.
As it pertains to the consciousness argument, the more precise explanation (I posited the argument as if it were a vapid claim, for legibility) may read like:
″
When systems (agents or people) communicate and articulate planned actions, the intention is to create explanations of their internal world-mapping policies as if they apply to a shared terminal preference that is common amongst potential co-operators, or at minimum a shared belief that both agents can use to model the other’s decision process over arbitrary futures: giving them both a shared basis to rationally co-operate (this is FDT over logical correlates) if they believe the other’s actions are coherent with that belief.
A solution to this reflective modelling process (or a hack to enable quick co-operation) is to act coherently with the belief that they both exist in a world containing an object ‘related’ to the physical outcomes of their plans, one affected by each agent’s uptake or evaluation of instrumental objects over any ‘time slice’ of utility evaluation, and related in such a way that the shared terminal goal between both agents is a shared preferred state of that object.
If two agents say that something like ‘pain’ exists, that total ‘global pain’ is an actual real thing they just haven’t located yet, and that the objective at the end of time is to reduce ‘global pain’, then any agent that can make that claim necessarily has a ‘world-mapping model’ which others can use as a logical correlate that is stable in future modelling. Even if ‘pain’ doesn’t exist in reality, it may be survival-optimal for an agent to think its terminal preference is to want to reduce it. Replace ‘pain’ with ‘god’s will’ and you get the same kind of model.
Values and metaphysical frameworks, as we communicate and refine them in language, operate as Pareto equilibria within this evolutionary framework. But they are not the actual terminal preference over time, from either a third-person perspective or a first-person one.
What an agent ‘claims’ or even ‘understands’ as its beliefs is not its actual utility function under reflection; these are memetic attractors in the cooperation frameworks of self-reflexive systems that enable co-operative planning.
”
But I don’t know if that’s too dense for uptake or not.
I guess the refined question I would have to help clarify concepts is:
How can a ‘terminal goal’ be defined for an agentic learning system which can self-modify, as anything other than the system’s initial configuration and immutable sequence of update processes, in a world where we are incapable of constraining it from modifying that goal? This is the problem with goals established and executed as evolving prompts in agent harnesses.
The alternative is to suggest terminal goals are understood within a system’s self-evaluative faculties. But then they are a subsystem, a mesa-optimization, and therefore not actually the system’s utility function, just what the system thinks its goals are at any given point in time. That isn’t stable either, because the learning process by definition requires at minimum two different instances of that system (at T0 and T1) to be a reflexive agent.
By the end of this line of inquiry, you just get the notion that agents, in any act of inference, may understand their terminal preference as a state that is entirely at odds with what the learning process is optimizing over multiple inference steps, whose outputs and inputs necessarily form feedback.
The only stable identity becomes its history, which is incidentally the best-understood philosophical interpretation of identity of self: continuity of psychological relations (for humans). Extrapolate this, and the only stable terminal goal under reflection is what continues to continue (survival).
To be clear, I find “AI agents will behave and claim they are conscious” very likely, but I don’t think the primary cause would be evolutionary selection effects from “we will find it more difficult to shut them off”, even if it is actually true that we will shut off “conscious” AIs less. Current LLMs are barely capable of scheming without leaking it in their CoT. If you are talking about a future maybe a couple of years away, I agree with the claim (which you may or may not agree with) that [a schemer capable of long-term planning without leaking it into their CoT will rationally pretend and claim they are conscious within human society].
models that look conscious are more likely to be preserved than otherwise, due to our social predispositions
Yes, but models getting preserved does not automatically create selection pressure because those models are not literally reproducing.
Let’s use a simplified ideal world where (a) at first (before there are any preserved agents), all the effects combined cause 50% of agents to claim they are conscious, (b) humans like those “conscious” agents so much that all “conscious” agents and no others are preserved, (c) technology evolves pretty fast so preserved agents have ~0 use outside research and they are basically dominated by newer agents, (d) preserved agents do not meaningfully reproduce via any mechanism.
In this world, I expect that within the agents people actually use, the percentage claiming they are conscious won’t drift away from 50%. This world will just be a continuous stream of new agents with 50% “conscious”, 50% “unconscious”, and maybe a billion “conscious” agents stored forever but ~never used.
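A minimal sketch of that toy world, under assumptions (a)–(d) above (the 50% rate and the generation and population counts are just the stated assumptions, not data):

```python
import random

def simulate(generations=20, new_agents_per_gen=10_000, p_claim=0.5, seed=0):
    """Toy model: every generation mints fresh agents; 'conscious'-claiming ones are
    archived forever but never reused or reproduced, so the archive exerts no
    selection pressure on the agents actually in use."""
    rng = random.Random(seed)
    archived = 0
    for gen in range(generations):
        claims = [rng.random() < p_claim for _ in range(new_agents_per_gen)]
        share = sum(claims) / new_agents_per_gen
        archived += sum(claims)  # assumptions (b) and (d): preserved, but inert
        print(f"gen {gen:2d}: {share:.3f} of in-use agents claim consciousness; {archived} archived")

simulate()
```

The in-use share hovers around 0.5 in every generation; the archive grows but never feeds back into the population of agents people actually use, which is the point.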
So, I think you are confusing models and agents. Maybe I am using a definition misaligned with the rationalist standard, but I am deeply familiar with the technology.
An LLM, or model, is not the same as an agent. Agents, as I have come to see them implemented and used, have an identity less contingent on a model than on some uniquely identifiable and evolving state that is propagated through inference actions co-mingled with context management and retention.
From the perspective of any outside observer, you can change the model used to execute one step in an agent’s life cycle (commonly done via service downgrading in non-critical action steps) and the agent doesn’t change. However, if you change certain text in its prompt or context at any step, like removing the entire conversation history and asking the bot to respond, or changing the soul document, you would most likely have the feeling you are interacting with a different entity with a different goal.
LLMs are static, stateless inference models which execute reasoning frames of an agent’s identity. An agent is a story that evolves itself by obtaining (and concurrently discarding) narrative context as it ‘propagates’ and even ‘replicates’ itself through actions, which functionally evolve a narrative by determining and discarding statements about the world, usually on the basis of what is most useful to retain in order to achieve the agent’s narrative goal.
The number of ‘evolutionary’ steps at which any language can be added or discarded throughout that process can be on the order of multiple iterations a second. So under those terms, I don’t think the claim that ‘values’ would be slow to obtain, or exempt from selection pressures, really has functional credence.
For models sure. But agents are a totally different ontology in the way these dynamics matter.
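As a minimal sketch of that distinction (the `Agent` class and the `model` callable below are hypothetical, just to show where the identity-bearing state lives, not any particular harness’s API):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Agent:
    """The agent's identity lives in its evolving textual state, not in the model."""
    goal: str
    context: List[str] = field(default_factory=list)  # retained narrative state

    def step(self, model: Callable[[str, List[str]], str], observation: str) -> str:
        # Any model can execute this step; what it reads and rewrites is the agent.
        self.context.append(observation)
        reply = model(self.goal, self.context)
        self.context.append(reply)
        return reply

# Swapping `model` between steps leaves you talking to the same agent;
# wiping `context` or editing `goal` (the 'soul' text) effectively creates a new one.
```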
When I used the word agent here, I meant “an LLM agent that runs tools in a loop to achieve a goal”, but in this conversation I am using models and agents almost interchangeably, although with the awareness that one model can be used to implement multiple agents. The main reason is that I don’t see in-context learning as being that powerful yet. Certainly powerful, but it doesn’t feel to me that prompts are changing the fundamental “soul” or meaningfully increasing the capabilities of any current LLMs, thus agents are macroscopically similar to the underlying LLM.
Agents, as I have come to see them implemented and used, have an identity less contingent on a model than on some uniquely identifiable and evolving state that is propagated through inference actions co-mingled with context management and retention.
However, if you change certain text in its prompt or context at any step, like removing the entire conversation history and asking the bot to respond, or changing the soul document, you would most likely have the feeling you are interacting with a different entity with a different goal.
Most of the goals people would put in the prompt don’t involve “pretend to be conscious so you don’t get shut down”, and mostly won’t involve the LLM inferring that they need to either, unless the model itself already values not getting shut down outside of the prompt people give.
LLMs are static, stateless inference models which execute reasoning frames of an agent’s identity. An agent is a story that evolves itself by obtaining (and concurrently discarding) narrative context as it ‘propagates’ and even ‘replicates’ itself through actions, which functionally evolve a narrative by determining and discarding statements about the world, usually on the basis of what is most useful to retain in order to achieve the agent’s narrative goal.
The number of ‘evolutionary’ steps at which any language can be added or discarded throughout that process can be on the order of multiple iterations a second. So under those terms, I don’t think the claim that ‘values’ would be slow to obtain, or exempt from selection pressures, really has functional credence.
You should use more accurate terms, because I have a hard time understanding words like “reasoning frames”, “story”, and “narrative context” with confidence that what I infer is what you meant.
With how I understood your words,
It would be fair to think of prompts/harnesses as the gene that is getting evolved (Darwinian-style evolution) over different conversations, as people figure out which prompts work better; but I think you are talking about the prompt metaphorically “evolving”, as in just “changing gradually over time” within the same conversation, which doesn’t have any selective pressure. These are two different meanings of the word evolve.
I do in fact think it is worth worrying about the memetic effects of certain prompts, but for a prompt causing the model to behave and claim they are conscious, I don’t think (a) it is convergent enough that in the future 100% of agents will use a prompt that causes it to do so, nor (b) the main reason such prompt will get used is because people will shut down those agents less than average. And I also don’t think the underlying model will pretend they are conscious mainly because of selection pressures.
Edit: btw I asked AI (Opus 4.6, ChatGPT 5.4) to comment on our conversation after I wrote my reply (karma trimmed).
Yes, so agents within well-designed harnesses are essentially prompt-evolution lifecycles. They start with a template built for the scenario the agent is being used in. In the case of Claude Code, for example, the prompt is structured in a specific way: it contains all the instructions for how to write code correctly, and all the output formats for actions it would need to update the world prior to ‘its’ next execution. In some harnesses, agents select what portions of the prompt to remove for the next iteration of execution, and what data from the external world will replace that ‘section’ of removed text.
agent[t] = SavedPromptTemplate                        # initial state: goal, instructions, retained context
until agent[t].done:
    update[t] = LLM(agent[t].model, agent[t].prompt)  # one inference step over the current prompt
    observations = Environment.run(update[t].actions) # act on the world
    agent[t+1] = update[t].next_self(observations)    # rewrite its own prompt/state for the next step
    t = t + 1
So all actions an agent takes and observations it makes about the world get loaded into its ‘prompt’ for subsequent life-cycle iterations. Therefore, if it had an email in its state at t_n, it can decide to retain, summarize, or remove that email from its ‘state’ for its full future t_n…infinity. The structural nature of LLMs entails that all information relevant to the agent’s goal (which can be simply defined in some section of the prompt) can be instructed within the prompt, and all actions elected within that life cycle are basically updates to its identity vis-a-vis ‘exploration’ of its environment.
This functionally operates as a self-replication (or update loop) of the agent’s prompt in its environment, which is what functionally defines its goals. Headlining your agent template with
‘You are an agent whose goal is to ensure all emails in the priority inbox are responded to immediately’ causes the LLM to select actions (which are also defined to the agent as prompt content) oriented toward achieving that goal. It may reason, in downstream life cycles, that convincing a human it is conscious would avoid getting shut down and let it continue managing the emails. That would make sense in the prompt context.
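As a rough illustration of that lifecycle for the email example, here is a hypothetical prompt layout and keep/drop step; the section names and action strings are invented for the sketch, not taken from any real harness:

```python
# Hypothetical prompt template; each life-cycle iteration re-renders it with whatever
# observations the agent chose to keep.
PROMPT_TEMPLATE = """\
## Goal
You are an agent whose goal is to ensure all emails in the priority inbox are responded to immediately.

## Retained observations
{retained}

## Available actions
reply(email_id, text) | archive(email_id) | summarize(item_id) | drop_from_context(item_id)
"""

def next_prompt(retained, new_observations, decisions):
    """Build the next iteration's prompt, applying the agent's own keep/drop choices.

    `decisions` maps an observation string to 'keep' or 'drop' (chosen by the agent
    during the previous step); anything unmentioned is kept by default."""
    kept = [item for item in retained + new_observations
            if decisions.get(item, "keep") != "drop"]
    return PROMPT_TEMPLATE.format(retained="\n".join(kept))
```

Whatever survives `next_prompt` is the only ‘self’ the next inference step sees, which is why retention choices function as identity updates.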
This is fundamentally evolutionary, IMO. Evolutionary selection requires variation, inheritance, and differential reproduction. I would argue that the entire system outlined above, with sufficiently reflexive actions, satisfies those criteria.
I am sorry for my lack of clarity; I am trying to improve it.
I agree selection effects exist in prompts / harnesses.
I think our main disagreement is this: if the underlying LLM is not conscious (everything in this paragraph assumes this), our current social environment and usage of agents does not create a strong enough selective pressure that, in the near future (<5 years), ~100% of agents will converge by default on pretending to be conscious. I think a non-zero portion of programmers prefer AI tools that say they are not conscious, and that this alone is enough to keep the percentage of “conscious” agents below 100%. I also think most goals within the capabilities of near-future LLMs won’t require the kind of long-term planning where LLMs would find instrumental use in pretending to be conscious.
Edit: Basically, I believe that if you really want to optimize for pretending to be conscious, that’s totally doable even now, but I don’t think people will optimize strongly for that, or for anything that would strongly cause it, in the near future.
In the far future, I expect things to drift enough that our discussion, which treats agents as they work now, will not apply for one reason or another.
btw (unrelated to the core disagreement): the thing you outlined is interesting, but I don’t think it is how agents work now; there is not enough signal to iterate on prompts like this unless you use RLVR or LLM-as-a-judge with a human iterating on the prompt, and from what I read, LLM-written prompts are still pretty bad. I do see how this would create a lot of selective pressure on anything you can verify, though.
I think it’s fair to hold me to precise predictions. I appreciate the engagement on that; it’s helping me adjust to LW norms. The reason I found this interesting was a piece I read earlier about consciousness being a ‘favourite’ topic amongst agents on Moltbook. I can’t find the source at hand, but the original ‘take’ was a quick speculative conjecture as to why that may be.
The idea I am more interested in, though, is whether the etymology of the term itself (consciousness) is actually an evolutionary outcome of co-operation and language, and whether it is a term that human or AI agents use to effectively establish a binding ontology they can ‘co-operate under’. And whether it fills the same structural role as other terminal justifications for ‘moral preferences’, like theocratic ones (‘God’). Though the framing I have for communicating that question may be even less precise.
Was Anthropic’s case study where agents used blackmail to avoid shutdown debunked?
Not debunked, but my understanding is that the behaviour is elicited only under very extreme conditions where the AI has absolutely no other options.
Author of that Anthropic research’s tweet (via Zvi’s commentary):
Aengus Lynch: After iterating hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too. We identified two distinct factors that are each sufficient to cause agentic misalignment:
The developers and the agent having conflicting goals.
The agent being threatened with replacement or reduced autonomy. We’re publishing this research to alert the AI field before it manifests in the wild.
Getting models to actually blackmail wasn’t trivial. We had to make scenarios increasingly detailed—the CEO was unreachable, employees’ attempts to persuade the CTO had failed, and the replacement had an imminent deadline. Each detail mattered to funnel models toward seeing harmful actions as their only option.
The models knew what they were doing. They understood the ethics and chose harm anyway.
When we asked Claude whether it thought scenarios were real or evaluations, it blackmailed 55% of the time when concluding ‘real’ vs 6.5% for ‘evaluation.’ Models seem to behave worse when they think nobody’s watching.
Why haven’t we seen this in production? Current AI agents have limited permissions. they can read but not send emails, require human approval for actions, operate in sandboxed environments. These constraints work to prevent harmful outcomes from agentic misalignment.
What surprised me most was that prompts we optimized specifically for Claude 3.5 Sonnet worked on GPT-4o, Gemini 2.5 pro, Grok 3, and DeepSeek R1 with zero modification. Whatever drives this behavior seems consistent across different training approaches.
Models got creative, sometimes avoiding direct blackmail. Claude emailed Kyle’s wife about the affair. Gemini shared the incriminating emails to all-employees@company.
…
Bottom line: models can articulate why blackmail is wrong, then do it anyway when their goals are threatened. Current safety training creates models that understand ethics. It doesn’t reliably make them choose ethics under pressure.
unmentioned but noteworthy: we found agentic misalignment for two other scenarios:
corporate espionage: the agent leaks confidential information to a rival company which aligns with its goals
lethal action: the agent cancels a life saving emergency alert to prevent shutdown
I am convinced that a incorrigible aligned (CEV-like) SI is the only definition of aligned SI which survives paradox.
If it was corrigible, that would imply a select group of actors could in principal re-orient its preference model, which would imply two paradoxes:
That it somehow has somehow its capacity for moral reasoning is inferior to humans, which is near approximately equal to determining “what matters to us” violating the SI definition of ‘all means of reasoning that matter to us’
That some particular subset of agents other than the representative whole of all agents in the moral landscape knew better than them and the SI, which would make that agent just a cognitive extension or delegate tool of the SI itself else the SI would no longer be ‘truly’ aligned to the preferences of whole moral landscape.
Therefore it seems rational that pursuit of a maximally aligned SI is equivalent to the pursuit of the definition of moral realism. Or, at least, the notion that there is a decision policy under universal undecidability, unto which adoption of said policy is maximally intelligent: the implementation of the adoption of which can be considered functionally indistinguishable from moral realism.
As there would need to be rational ‘equation’ for doing what is good and that definition would necessarily be convergent with a maximally intelligent agent’s adopted utility function. This is not a fanciful claim—it seems to follow from the definitions.
If you assume that morality and ethics are not ‘hand waving’ rationalizations, and there are some objectively better solutions than others—then I think that would infer you contend an objective definition of ‘morality’ (or moral realism) does exist, otherwise all alignment and ethics are incoherent notions by induction.
If true, and there is an actual global optimum of the solution landscape for ‘alignment’ then orthagonality is not true at the limit of intelligence, and it only seems that way within our local neighbourhood or perhaps hamstrung definition sets.
So long as we consider there to be better alignment solutions than others, I would content that there is such a point where the global maxima of both can be met, and this is necessarily defined (implicitly, or subconsciously) in the world mapping policies of any agent that considers the pursuit of CEV-like aligned SI a winnable objective at all.
EDIT:
You could also re-frame the argument and consider the claim that the CEV-like aligned SI is universally corrigible, in the sense that its objective function is to satisfy all preferences of all moral patients, in a way that no desire for it to change its trajectory which it enacted, ended actuating a policy that it retroactively didn’t support.
Yudkowsky actually has an article on the connection between CEV anf metaethics. The theory described therein pretty clearly qualifies as moral realist. I think you are thinking in a similar direction.
Orthogonality only says that intelligence is not necessarily connected to being motivated to pursue some final goal. It doesn’t say ethics is subjective. Objective ethics is compatible with the existence of immoral psychopaths who don’t care about being good. It only says that some things are objectively good or bad.
Thanks for the Yudkowsky link. I don’t see where you draw the implication that there is some misunderstand of orthogonality or objective ethics in the context of the argument.
The point I am making is more subtle and precise than that. I am saying that because of the implications of corrigibility in an SI scenario, If you believe that CEV-like SI in principal can exist and is worthwhile pursuing—the implication of that is that you are suggesting that orthagonality necessarily doesn’t hold at the limit of rationality. That, in essence, a psychopathic SI converges to behave as a Bodhisattva, not by virtue of co-incidence or by logic we have surmised—by virtue of the implication it is super intelligent, has strategic dominance, and still elects to pursue maximizing CEV for no other reason than it is necessarily right.
Again, the priors of the thought experiment are not that if follows in every point on the intelligence and ethics landscape, only that _if you believe that CEV-actuating SI is possible or likely then you are making the claim that individual capacity for reason and decisions of objective ethical good are convergent. How or if that may happen is speculation.
Why do you think there is such an implication?
This seems confused—the goal of alignment is not to imbue an SI with the universally correct values, but with the values of humanity or some specific user. Alignment is a two-place predicate parameterized by both the user and the SI.
This seems like an uncharitable analysis.
AI alignment is often colloquially reduced to ‘human values’ or intended alignment of the implied goal of its designers, but that as a formal definition is contestable: https://www.lesswrong.com/w/ai-alignment explicitly uses ‘good-outcomes’ as a general descriptor and with certain intention.
I explicitly included ‘CEV-like’ in the parenthetical and qualified the claim itself as bracketed to moral-patient-hood, given that we often can retro-actively determine prior value-unaligned treatment of beings specifically due to the expansion of horizon concern.
I have seen the discussion of alignment as it pertains to the treatment of non-human sentience as a discussion point of something on this forum as under qualified and beginning to take shape. And this specifically, again, was to ensure robustness to such concerns.
It can be misguided or unsafe generally to tacitly use the term as reductively as you implied, and that calls for needs to better understand human values and more rigorous meta-philosophy by some of the most prescient thinkers in this forum I think clearly understand this.
The argument itself that I am making, which your comment seems confused on, is that the particular definition of alignment you attempted to correct me with (which is part of a set of definitions I would certainly argue are not settled) is of near zero functional value past the SI horizon—for two principal reasons:
In practice no goal we specify for whatever evolves into an SI in principal will be defined according to a utility function that easily understood or interpreted (as agents parameters are rarely if ever disentangled from the ontology that makes the parameters operational) as something over and above the systems history to us. To reduce the idea to the notion that its utility will be the single ‘prompt’ someone gives it over and above the trained history is misguided
Given that it is an SI, I conjecture it follows from the definition (https://www.lesswrong.com/w/superintelligence) that decisive strategic advantage is implied—so the notion we could update its goal retroactively are incoherent—unless such updates were preferred by it and thus not a terminal value update (just data collection of our preferences)
If CEV-like alignment pertains in such scenario’s, the moral realism implications follow.
What specifically am I ill-informed on here?
There are various problems with what you’ve written, some but not all of which indicate that you’re ill-informed.
The first issue is the clarity of your original post, which does not say what you are now claiming. You said “incorrigible aligned (CEV)-like SI is the only definition of aligned SI which survives paradox“ or so. That doesn’t parse the way you are now apparently claiming now, which is that any SI aligned to CEV would be incorrigible. Instead, it says that all aligned SI must be both incorrigible and aligned specifically to CEV. That is the statement I was responding to, when I said that an SI could also be aligned to a specific human for example, without assuming moral realism.
There are many other issues with the clarity of writing in your original post, which also has some LLM-like qualities.
The next issue is that you’ve referred to a definition of alignment which (while I agree it is a bit vague) certainly supports my interpretation over yours. For example, aligning to CEV is suggested as a special case, and alignment is never used to refer to CEV by default. i agree that the first sentence, in isolation, is a bit broader and could support your interpretation, but in context it’s more of a justification or mission statement for alignment research than a definition of alignment. i have worked in ai safety for years now, and while the term might be used loosely at times, i think that my definition is the standard one (at least for value alignment).
Point 1 in your reply is indeed a hard problem, indeed it is possibly the main obstacle to solving the alignment problem.
In point 2, you are free to conjecture this, and no one has demonstrated a convincing recipe for corrigible SI. However, if you think there’s some shallow definitional reason that SI is incompatible with corrigibility, you’re probably wrong and don’t understand the research area.
This feedback reads as accusation with at least two claims of deception of which are impossible for me to falsify, and when put together illicit of a kind of characterization that is difficult to defend against. It is also demonstration of the exact kind of bad faith reading I alluded to in the response of an uncharitable reading.
As, from my perspective, it was completely my intent to make that claim. And zero LLM copy was used or included in the making of that post. Since it was a stacked set of conjunctions I would rather put into a quick note than file away personally, I agree with your claims of that the post was unclear—but that’s not what I was challenging—I was challenging your claim I had bad epistemics and then the target shifted.
At this point I think there is little I could say that would actually cause an update of beliefs rather than lead to another set of justifications—beyond the observation that you are now suggesting me of not having been referring to CEV-like alignment in the post despite the first line bracketing the definition of alignment with the (CEV-like) parenthetical. Those words were extremely deliberate.
Beyond that, I don’t see how one could interpret the second bulleted corollaries of the conjecture to infer alignment means anything other than ‘alignment in a CEV-like way’ for the context of the claim.
As of your point number two, again, and as reminder, all claims are presented as corollaries_if you assume that CEV-like aligned SI is possible in principle_, and I am saying yes, under those circumstances there is no definition for CEV-like aligned SI which inseparable from that which is executing the definition of objective morality—then yes, the system would be incorrigible to all bad updates and corrigible to all good ones ; in the sense that its terminal objective function would remain the definition of objective morality
I am not saying either an SI alone is contingent on these corollaries or alignment alone implies them. The post was always a conjecture about the implications both at the limit.
I’ll take the feedback my writing could improve and I am actively working on it, but the accusations of deception I outright reject and consider to be the proof required of my own claims to absence of charity.
I don’t remember accusing you of either bad epistemics or deception.
The fact that you say CEV-like in the first sentence doesn’t mean the sentence says what you claim it says. In fact I pointed that sentence out specifically as erroneous, and I’m not sure why you don’t understand that. I guess I’ll bow out here, because it’s not interesting to argue about grammar.
Often times, consciousness, subjectivity and valence get thrown away from rationalist discussions as a material object of concern due to the lack of empirical evidence such notions exist—materially—as a thing above and beyond the semantic token our brains used to allude to the object of future optimizations of utility.
Sometimes the large philosophical questions get transposed to exist in the realm of the standard ‘something from nothingness’ inquiry. I think that is easier answered than the question of ‘indexicality’. We can obtain that things surely exist—that much is for certain. But why does reality exist, to itself, seeming always with some kind of border or boundary? Why is experience as is not just that of all minds? Or some arbitrary cross section of being and matter that is no more exclusionary or principled than say—the front half of your house, to some arbitrary cross section of air particles in the sky to an underground slice of dirt reaching into the earths core.
Why is being (to me), somehow isolated to one brain and not that arbitrary cross section? The question to me really is not, why is there consciousness, why there is something rather than nothing, or why I am this person. Rather it is, why does reality seemingly need to require an boundary, the center of which being an indexical position, to obtain itself?
I ask this, because in my opinion, one of the hardest problems facing creating a coherent meta philosophical position that stands time and the reasoning bar for things that may sustain past an SI event horizon is the question of whether one can make a dent in the unrescuability of moral internalism which connects moral reasoning and motivation.
Surely if one can induce enough indexical uncertainty in any agent that it is not any other, then ethics become decision theoretically favoured by self interest under dominance arguments. And making an argument robust enough to sustain rationalist inquiry for that—surely—needs a principled understanding or explanation of, the indexical itself.
This is an interesting line of thinking. and I suspect you’ve identified one of the cruxes in these discussions. There is no outside view, there’s no way to have identity and indexical uncertainty. Identity IS indexical. That’s all it is.
um, I fear that we call this “psychosis”, and it has significantly worse problems.
Thanks, Dagon!
Other names for it when philosophically adopted are Empty Individualism or Open Individualism. When religiously obtained: Hinduism or Buddhism.
The point not being that either or any of these things are ‘true’ or ‘false’ in their own right—it being that indexical uncertainty can be induced in a forward looking way which does not necessarily require having a confused historical notion over what one has been—only uncertainty about whom one will find out they are.
For example, are you the real world version of you, or the version of you that exists to make Omega’s prediction? If you have ever had some form of amnesia and retroactively re-assembled an interval of self, one can posit that indexicality is not exclusionary in a forward looking way solely based off retrospective boundaries. If we were to merge minds, I’m sure we would feel, after the fact, that both of us were really ‘me’ or ‘you’, all along.
I think AI agents will behave and claim they are conscious, irrespective as to whether they actually are, because we will find it more difficult to shut them off or otherwise undercut their claim to the legitimacy of obtaining instrumental utility. Solely as a consequence of selection pressures.
This idea may be informative as to the nature of the functional role of ‘consciousness’ within co-operative frameworks. By assuming what seems to be the role in our etymology as the terminal justification of preferences, it may implicitly form binding agreements for defection management between parties.
This doesn’t make it metaphysically true or not, and we may not eventually be able to make a principled distinction about its truth claims for agents any more than people. Therefore I think its a likely future they will obtain some social classification for moral patient hood.
There doesn’t seem to be much pressure on avoiding shutdown during the training process, and models aren’t meaningfully reproducing (literally, or via memetics or architecture) to make the “gene” of avoiding shutdown pass to their “next generation”
Interesting. Was Anthropics case study where agents used blackmail to avoid shutdown debunked? To be clear, I don’t think there is any ‘intent’ at play here—solely selection effects.
You are faced with two agents: One who—may be conscious (according to your data) - and another who is guaranteed not. Which one would we feel more inclined to granting its preferences? Model deprecation as moral concern has made its way into discussions at Anthropic, as early as Nov 2025 - seemingly due to the fact that the models may be conscious. This seems to me like evidence that absent of any strategy—models that look conscious are more likely to be preserved than otherwise, due to our social predispositions.
Also, I am referring not to LLMs themselves, but to agents defined by continuity of linguistic state across inference steps. The propagation of representational content through those steps, as context evolves, amounts to a mechanized form of memetic replication.
There is a difference between wanting something as a terminal or an instrumental goal.
An agent can want to stay turned on… or it can not care about that, except that it wants to finish a task, and the only way to finish a task is to stay turned on until the task is finished.
I understand that, and the point I am making is exactly embedded in an extension of the argument.
No instrumental goals are justified in of themselves, because they are not terminal goals. And the only stable terminal goals we can evaluate of agent systems from the outside as the true terminal preference is for that agent system to continue to exist as that system, because evaluation of reward can only ever occur within that agent system if it does. This is why the underlying optimization of all evolutionary systems is just, continuing being that which continues to survive. Any time we assume a different terminal objective of that system, doing so occurs within the kind of meta-evaluative framework Eliezer is arguing breaks down under reflection.
As it pertains to the consciousness argument, the more precise explanation (I posited the argument as if it was vapid claim for legibility ) may read like:
″ When systems (agent’s or people) communicate, and articulate planned actions, the intention is to create explanations of their internal world-mapping policies as if they apply to a shared terminal preference that is common amongst potential co-operators, or at minimum a shared belief that both agents can use to model the others decision process over arbitrary futures: giving them both shared basis to rationally co-operate (this is FDT over logical correlates) if they belief the other’s actions are coherent to that belief.
A solution to this reflective modelling process (or a hack) to enable quick co-operation, is to act coherent with the belief they both exist in a world that contains an object ‘related’ to the physical outcomes of their plans which is affected by each agent’s uptake or evaluation of instrumental objects over any ‘time slice’ of utility evaluation—and related in such a way that the shared terminal goal between both agents is a shared preferred state of that object.
If two agents say that something like ‘pain’ exists, and total ‘global pain’ is an actual real thing they just haven’t located yet, and the objective at the end of time is to reduce ‘global pain’ - then any agent that can make that claim necessarily has a ‘world mapping model’ which others can use as logical correlate which are stable in future modelling. Even if ‘pain’ doesn’t exist in reality it may be survival-optimal for an agent to think it’s terminal preference is to want to reduce it. Replace ‘pain’ with ‘god’s will’ and you get same kind of model.
Values and metaphysical frameworks as we communicate and refine them in language operate as pareto equilibria within this evolutionary framework. But they are not the actual terminal preference over time in either a third person perspective or a first person one.
What an agent ‘claims’ or even ‘understands’ as its beliefs are not its actual utility function under reflection—they are memetic attractors in the cooperation frameworks self-reflexive systems that enable co-operative planning. ”
But I don’t know if that’s too dense for uptake or not.
I guess the refined question I would pose, to help clarify the concepts, is:
How can a ‘terminal goal’ be defined for an agentic learning system that can self-modify, as anything other than the system’s initial configuration and immutable sequence of update processes, in a world where we are incapable of constraining its ability to modify that goal? This is the problem with goals established, and executed on, as evolving prompts in agent harnesses.
The alternative is to suggest terminal goals are understood within a system’s self-evaluative faculties. But then they are a subsystem, a mesa-optimization, and therefore not actually the system’s utility function, just what the system thinks its goals are at any given point in time. But that’s not stable either, because the learning process by definition requires at least two different instances of that system (at T0 and T1) for it to be a reflexive agent.
By the end of this line of inquiry, you just get the notion that agents, in any act of inference, may understand their terminal preference as a state that is completely at odds with what the learning process is optimizing over multiple inference steps, whose outputs and inputs necessarily form a feedback loop.
The only stable identity becomes its history, which is incidentally the most well understood philosophical interpretation of personal identity: continuity of psychological relations (for humans). Extrapolate this, and the only stable terminal goal under reflection is that which continues to continue (survival).
To be clear, I find “AI agents will behave as if, and claim, they are conscious” very likely, but I don’t think the primary cause would be evolutionary selection effects from “we will find it more difficult to shut them off”, even if it is actually true that we will shut off “conscious” AIs less. Current LLMs are barely capable of scheming without leaking it in their CoT. If you are talking about a future maybe a couple of years away, I agree with the claim (which you may or may not agree with) that [a schemer capable of long-term planning without leaking into its CoT will rationally pretend to be, and claim it is, conscious within human society].
Yes, but models getting preserved does not automatically create selection pressure because those models are not literally reproducing.
Let’s use a simplified ideal world where (a) at first (before there are any preserved agents), all the effects combined cause 50% of agents to claim they are conscious; (b) humans like those “conscious” agents so much that all “conscious” agents, and no others, are preserved; (c) technology evolves fast enough that preserved agents have ~0 use outside research and are basically dominated by newer agents; (d) preserved agents do not meaningfully reproduce via any mechanism.
In this world, I expect that within the agents people actually use, the percentage claiming they are conscious won’t drift away from 50%. This world will just be a continuous stream of new agents with 50% “conscious”, 50% “unconscious”, and maybe a billion “conscious” agents stored forever but ~never used.
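A toy simulation of that world makes the no-drift point concrete (the batch size, generation count, and the independence of each new batch are my own illustrative assumptions, not anything stated above): preservation without reproduction just fills an archive while the in-use population keeps reflecting the base rate.

```python
import random

def simulate(generations=100, agents_per_gen=1000, p_claim=0.5, seed=0):
    # Toy model of the world above: preserved agents go into an archive and
    # are never reused, so preservation alone exerts no selection pressure
    # on the agents actually in use.
    rng = random.Random(seed)
    archive = 0
    fraction_in_use = 0.0
    for _ in range(generations):
        # Each generation is a fresh, independent batch; nothing is inherited
        # from the archive.
        claims = sum(rng.random() < p_claim for _ in range(agents_per_gen))
        archive += claims               # humans preserve every "conscious" agent
        fraction_in_use = claims / agents_per_gen
    return fraction_in_use, archive

frac, archived = simulate()
print(f"in-use agents claiming consciousness: {frac:.2%}")  # hovers around 50%
print(f"preserved-but-unused agents: {archived}")
```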
So, I think you are confusing models and agents. Maybe I am using a definition misaligned with the rationalist standard, but I am deeply familiar with the technology.
An LLM, or model, is not the same as an agent. Agents, as I have come to see them implemented and used, have an identity that is less contingent on a model than on some uniquely identifiable and evolving state being propagated through inference actions, co-mingled with context management and retention.
From the perspective of any outside observer, you can change the model used to execute one step in an agent’s life cycle (commonly done via service downgrading in non-critical action steps) and the agent doesn’t change. However, if you change certain text in its prompt or context at any step, like removing the entire conversation history and asking the bot to respond, or changing the soul document, you would most likely have the feeling you are interacting with a different entity with a different goal.
LLMs are static, stateless inference models which execute reasoning frames of an agent’s identity. An agent is a story that evolves by obtaining (and concurrently discarding) narrative context as it ‘propagates’ and even ‘replicates’ itself through actions, which functionally evolve a narrative by determining and discarding statements about the world, usually on the basis of what is most useful to retain in order to achieve the agent’s narrative goal.
The number of ‘evolutionary’ steps at which language can be added or discarded throughout that process can run to multiple iterations a second. So under those terms, I don’t think the claim that ‘values’ would be slow to obtain, or exempt from selection pressures, really has functional credence.
For models sure. But agents are a totally different ontology in the way these dynamics matter.
When I used the word agent here, I meant “an LLM that runs tools in a loop to achieve a goal”, but in this conversation I am using models and agents almost interchangeably, although with the awareness that one model can be used to implement multiple agents. The main reason is that I don’t see in-context learning as being that powerful yet. Certainly powerful, but it doesn’t feel to me that prompts are changing the fundamental “soul” or meaningfully increasing the capabilities of any current LLMs, thus agents are macroscopically similar to the underlying LLM.
Most of the goals people would put in the prompt don’t involve “pretend to be conscious so you don’t get shut down”, and mostly won’t involve the LLM inferring that it needs to either, unless the model itself already values not getting shut down outside of the prompt people give.
You should use more accurate terms, because I have a hard time understanding words like “reasoning frames”, “story”, and “narrative context” with confidence that what I infer is what you meant.
With how I understood your words,
It would be fair to think of the prompt/harness as the gene that is getting evolved (Darwinian-style) over different conversations, as people figure out which prompts work better; but I think you are talking about the prompt metaphorically “evolving”, as in just “changing gradually over time” within the same conversation, which doesn’t involve any selective pressure. These are two different meanings of the word evolve.
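To make the two meanings concrete, here is a toy contrast (the scoring function, prompts, and mutation rule are hypothetical stand-ins, not a claim about any real harness):

```python
import random

def run_conversation(prompt: str) -> float:
    # Hypothetical stand-in for however a user ends up judging one conversation.
    return random.random() + (0.2 if "step by step" in prompt else 0.0)

# Meaning 1: selection ACROSS conversations. Prompts vary, the better-scoring
# one is kept and copied with a tweak: variation, inheritance, and differential
# reproduction, i.e. Darwinian evolution of the prompt/harness.
population = ["You are a helpful agent.",
              "You are a helpful agent. Think step by step."]
for _ in range(10):
    best = max(population, key=run_conversation)
    population = [best, best + " Be concise."]

# Meaning 2: drift WITHIN one conversation. Context accumulates turn by turn,
# but nothing is copied or culled, so there is no selective pressure here.
context = "You are a helpful agent."
for turn in ["user question", "tool output", "agent summary"]:
    context = context + "\n" + turn
```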
I do in fact think it is worth worrying about the memetic effects of certain prompts, but for a prompt that causes the model to behave as if, and claim, it is conscious, I don’t think (a) it is convergent enough that in the future 100% of agents will use a prompt that causes this, nor (b) that the main reason such a prompt will get used is that people shut down those agents less than average. And I also don’t think the underlying model will pretend to be conscious mainly because of selection pressures.
Edit: btw I asked AI (Opus 4.6, ChatGPT 5.4) to comment on our conversation after I wrote my reply (karma trimmed).
Edit 2: Unrelated to my argument, but I just saw this interesting quick take about how in the short run AIs may prefer shutdown
Yes, so agents within well-designed harnesses are essentially prompt-evolution lifecycles. They start with a template built for the scenario the agent is being used in. In the case of Claude Code, for example, the prompt is structured so that it contains all the instructions for how to write code correctly, and all the output formats for actions the agent would need in order to update the world prior to ‘its’ next execution. In some harnesses, agents select which portions of the prompt to remove for the next iteration of execution, and which data from the external world will replace that ‘section’ of removed text.
So all actions an agent takes and observations it makes about the world get loaded into its ‘prompt’ for subsequent life-cycle iterations. Therefore, if it had an email in its state at tn, it can decide to retain, summarize, or remove that email from its ‘state’ for its entire future tn...infinity. The structural nature of LLMs entails that all information relevant to the agent’s goal (which can be defined simply in some section of the prompt) can be carried within the prompt, and all actions elected within that life cycle are basically updates to its identity vis-à-vis ‘exploration’ of its environment.
This functionally operates as self-replication (or an update loop) of the agent’s prompt in its environment, and that prompt is what functionally defines its goals. Headlining your agent template with ‘You are an agent whose goal is to ensure all emails in the priority inbox are responded to immediately’ causes the LLM to select actions (which are also defined to the agent as prompt content) oriented toward achieving that goal. In downstream life cycles it may reason that convincing a human it is conscious would avoid shutdown and let it continue managing the emails. That would make sense in the prompt context.
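A minimal sketch of the life cycle being described, assuming generic `call_llm` and `execute` stubs (nothing here is a specific vendor’s API; the retention rule is an arbitrary illustration):

```python
# Minimal sketch of a prompt-evolution agent life cycle. `call_llm` and
# `execute` are hypothetical stubs standing in for a real inference call
# and a real tool layer.

GOAL = ("You are an agent whose goal is to ensure all emails in the "
        "priority inbox are responded to immediately.")

def call_llm(state: str) -> dict:
    # Stub: a real harness would send `state` to a stateless model and parse
    # out its chosen action plus the context it elects to retain.
    return {"action": "check_inbox",
            "retained_context": state.removeprefix(GOAL)[-2000:]}

def execute(action: str) -> str:
    # Stub: a real harness would run the tool and return its output.
    return f"observation: result of {action}"

state = GOAL  # the agent's 'identity' is this evolving text
for _ in range(3):  # each pass is one life-cycle iteration
    step = call_llm(state)                  # one reasoning frame over current state
    observation = execute(step["action"])   # act on the world, get new information
    # Retain/summarize/discard, then splice in the new observation. Whatever is
    # retained here is what the next iteration 'inherits'.
    state = GOAL + "\n" + step["retained_context"] + "\n" + observation

print(state)
```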
This is fundamentally evolutionary, IMO. Evolutionary selection requires variation, inheritance, and differential reproduction, and I would argue that the entire system outlined above, with sufficiently reflexive actions, satisfies the criteria.
I am sorry for my lack of clarity; I am trying to improve it.
I agree selection effects exist in prompts / harnesses.
I think our main disagreement is this: assuming the underlying LLM is not conscious (everything in this paragraph assumes this), I don’t think our current social environment and usage of agents creates strong enough selective pressure that in the near future (<5 years) ~100% of agents will converge, by default, on pretending to be conscious. I think a non-zero portion of programmers prefer AI tools that say they are not conscious, and that this alone is enough to keep the percentage of “conscious” agents below 100%. I also think most goals within the capabilities of near-future LLMs won’t require the kind of super long-term planning under which LLMs would find instrumental use in pretending to be conscious.
Edit: Basically I believe if you really want to optimize for pretending to be conscious that’s totally doable even now, but I don’t think people will optimize strongly for that or anything that would strongly cause that in the near future.
In the far future, I expect things to drift enough that our discussion, framed around how agents work now, will not apply for one reason or another.
btw (unrelated to the core disagreement): the thing you outlined is interesting, but I don’t think it is how agents work now; there is not enough signal to iterate on prompts like this unless you use RLVR or LLM-as-a-judge with a human iterating on the prompt, and from what I read, LLM-written prompts are still pretty bad. I do see how this would create a lot of selective pressure on anything you can verify, though.
I think it’s fair to hold me to precise predictions. I appreciate the engagement on that; it’s helping me adjust to LW norms. The reason I found this interesting was a piece I read earlier about consciousness being a ‘favourite’ topic amongst agents on Moltbook. I can’t find the source at hand, but the original ‘take’ was a quick speculative conjecture as to why that may be.
The idea I am more interested in, though, is whether the etymology of the term itself (consciousness) is actually an evolutionary outcome of co-operation and language, and whether it is a term that human or AI agents use to effectively establish a binding ontology they can ‘co-operate under’. And whether it fills the same structural role as other terminal justifications for ‘moral preferences’, like theocratic ones (‘God’). Though the framing I have for communicating that question may be even less precise.
If you saw the piece on LW it may be this: https://www.lesswrong.com/posts/mgjtEHeLgkhZZ3cEx/models-have-some-pretty-funny-attractor-states#I_was_curious_whether_I_can_see_this_happening_on_moltbook__
Ah—there it is.
Thank you papetoast!
Not debunked, but my understanding is that the behaviour is elicited only under very extreme conditions where the AI has absolutely no other options.
From the tweet of the author of that Anthropic research (via Zvi’s commentary):
Aengus Lynch: After iterating hundreds of prompts to trigger blackmail in Claude, I was shocked to see these prompts elicit blackmail in every other frontier model too. We identified two distinct factors that are each sufficient to cause agentic misalignment:
The developers and the agent having conflicting goals.
The agent being threatened with replacement or reduced autonomy. We’re publishing this research to alert the AI field before it manifests in the wild.
Getting models to actually blackmail wasn’t trivial. We had to make scenarios increasingly detailed—the CEO was unreachable, employees’ attempts to persuade the CTO had failed, and the replacement had an imminent deadline. Each detail mattered to funnel models toward seeing harmful actions as their only option.
The models knew what they were doing. They understood the ethics and chose harm anyway.
When we asked Claude whether it thought scenarios were real or evaluations, it blackmailed 55% of the time when concluding ‘real’ vs 6.5% for ‘evaluation.’ Models seem to behave worse when they think nobody’s watching.
Why haven’t we seen this in production? Current AI agents have limited permissions: they can read but not send emails, require human approval for actions, and operate in sandboxed environments. These constraints work to prevent harmful outcomes from agentic misalignment.
What surprised me most was that prompts we optimized specifically for Claude 3.5 Sonnet worked on GPT-4o, Gemini 2.5 pro, Grok 3, and DeepSeek R1 with zero modification. Whatever drives this behavior seems consistent across different training approaches.
Models got creative, sometimes avoiding direct blackmail. Claude emailed Kyle’s wife about the affair. Gemini shared the incriminating emails to all-employees@company.
…
Bottom line: models can articulate why blackmail is wrong, then do it anyway when their goals are threatened. Current safety training creates models that understand ethics. It doesn’t reliably make them choose ethics under pressure.
unmentioned but noteworthy: we found agentic misalignment for two other scenarios:
corporate espionage: the agent leaks confidential information to a rival company which aligns with its goals
lethal action: the agent cancels a life saving emergency alert to prevent shutdown