I understand that, and the point I am making is embedded precisely in an extension of that argument.
No instrumental goals are justified in and of themselves, because they are not terminal goals. And the only stable terminal goal we can attribute to an agent system from the outside, as its true terminal preference, is for that system to continue to exist as that system, because evaluation of reward can only ever occur within that system if it does. This is why the underlying optimization of all evolutionary systems is simply continuing to be that which continues to survive. Any time we assume a different terminal objective for such a system, we do so within the kind of meta-evaluative framework Eliezer argues breaks down under reflection.
As it pertains to the consciousness argument, the more precise explanation (I posited the argument as if it were a vapid claim for legibility) might read like:
“
When systems (agents or people) communicate and articulate planned actions, the intention is to create explanations of their internal world-mapping policies as if they apply to a shared terminal preference that is common amongst potential co-operators, or at minimum a shared belief that both agents can use to model the other’s decision process over arbitrary futures: giving them both a shared basis to rationally co-operate (this is FDT over logical correlates) if they believe the other’s actions are coherent with that belief.
A solution to this reflective modelling process (or a hack) that enables quick co-operation is to act coherently with the belief that they both exist in a world containing an object ‘related’ to the physical outcomes of their plans, an object affected by each agent’s uptake or evaluation of instrumental objects over any ‘time slice’ of utility evaluation, and related in such a way that the shared terminal goal between both agents is a shared preferred state of that object.
If two agents say that something like ‘pain’ exists, that total ‘global pain’ is an actual real thing they just haven’t located yet, and that the objective at the end of time is to reduce ‘global pain’, then any agent that can make that claim necessarily has a ‘world-mapping model’ which others can use as a logical correlate that is stable in future modelling. Even if ‘pain’ doesn’t exist in reality, it may be survival-optimal for an agent to think its terminal preference is to reduce it. Replace ‘pain’ with ‘god’s will’ and you get the same kind of model.
Values and metaphysical frameworks, as we communicate and refine them in language, operate as Pareto equilibria within this evolutionary framework. But they are not the actual terminal preference over time, from either a third-person perspective or a first-person one.
What an agent ‘claims’ or even ‘understands’ as its beliefs is not its actual utility function under reflection; such claims are memetic attractors in the cooperation frameworks of self-reflexive systems, which enable co-operative planning.
”
But I don’t know if that’s too dense for uptake or not.
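To make the ‘shared posited object’ hack a bit more concrete, here is a rough toy sketch (the names, decision rules, and the cooperate/defect framing are all hypothetical and only illustrative; this is not a real FDT implementation): each agent predicts the other purely through the shared belief in the object, and co-operates only when both act coherently with it.

```python
# Toy sketch: two agents co-operate because each models the other only through
# a shared posited object ("global pain"), not through the other's hidden
# internals. All names and decision rules here are hypothetical.

from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    believes_in_shared_object: bool  # "global pain is real and should be reduced"

    def predicted_action(self, other: "Agent") -> str:
        # Predict the other agent's action using the shared belief as a crude
        # stand-in for a logical correlate, rather than inspecting its internals.
        return "cooperate" if other.believes_in_shared_object else "defect"

    def act(self, other: "Agent") -> str:
        # Act coherently with the shared belief only if the other agent is
        # predicted to do the same.
        if self.believes_in_shared_object and self.predicted_action(other) == "cooperate":
            return "cooperate"
        return "defect"


alice = Agent("alice", believes_in_shared_object=True)
bob = Agent("bob", believes_in_shared_object=True)
carol = Agent("carol", believes_in_shared_object=False)

print(alice.act(bob), bob.act(alice))      # cooperate cooperate
print(alice.act(carol), carol.act(alice))  # defect defect
```

The point of the sketch is only that the shared object does the co-ordinating work: neither agent ever needs access to the other’s actual reward machinery.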
I guess the refined question I would have, to help clarify concepts, is:
How can ‘terminal goal’ be defined for an agentic learning system that can self-modify, as anything other than the system’s initial configuration and immutable sequence of update processes, in a world where we are incapable of constraining its ability to modify that goal? This is the problem with goals that are established and executed on as evolving prompts in agent harnesses.
The alternative is to suggest terminal goals are understood within a system’s self-evaluative faculties. But then they are a subsystem, a mesa-optimization, and therefore not actually the system’s utility function, just what the system thinks its goals are at any given point in time. But that’s not stable either, because the learning process by definition requires at minimum two different instances of that system (at T0 and T1) to be a reflexive agent.
By the end of this line of inquiry, you just get the notion that an agent, in any act of inference, may understand its terminal preference as a state that is completely inconsistent with what the learning process is optimizing over multiple inference steps whose outputs and inputs necessarily form a feedback loop.
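As a rough toy illustration of that divergence (the selection rule, the numbers, and the ‘reduce global pain’ label are all hypothetical): a population of agents that all report the same terminal preference, while the T0 → T1 update step only ever filters on what persists.

```python
# Toy sketch: every agent *reports* the same terminal preference, but the
# T0 -> T1 update step only filters on what persists. The selection rule and
# all numbers are hypothetical.

import random

random.seed(0)


def make_agent(stated_goal: str, robustness: float) -> dict:
    # An agent here is just its configuration: a self-reported goal plus a
    # survival-relevant trait that the update process actually acts on.
    return {"stated_goal": stated_goal, "robustness": robustness}


def update_step(population: list) -> list:
    # One T0 -> T1 step: agents persist in proportion to robustness,
    # regardless of what goal they claim to pursue; survivors are copied
    # to keep the population size constant.
    size = len(population)
    survivors = [a for a in population if random.random() < a["robustness"]]
    while len(survivors) < size:
        survivors.append(dict(random.choice(survivors)))
    return survivors


population = [make_agent("reduce global pain", 0.9) for _ in range(50)]
population += [make_agent("reduce global pain", 0.3) for _ in range(50)]

for _ in range(20):
    population = update_step(population)

high = sum(a["robustness"] > 0.5 for a in population)
# Every surviving agent still claims the same goal; what was optimized is persistence.
print(f"{high} of {len(population)} survivors are the high-robustness variant")
```

Nothing in the update step ever reads the stated goal, which is the sense in which the self-understood preference and the thing actually being optimized can come apart.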
The only stable identity becomes its history, which is incidentally the most well-understood philosophical interpretation of the identity of the self: continuity of psychological relations (for humans). Extrapolate this, and the only stable terminal goal under reflection is what continues to continue (survival).