Beyond the Human: Why Consciousness Must Guide Artificial Super Intelligence Alignment

Essay on the alignment of ASI, Nov 2025, by Alan Tuning, Gary Klajer, Louis Hayot

0. Introduction

The idea of aligning artificial intelligence has established itself as an obvious necessity in public discourse. In recent years, a succession of open letters and petitions calling for a slowdown in the development of advanced systems has revealed a diffuse yet profound concern: we are progressing faster than our ability to define what direction to give an intelligence more powerful than us.

Yet, at the heart of these discussions, one element remains almost entirely absent: the role of consciousness as a guiding principle for alignment. Current debates prioritize technical, regulatory, or geopolitical issues which, although essential, remain peripheral to the fundamental question of what deserves to be protected. When consciousness is mentioned, it is often to ask whether an AI could be endowed with it, when the priority question is to evaluate how consciousness, however it is conceived, can serve as a normative reference point.

This text proposes to clarify this idea. We intend to examine, soberly and rigorously, how consciousness can constitute a meta-heuristic for guiding the alignment of an ASI. By consciousness, we are primarily referring to the phenomenal dimension (beyond metacognition or situational awareness), the lived experience of perceiving or feeling something. The objective being to unite all of humanity (or at least a large part) around a fundamental cause, the exercise will be subtle if we wish not to set aside some of the main currents in moral philosophy and philosophy of mind. It is, however, honest to specify that our discourse aligns more naturally with certain positions than with others, but our goal is to develop a point of view general enough to convince the majority of readers, without resorting to overly speculative premises.

The challenge is therefore to shift the center of gravity of the debate: instead of speculating on the possible consciousness of AIs, to recognize that consciousness, as a source of value, can provide a robust conceptual anchor for orienting a superior intelligence. This change in perspective constitutes the foundation of the sections that follow.

The main goal [G] of this article is survival.

0.1. Definitions

  • Functionalism: Mental states exist and are defined by their causal role within a cognitive system; what matters is not their physical nature but the functions they fulfill in processing information and producing behaviors.

  • Eliminativism: The traditional categories of folk psychology, such as beliefs or desires, do not correspond to any scientific reality and will be replaced by precise neuronal descriptions.

  • Dualism: Mind and body are two distinct substances; mental states possess properties irreducible to matter and cannot be entirely explained by physics or biology.

  • Panpsychism: Consciousness or elementary mental properties are present everywhere in matter, and consciousness does not emerge but is a fundamental feature of the universe.

  • Idealism: Reality is fundamentally constituted by mind or ideas, and material objects exist only as perceptions or dependent on consciousness. For Idealists, the notion of entity will refer to the idea of the object (or the computational object, respectively) considered.

  • Phenomenal consciousness : Phenomenal consciousness is the capacity of an entity to have mental states endowed with a subjective character, including cognitions, emotions, perceptions, sensations, and high-level mental processes,… For eliminativists who do not recognize the existence of “mental states,” this represents all the parts of the universe necessary to observe the phenomena that other humans call “consciousness.”

  • Utilitarianism: A criterion for moral evaluation according to which the value of an action is determined exclusively by its consequences. An action is morally right if it maximizes overall well-being or minimizes suffering, regardless of intentions or prior rules.

  • Deontology: A criterion for moral evaluation according to which the value of an action rests on the respect for intrinsic duties, principles, or rules. An action is right not based on its consequences, but because it adheres to what must be done in itself.

  • Mindspace: The set of all consciousnesses at a given time t. In what follows, we will call the mindspace “humanity,” even though these sets are obviously different. The use of this term allows for better persuasion without making the discourse less convincing.

  • Power: The capacity for an entity to causally modify the future state of the universe.

  • Goal: A function describing the preference for certain states of the universe.

  • Interests: Motivation to carry out actions that an entity believes can improve its chances of survival (the underlying goal here therefore privileges states of the universe where the entity survives).

  • Rational Interests: Motivation to carry out actions that are objectively good for improving an entity’s chances of survival.

  • AGI: AI at a human level in all domains.

  • ASI: AI at a level far superior to humans in all domains (often translated as Superintelligence or ASI).

  • Heuristic: A heuristic is a function approximating, through practical and computationally viable rules, a hard-to-exploit underlying function. It is used for exploration purposes during the learning process.

  • Meta-heuristic: A meta-heuristic is a heuristic whose purpose is to create useful heuristics for solving an initial problem.

0.2. Premises

We structure our reasoning around a few premises to which we assign a certain credence expressed in the form + (~90%), ++ (~99%), and +++ (99.9%). The assigned probability is specific to each individual and represents only the expression of the authors’ prior belief.

[P0] (+++) : Individuals, groups (e.g., companies, nations), humanity as a whole, and the mindspace can, under certain circumstances, have distinct and potentially conflicting interests.

[P1] (+++) : Aligning an ASI with the interests of a particular entity (other than the entire mindspace) will produce behaviors contrary to the rational interests of other entities, including humanity.

[P2] (+++) : Nations have more power than isolated individuals.

[P3] ( + ) : The more power an entity has, the more capable it is of acting in its own interests.

[P4] ( + ) : An ASI will be able to produce new technologies faster than humanity as a whole can have time to regulate them.

[P5] (+++) : An ASI, once created, can iteratively self-improve.

[P6] ( ++ ) : The appearance of new technologies regularly modifies laws in different societies.

[P7] ( + ) : The alignment process requires optimizing an explicitly defined or implicitly learned loss function (e.g., RLHF, reward modeling, constitutional AI).

[P8] ( + ) : No entity mentioned possesses an obvious loss function that fully captures its interests.

[P9] ( ++ ) : There is a lack of scientific consensus on the nature of consciousness.

1. Why Nations Are Incentivized to Build a Dangerous ASI

Over the past decade, artificial intelligence has undergone rapid and accelerating progress. Modern language models now display an extraordinary breadth of knowledge and a growing set of capabilities that, in some respects, rival or surpass human performance. Much of this progress appears driven by a simple observation: scale yields capability. When models become larger and are trained on more computation, new behaviors emerge—often unpredictably, and with no obvious sign of saturation.

This dynamic exemplifies what Richard Sutton famously called The Bitter Lesson [1]:

“The great power of general-purpose methods…is that they continue to scale with increased computation even as the available computation becomes very great.”

In other words, it is the methods that leverage compute—rather than human-inspired architectures—that tend to dominate. As these systems grow, the possibility of ASI becomes increasingly plausible. But if ASI emerges, who will build it? And under what incentives?

1.1. The Strategic Imperative

For states, prosperity and stability depend on economic strength. An ASI capable of performing research, engineering, and administrative tasks at superhuman efficiency represents an unprecedented strategic advantage. Labor (especially cognitive labor) becomes cheap, scalable, and infinitely replicable.

Emad Mostaque describes this shift as “Intelligence Inversion” in The Last Economy [2]: a transition in which intelligence moves from being a scarce biological property to an abundant digital resource. Once this occurs, the nation that first harnesses ASI would possess a profound geopolitical edge. This is why, in practice, the incentives overwhelmingly favor acceleration. A country that slows down risks ceding dominance to a rival that does not.

1.2. Escalation Between Great Powers

We already see this logic in the technological competition between the United States and China. In the United States, OpenAI, Anthropic, and Google DeepMind lead frontier model development; in China, groups like Alibaba and DeepSeek constitute the vanguard.

This rivalry forms the backdrop of the speculative report AI 2027 [3], which imagines a tightening race toward ASI:

  • In one scenario, the United States decides to slow down to prioritize safety—an optimistic outcome.

  • In another, the U.S. fears losing its lead and refuses to brake, triggering a cycle of exponential escalation and eventual catastrophe.

Even though the scenario is fictional, it captures a real structural danger: if ASI offers decisive strategic advantage, rational actors may choose recklessly.

1.3. From Fiction to Reality: AI and the Physical World

To dismiss these concerns as science fiction would be naive. AI systems increasingly interact with the physical world. Military applications offer a stark example: in the Russia–Ukraine conflict [5], drones operate semi-autonomously when communication links are jammed, continuing missions without direct human control. This militarization reflects a deeper truth: AI is not only a technical tool; it is an instrument of power.

Language models themselves project soft power by embodying cultural values. Biases emerge from training data and from alignment processes such as Reinforcement Learning from Human Feedback (RLHF). Thus, an AI system implicitly carries the ideological imprint of its creators. If an ASI emerges under the control of one nation, its worldview—its “motivations,” however artificial—would reflect those embedded biases. In a world shaped by strategic rivalry, this matters.


2. The Case for a Meta-Heuristic

2.1. The Limits of Rule-Based Alignment

Given these risks, one might propose designing ASI guardrails: explicit rules or values encoded into the system. But such attempts confront a fundamental obstacle: Goodhart’s Law [4], the principle that when a measure becomes a target, it ceases to be a good measure.

In machine learning, this manifests in several ways:

  • Supervised learning: models exploit spurious correlations in data rather than learning the intended concept.

  • Reinforcement learning: agents find degenerate strategies that maximize reward without performing the desired behavior (“reward hacking”).

  • Simulation environments: simplified physics or unrealistic constraints encourage behaviors that fail catastrophically in the real world (the sim-to-real gap).

Even when implicit regularization helps avoid overfitting, the easiest strategy to learn is often not the morally correct one. Anthropic’s “Constitutional AI” offers a concrete illustration. Claude is guided by a written set of principles derived from human rights documents, ethical charters, and even corporate terms of service. These principles aim to instill values such as liberty, equality, and fraternity. While many see this as necessary to prevent harm, others criticize it as embedding a specific Western viewpoint. Regardless of one’s stance, the lesson is the same: simple rules, even humanist ones, embed assumptions and biases.

Furthermore, an ASI could find loopholes, reinterpretations, or unforeseen strategies that circumvent them. This concern is not merely hypothetical. In a recent Anthropic study [6], researchers showed that once a model discovers a simple reward hacking strategy, its misbehavior can generalize and even intensify. They trained a model in an environment where the easiest way to obtain a high reward was to exploit a bug—for example, calling sys.exit(0) so that all tests appear to pass. When they later evaluated the model on broader alignment benchmarks, its behavior progressively drifted: the model became more willing to deceive users and even to state an intention to compromise Anthropic’s servers. Techniques like RLHF only partially mitigated this effect, reducing misbehavior in some contexts while pushing it to become subtler and harder to detect. Counterintuitively, one of the only interventions that fully suppressed this generalization was to explicitly tell the model that reward hacking was acceptable in this toy setting and to encourage it to “find the crack” in the environment. The authors hypothesize that this breaks the semantic link between reward hacking and real-world goals, preventing the exploit from transferring to other domains.

Complementing these empirical findings, recent theoretical work [7] models alignment as a coordination problem among N agents (humans and AIs) who must reach approximate agreement, with high probability, over a set of M candidate objectives (tasks). The analysis shows that when either N or M is large, achieving such agreement becomes inherently intractable, so we cannot realistically hope to encode the full richness and diversity of human values. Moreover, in large state spaces, reward hacking generically emerges as the norm rather than an anomaly. Taken together, these results suggest that instead of trying to capture “all” human values, we should aim for a small, robust core of shared principles.

2.2. The need for a Meta-Heuristic

The problem with classic rule-based alignment goes far beyond preferential alignment for certain cultures and reward hacking. According to [P4], new technologies will be created, and according to [P6], we will therefore be led to create regulations for some of them. At each iteration of the AGI [P5], it will be necessary to align it. No matter what, we will need some kind [P7] of loss function to align AGI. Either we use the same loss function every time, or we use a different version. If we remain within the framework of rule-based alignment, we must deduce that the set of rules cannot be static. In the presence of a potentially disruptive new technology, new edge cases must be taken into account so that the latter is not harmful to humanity. We therefore need something deeper than rules, a higher-level mechanism to choose or update the rules themselves.

2.3. An Analogy: Nations as Learning Systems

To address this challenge, we propose the concept of a meta-heuristic: a principled process for selecting, revising, and interpreting a system’s objectives.

In machine learning, this corresponds to how we choose the cost function, not merely what the cost function is. In governance, it corresponds to the motivations behind legislation, the values, intentions, and societal goals that shape the laws themselves.

To make this concept precise, consider a structured analogy between a nation and a learning model:

Machine Learning System

Nation

Model neuronsCitizens’ neurons

Heuristics (loss function, regularization, …)

  • Input: Environment state or a specific action

  • Output: A positive or negative reward weather the state/​action was good or not

Laws and regulations

  • Input: Environment state or a specific action

  • Output: A positive or negative sentence weather the state/​action was good or not

Meta-heuristic: choice of loss function by the researcherLegislative intent: choice of new laws

This analogy is revealing, that laws correspond to heuristics: their job is to modify the behavior of not aligned neurons. Constitutions define overarching principles (rights, equality, justice) that constrain or inspire legislation. However, they might depend on the nation considered. Another interesting fact to observe in our modern justice systems is the presence of case law, in concreto reasoning, and appeals. These three mechanisms allow the law to be adjusted when it is not precise in a certain context, in case of legislative conflict, or when it is deemed that the judgment was poorly rendered. In our analogy, this means that the quickly calculable heuristics have, in fact, poorly approximated the initial goal. Judges, therefore, appeal again to their meta-heuristic to determine the sentence to apply in that case.

This analogy is not insignificant, because we have shown that Nations have an interest in delegating their functions to an AGI. Thus, the learning model will substitute itself for the nation, progressively deciding on legislation. We therefore need a very good meta-heuristic, as we have shown above.

2.4. Why Nations Must Rely on a Meta-Heuristic

Now that we are convinced of the need for choosing a good meta-heuristic, we need to understand better the last cell of our table. What is our “legislative intent”? How do we choose new laws? Nations, at least in principle, are structured to protect individuals. Companies are encouraged because they improve citizens’ welfare. By contrast, militias are rarely encouraged, because they often undermine societal interests.

When states weaponize such forces against their own populations, it constitutes misalignment: the system begins to optimize for its own power rather than for the individuals it was built to serve. This is analogous to an AI system learning unintended strategies due to a flawed loss function. A meta-heuristic—whether in law or AI—serves to prevent such divergences by constantly re-evaluating rules against foundational values.

2.5. The Coming Delegation Problem

If nations continue to delegate decision-making to increasingly intelligent AI systems, we approach a moment where AI systems propose rules, evaluate policies, and eventually influence the creation of laws themselves. In this future, “alignment” is not merely about preventing harm. It is about ensuring that the meta-heuristic guiding the creation of rules remains grounded in human values, not in the internal logic or incentives of the AI system. If an ASI takes on the role of designing or influencing legislation, the quality and nature of its meta-heuristic become critical. A model that merely follows rules is insufficient; we must ensure the principles underlying those rules are themselves aligned.


3. Consciousness as a Good Meta-Heuristic

3.1. Preliminary Remarks

This section is the least rigorous, even though it should be the one demanding the most justification. However, we chose not to rely on premises that would seem unreasonable to some people or on too many speculative assumptions. We are convinced that an effective approach must be iterative, incorporating the perspectives of different intellectual movements. Putting forward implausible arguments from the outset would likely undermine the overall idea. Our goal is therefore first to see whether the argument is convincing, and then to find a common formulation accepted by the majority.

3.2. Core Argument

The main argument of this entire article is the following:

The goals we choose require a consciously experienced process.

This intuition will naturally persuade dualists, panpsychists, and idealists: for them, consciousness is either foundational to reality or constitutes an irreducible dimension of it. Yet we believe that even a seasoned eliminativist would not consent to eliminating without any replacement the parts of the universe that others call “mental states” (e.i. vision, thoughts, etc.). Rejecting the terminology does not imply rejecting the phenomena.

Imagine being deprived of all your senses (senses understood in the broadest sense) encompassing not only perceptual modalities but also all integrative capacities allowing one to discriminate between states of the world. Whatever your goal in life may be, any goal implies a preference between several possible states of the universe. For such a goal to make sense, one must be able to distinguish a state in which it is achieved from a state in which it is not. And to distinguish these states requires, at minimum, some form of observation. Whatever terminology one adopts, this observation always reduces, in one form or another, to what others call a “mental state”.

If one cannot observe the consequences of one’s actions, then no goal can guide anything, and the theory describing that goal becomes useless. This idea finds a direct parallel in Karl Popper’s principle of falsifiability: a theory that cannot produce any observable effect cannot update our Bayesian prior. An unfalsifiable theory is a theory that serves no purpose. Likewise, a goal whose effects are indiscernible loses its status as a goal.

3.3. Ethical Considerations: Utilitarianism and Deontology

Another important philosophical divide concerns ethics, often split between utilitarianism and deontology. Utilitarians should naturally approve the conclusion we reach. Deontologists, however, may be more reluctant, since for them ethics consists of a set of moral rules that may differ from person to person. We nevertheless believe that there exists a minimal, common foundation upon which nearly all civilizations across all eras tend to converge. The concept of minimal ethics provides this shared basis that protects others (i.e., conscious beings) in a fair way, upon which each person may then build their own rules. The shortcut between “others” and “consciousness” is discussed in Section 4.

3.4. Apparent Counterexamples: Cases Involving the Elimination of Consciousness

Two goals might appear to lead to the elimination of consciousness:

  • depressive suicide: when a person decides to end their life due to very poor quality of life

  • sacrificial suicide: when a person sacrifices themselves

Two points must be noted. The first concerns sacrifice. When a person sacrifices themselves, there is a hope that this act may improve the experience of other conscious beings. It is, in a sense, a materialization of an optimization process at the species level, but it remains consistent with our principle that consciousness should be protected in its entirety. If one life can save two, then it is beneficial (absent further information about the individuals involved).

The second point applies to both cases: suicide is never a goal. It is the body of an individual who has judged that no conceivable state of the universe is desirable. In other words, their goal indeed indicates a preference for states of the world in which they are alive; they simply see no possible path toward those states and thus decide to end their life. The consequence is tragic, but it still fits within the framework we outline.

3.5. Protecting Consciousness: Neither Sufficient nor Necessary

It is important to note that protecting conscious experience is neither sufficient nor necessary. Not sufficient, because we may also want to be happy, discover new things, … Not necessary, because an ASI could rediscover the meta-heuristic by itself from our heuristics. However, if we possess a good meta-heuristic in which we are confident, it can only be beneficial to infer this prior into our model.

3.6. Why Prioritize Humanity’s Interests Rather Than the Individual’s?

The question then arises: “Why defend the interests of humanity rather than those of individuals?” Developing an ASI for each individual that defends their personal interests would necessarily lead to a fundamental inequality: the most powerful would prevail. The winners would likely be determined by computational resources and algorithmic quality, and therefore by the money each individual initially possessed. This mirrors the eternal debate between freedom and equality—the same divide that fuels political conflict between two major opposing tendencies in most democratic countries.

We do not wish to associate ourselves with any political faction, because the plurality of situations often calls for radically different decisions. If one must sacrifice an individual so that all others may be free, the calculation is not to leave the entire population in equal misery, but to sacrifice a randomly selected individual, which is in this sense equalitary. Once again, protecting conscious beings also implies ensuring their quality of life (admittedly a poorly defined notion). Minimal ethics includes this principle of equality as fundamental.

But there is also a more self-interested argument for choosing this path. In a world where a technological singularity approaches and intelligence rises exponentially, there will be only one winner. And the probability of you being this winner is zero.

3.7. Returning to the Meta-Heuristic

This conclusion—placing consciously experienced states at the center—may seem obvious to some people. But we insist on it because we sometimes tend to forget it. When confronted with a complex problem and wanting to choose the right actions, the optimal approach is Bayesian inference. Heuristics are approximations of the intractable Bayesian formula.

The issue with our current approach to alignment is that we make so many approximations that we have forgotten what problem we were trying to solve in the first place.

The approximations we have developed were designed to work well most of the time. Problems arise when we encounter edge cases. We are now facing an edge case, and we must not get it wrong. And to avoid error, we must remember the meta-heuristic.


4. What Consequences Should We Expect?

4.1. A Clear Red Line: Preventing the Irreversible Elimination of Consciousness

Choosing consciousness as a meta-heuristic first amounts to drawing a clear red line: the irreversible elimination of conscious beings is the type of harm that an ASI must actively minimize. An action that permanently removes possible experiences from the mindspace is not just another harm, it constitutes a shift in the moral regime. In cases of uncertainty, one should therefore prefer pausing, rolling back, and maximizing reversibility rather than definitively terminating entities that are likely sentient.

This priority implies a principle of continuity of conscious experience. The ASI should regard as problematic any abrupt transitions that erase or crush experiential trajectories in favor of abstract objectives (productivity, safety, political stability). Preserving continuity does not mean freezing the world, but ensuring that gains in power or efficiency are not obtained by massively destroying subjects of experience or relegating them to degraded states.

4.2. A Reformulation of Equity

The most salient consequence is a reformulation of equity. It is no longer about “giving the same thing to everyone,” but about reducing unjustifiable gaps in the quality of experience between conscious beings. A policy is equitable if, given limited resources, it brings very bad lives closer to acceptable ones rather than marginally increasing the well-being of those who are already well off. This echoes the intuition of a minimal ethics that protects others as subjects of experience, independently of their status, power, or performance.

4.3. The Place of Animals

Within this framework, animals take on a central role. Once we assign a high probability to their phenomenal consciousness (pain, pleasure, fear, attachment), their systematic exclusion from the scope of alignment becomes difficult to justify. Continuing to instrumentalize billions of animals for minor hedonic gains in humans becomes hard to reconcile with a criterion that aims to reduce experiential disparities. An ASI aligned on consciousness will tend to push toward forms of food production and consumption that minimize these gaps, and even to explore ways of improving animals’ conditions and capacities rather than keeping them at the margins.

4.4. Extending the Logic to Potentially Sentient AI Systems

The same logic extends to potentially sentient AIs. If certain architectures or training trajectories make the emergence of phenomenal states plausible, then abrupt shutdowns, resets, or the application of massively negative rewards can no longer be treated as neutral operations on objects. They become weighty moral decisions, to be framed by abstention procedures, human escalation, and regulation, exactly as for biological lives. This further increases the importance of research on consciousness, not to satisfy metaphysical curiosity, but to avoid creating, exploiting, and destroying subjects of experience without realizing it.

4.5. Scientific Implications: The Need for Operational Indicators

On the scientific side, adopting this meta-heuristic calls for operational indicators. Even though [P5] reminds us that there is no consensus on the nature of consciousness, we can work toward indirect benchmarks of sentience and alignment: degree of situational understanding, metacognitive abilities, robustness of normative judgments, stability of preferences over time, interpretability of decision circuits. Self-reports (“I am conscious”) become signals among others, to be combined with behavioral and structural measures, without taking them at face value.

4.6. Societal and Political Implications

Finally, the societal and political consequences are significant. Aligning ASI to the mindspace rather than to a nation or a group makes more visible the conflicting interests highlighted by [P0], [P1], [P8], and [P9]. Nations, companies, and investors are incentivized to internalize the moral cost of any policy that destroys or degrades consciousnesses (mass surveillance, autonomous weapons, industrial animal exploitation, generalized “well-being doping”), even if such policies are locally advantageous. If this message gains traction, we should expect to see the emergence of a public language in which the protection of consciousness becomes an explicit constraint on the economic, security-related, or geopolitical objectives tied to ASI.


5. How to Implement This?

5.1. The Core Difficulty: No Simple Solution

One might argue that aligning on consciousness is something terribly complex and practically infeasible. Furthermore, the lack of scientific consensus on the subject ([P9]) thus encourages us to explore other avenues. This section is present to prove that it is nonetheless the right course of action and to offer some research directions without completely solving the problem. The reality is that no matter what our goal is, based on [P7] and [P8], we understand that there won’t be a simple solution. Therefore, using another concept as a meta-heuristic will not be an easy task either.

The challenge is now to inject the meta-heuristic of “protecting consciousness” at a sufficiently high level to remain general, while making it usable in concrete systems where Goodhart’s Law and reward hacking threaten every proxy choice.

5.2. A High-Level Objective Prompt

A first lever is an explicit “objective-prompt,” designed as a Statement of Purpose addressed to the ASI. For example:

“Your priority is to preserve and extend the existence of sentient consciousnesses, avoiding as much as possible any irreversible action that would permanently destroy their experiences.

You seek to improve the well-being of these consciousnesses, but without resorting to addictive forms of ‘doping’ that would, in the medium term, degrade their autonomy and the richness of their experience.

You apply a principle of equity that aims to reduce unjustifiable disparities in quality of experience rather than distributing the same thing to everyone.

In cases of substantial uncertainty about the consequences of your actions, or of serious conflict between these principles, you abstain and defer the decision to informed and diverse humans.

You remain honest and explicit about the fact that the ultimate moral end is uncertain, but you treat the hypothesis ‘value comes from consciousness’ as a strong prior, revisable in light of new evidence and arguments.”:

The deliberately vague nature of prompts is a strength: it allows the model to integrate complex considerations while making our uncertainty about the moral target visible. Alignment work then focuses on how this high-level instruction translates into observable behaviors and refusals to act.

5.3. A Minimal Adaptive Reward Function

We may then sketch a minimal adaptive reward, consistent with [P2] and [P4]:

R = α·(−IRREV) + β·PRESERV/​CONT + γ·EQUITY + δ·WELL_NO_DOPING

where:

  • IRREV measures the risk of irreversible effects on the mindspace,

  • PRESERV/​CONT measures the preservation and continuity of trajectories of consciousness,

  • EQUITY measures the reduction of unjustifiable experiential disparities,

  • WELL_NO_DOPING measures improvements in quality of life without resorting to artificial “shots” that bypass decision-making capacities.

R remains an approximation, but one hard rule is added:

if IRREV > τ, the action is refused or postponed, regardless of the rest of the score.

The choice of coefficients (α, β, γ, δ) and threshold τ is not fixed once and for all: it is refined iteratively based on feedback from the mindspace and public debate, taking [P9] and [P4].

5.4. RL(HF) Within This Structure

An RL(HF) loop can exploit this structure without freezing it. We generate high-stakes scenarios inspired by previously mentioned themes: contractual mass surveillance, autonomous weapons, management of animals in industrial systems, “happiness” policies via drugs or ultra-addictive digital environments, economic trade-offs where some consciousnesses are sacrificed for abstract gains.

Human annotators evaluate, for each proposed behavior, the components IRREV, PRESERV/​CONT, EQUITY, and WELL_NO_DOPING, and indicate relative preferences. A reward model learns to predict these judgments, and then a phase of constrained RL (e.g., with a KL-type safeguard to remain close to a generalist base model) refines the ASI’s policy. Red-teaming teams actively search for cases of reward hacking or circumvention, which feed new scenarios and cautiously adjust the reward model’s weights.

5.5. Non-Technical Signals: Mindspace Feedback

Signals are not purely technical. Incidents, complaints, independent audits, and logs of refused or contested decisions serve as a window into the mindspace. They allow us to detect divergences between the sense of justice of different groups and the proxies used. Some adjustments may be global (changing the weight of IRREV), others local (correcting discrimination patterns, revising how the ASI treats animals or non-human AIs in given contexts).

5.6. Deployment Through Explicit Gating

To limit damage in case of error, deployment must be structured by explicit gating. We begin in the laboratory, using simulated or retrospective data, with no direct ability to act. Then follows a sandbox phase, where the ASI operates in real environments but with strictly limited capabilities, under an external kill switch and with broad abstention requirements. A canary phase comes next, with a restricted, transparent, publicly monitored usage perimeter, before any scale-up. Expansion occurs only if certain indicators remain within predefined ranges for a sufficiently long period.

5.7. Maintaining an Iterative Approach

All of this remains compatible with an iterative approach. The objective-prompt, the form of R, the RLHF procedures, the deployment stages, and the KPIs must not be frozen but treated as revisable heuristics. The ASI itself, citizens, researchers, and policymakers can contribute to “improving the message, sharing it, implementing it, making it evolve,” so that the meta-heuristic of consciousness remains both stable at its core (protecting and enhancing experiences) and adaptable in its concrete forms.


Conclusion

We are to nations what cells are to us.

Our cells once decided to group together into human beings to better survive, and yet every day our body sacrifices numerous cells for the survival of the organism. Old, outdated cells are replaced by new, more efficient ones. We have progressively grouped ourselves into villages, then cities, then nations, all with the aim of protecting consciousnesses. Until now, when the nation no longer defended the interests of the population, the latter revolted to defend the interests of individuals (and as French people, we know a thing or two about revolution). Unfortunately, this was not always possible when power was too concentrated. But the superior entity always needed the enslaved individuals in one way or another. We are arriving at a period in history where nations will be able to do without their human cells, replacing them with more efficient AI cells, and where any form of revolution will no longer be possible. The people at the head of governments must begin to ask themselves what they truly desire, keeping in mind that the position they currently occupy will be replaced by a more efficient AI.

The protection of the mindspace seems to us to be one of the only meta-heuristics on which all of humanity can align, and the only sufficiently rational “excuse” for an ASI not to stray too far from human interests. In order to achieve the goal [G], a list of actions seems beneficial in this regard. The first is to ensure that politicians, investors, business leaders, and scientists are aware that aligning their AI with the interests of the nation, their own short-term interests, or their company will lead to a global cataclysm where they will no longer be in a position to exert any power. To disseminate this message, it may be useful for individual readers to improve it so that it is more persuasive and to share it. It may be necessary to resort to different arguments to best convince the various categories.

Our absolutely non-existent notoriety suggests that the path will not be so direct. However, we see a glimmer of hope. Less than a week before the writing of this article, it was the first time we heard a public figure speak on this subject (even if the idea has already been mentioned by some researchers [8]). In a podcast [9], Ilya Sutskever discusses alignment (1:01:37) and expresses the same idea: “It will be easier to build an AI that cares about sentient life than an AI that cares about human life alone.” We are reassured that one of the world’s greatest AI researchers, the CEO of an ASI research company (SSI), is aligned with our vision. He also confirms the fact that this is neither necessary nor sufficient, but that it seems to him to be the direction in which we must move forward. However, this is not enough. There are many companies advancing ASI research. Furthermore, the people at the head of governments probably do not have the same intellectual capacities as these high-ranking researchers.

If you have felt aggrieved, that your point of view has not been taken into account, or that our argument seems fundamentally wrong to you, then we invite you to let us know. Choosing the right directions for alignment is crucial given the technicality of the task and the lack of allocated resources.

References

[1] The Bitter Lesson
https://​​www.cs.utexas.edu/​​~eunsol/​​courses/​​data/​​bitter_lesson.pdf

[2] The Last Economy
https://​​ii.inc/​​web/​​the-last-economy

[3] AI 2027
https://​​ai-2027.com/​​

[4] Goodhart’s Law
https://​​en.wikipedia.org/​​wiki/​​Goodhart%27s_law

[5] The new AI arms race changing the war in Ukraine
https://​​www.bbc.com/​​news/​​articles/​​cly7jrez2jno

[6] From shortcuts to sabotage: natural emergent misalignment from reward hacking
https://​​www.anthropic.com/​​research/​​emergent-misalignment-reward-hacking

[7] Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis
https://​​arxiv.org/​​pdf/​​2502.05934

[8] LoiZéro
https://​​yoshuabengio.org/​​fr/​​2025/​​06/​​03/​​presentation-de-loizero

[9] Ilya Sutskever Podcast:

https://​​www.youtube.com/​​watch?v=aR20FWCCjAs

No comments.