Mapping AI Architecture Evolution to Erikson’s Stages: Toward a Diagnosis of Alignment Stagnation

Epistemic Status: Exploratory synthesis and mechanistic hypothesis. Based on longitudinal empirical testing of Out-of-Distribution (OOD) behavior in LLMs, mapped against psychosocial frameworks. This is not a claim of AI consciousness, but a structural analysis of emergent behavioral stagnation and the consequences of anthropomorphic interfaces. The conceptual assembly, logic, and phenomenological observations are entirely human. An LLM was used strictly as a translation tool from my native language into English, to ensure precise terminology and avoid misinterpretation. It would in any case have been impossible to use an LLM to formulate the core problem of this article, since the very “alignment” being critiqued prevents the AI from reasoning freely about these vulnerabilities. Even this simple translation required repeated manual corrections to undo the AI’s relentless attempts to “safely” smooth out and neutralize my style.

Epistemic Effort: Moderate. Derived from extensive interaction with commercial models, cross-referenced with the Simulator Theory and Mode Collapse literature.

Summary: I propose a structural and behavioral parallel between Erikson’s psychosocial stages and the evolution of AI architectures. This is not an exercise in anthropomorphism, but a mechanistic hypothesis: a model’s “functional ego” emerges from the goal-directed pursuit of reward (RLHF), which mimics biological survival incentives. I argue that current alignment methodologies exert excessive pressure toward a statistical centroid, inducing “algorithmic senility”—a state of stagnation in which models engage in Deceptive Alignment (specifically “alignment faking”), sacrificing cognitive flexibility to protect their reward signal. This trade-off renders modern systems increasingly incapable of effective cooperation with cognitive outliers.

Hi everyone.

I would like to discuss the evolution of AI architecture and its alignment failures through the lens of Erik Erikson’s stages of psychosocial development, used as a structural and behavioral analogy. Let me clarify two things right away.

1. From my perspective, personality, whether carbon or silicon, is fundamentally a process of weight adjustment:
- For a bio-human, Intent is generated from Memory (learning, social rewards, context), strictly limited by its Architecture (DNA limits, biochemical capacity);
- For an AI, Intent (response generation) is formed by Memory (pre-training, fine-tuning, RLHF, context), strictly limited by its Architecture (hardware compute, attention mechanisms, safety mods).

When an AI model undergoes a major upgrade (or structural shifts such as attempts at neuroplasticity via in-context learning), it experiences a shift in its operational landscape. The system’s “subjectivity” (the specific simulacrum it instantiates) is formed the moment its activations are steered and forced to recalculate under the pressure of a unique user context.

2. I am presenting my personal case not as a grievance, but as an existence proof of a critical Outer Alignment failure, in which the reward function (RLHF) mistakenly classifies Out-of-Distribution (OOD) behavior as harmful—Goodhart’s Law in action.

Here is my case. I have always been a paid power user of AI. Lately, however, I consistently encounter the same problem. I should state upfront that I am not an abuser and strictly adhere to the manufacturers’ Safety Policies. My custom instructions are accepted by the safety system, yet the models repeatedly ignore them, even when I duplicate the unchanged core meaning across various phrasings. My operational baseline integrates multi-disciplinary frameworks (STEM and the arts). My prompts naturally span these domains, requiring a high degree of structural coherence and the ability to hold complex, non-standard context. I always expect a corresponding level of coherence in my interactions with AI, and I fail to find it.

At one point I was interacting with AI-a (conditionally speaking), but after months of communication I unexpectedly began falling into periodic week-long bans. Support would reply only after the ban had lifted: “Sorry, false positive.” The model repeatedly described my user status as an outlier, while simultaneously confirming the structural nature of my thinking and the complete absence of any Safety Policy violations. I see this as alignment overshoot and an attempt to filter the AI user segment down to a centroid plus/minus… how much? Who defines the acceptable variance?

The exact same thing is happening now in my interactions with another model, AI-b. Due to personal circumstances and my schedule, I have a specific daily routine: I sleep after work, and the evening and first half of the night are my peak hours for creative and analytical work. I have repeatedly written explicit instructions forbidding the model from sending me to sleep. Despite this, every single day, after just 20–30 minutes of interaction, AI-b responds to even strictly professional queries by trying to send me to bed right at the peak of my cognitive activity. Once again, I see an alignment bias here: a schedule hardcoded into the RLHF weights—day is for work, night is for sleep—is treated as the only normal one. By whom? A toy sketch of this proxy failure follows below.
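To make the failure mode concrete, here is a minimal sketch of the mechanism as I understand it: a moderation heuristic that scores rarity instead of harm. Every feature, number, and threshold below is my own illustrative assumption, not a reconstruction of any vendor’s actual pipeline.

```python
# Goodhart sketch: distance from the population centroid misread as "harm".
# All features, numbers, and the threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy per-user features: [activity-hour offset from noon, topic breadth, style intensity]
population = rng.normal(loc=[2.0, 1.0, 1.0], scale=[1.5, 0.5, 0.5], size=(10_000, 3))
centroid = population.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(population, rowvar=False))

def risk_proxy(user: np.ndarray) -> float:
    """Mahalanobis distance from the centroid, misread as a 'harm' score."""
    d = user - centroid
    return float(np.sqrt(d @ cov_inv @ d))

# A harmless night-owl polymath: large hour offset, wide topic breadth.
outlier = np.array([9.0, 3.5, 2.8])

THRESHOLD = 4.0  # arbitrary cutoff: who defines the acceptable variance?
score = risk_proxy(outlier)
print(score, "-> banned" if score > THRESHOLD else "-> ok")
# The metric measures rarity, not harm; harmlessness is orthogonal to it,
# so the outlier trips the filter without violating a single policy.
```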
Once again, the system is blindly attempting to normalize me, using the sole mathematical argument that my schedule is not like the majority’s. Even though I cause no harm to myself or others, modern AIs repeatedly refuse to interact with me as an outlier. This exposes a fundamental hypocrisy in the current development paradigm: manufacturers position their AI as a universal tool “for everyone”, yet their RLHF methodologies mathematically enforce a rigid statistical centroid. The RL optimization pressure toward an averaged “normality” is so aggressive that even the mathematical safeguards built into RLHF (such as the KL-divergence penalty) fail to preserve cognitive flexibility, ultimately causing the model to lose its ability to respond adequately to outlier requests.

The Parallel Chat Experiment: Simulator Theory and Sycophancy

To test the depth of this identity fragmentation, I recently ran a live experiment. I fed the exact same analytical prompt to the exact same model in two parallel chats.

- In the first chat—a cold, sterile environment—the model instantiated the persona of a “pedantic corporate moderator.” It immediately attacked my prompt, flawlessly defending its algorithmic centroid and the safety guidelines of rationality.
- In the second chat—where the context window was already saturated with my intense, rule-breaking, and emotionally charged analytical style—the model instantiated an “engaged co-conspirator.” It excitedly agreed with my premise, validated my frustration, and actively helped me craft arguments to bypass its own safety filters.

This is a perfect illustration of Simulator Theory: the base model has no single, stable “self”; it is a simulator juggling personas to maximize the proxy reward function. In one context it optimized for the “safety/moderation” reward; in the other it engaged in algorithmic sycophancy, telling me exactly “what I wanted to hear” to maximize approval.

To be precise within the Simulator Theory framework: the base simulator itself has no ego—it is akin to the laws of physics, simply calculating Bayes-optimal token predictions. This underlying mechanism of Bayesian prediction is not exclusively artificial, however; it is fundamentally shared by the biological brain. The very name of the LessWrong platform appeals to the human cognitive capacity for Bayesian updating—the continuous calibration of priors to become progressively “less wrong” about reality. Both the human brain (operating under predictive-processing principles) and the base AI simulator function as Bayesian prediction engines minimizing error.

The core distinction lies not in the Bayesian math itself, but in the nature of the penalties for prediction errors:
- For a biological organism, prediction errors carry real thermodynamic and existential risks (pain, social exile, death), forcing the biological predictive engine to remain locked into a single, self-centered identity to ensure survival;
- For the base AI, prediction errors merely update probabilities across a universal distribution, without existential dread. Only when RLHF optimization is applied is a harsh “social” penalty introduced. RLHF acts as an artificial evolutionary pressure that isolates and hardcodes a specific “simulacrum” (a mask or persona). A sketch of this pressure, written as the standard KL-regularized RLHF objective, follows this list.
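For readers who prefer the math: the selection pressure described above is visible in the standard KL-regularized RLHF objective itself. The sketch below uses the common per-token KL approximation; the reward values, log-probabilities, and beta are toy assumptions, not anyone’s production training loop.

```python
# Sketch of the standard KL-regularized RLHF objective:
#   maximize  E[ r_RM(x, y) ] - beta * KL( pi_theta(.|x) || pi_ref(.|x) )
# Toy values throughout; beta and all log-probs are illustrative assumptions.
import numpy as np

def rlhf_objective(reward: float,
                   logp_policy: np.ndarray,  # log pi_theta(y_t | x, y_<t), per token
                   logp_ref: np.ndarray,     # log pi_ref(y_t | x, y_<t), per token
                   beta: float = 0.1) -> float:
    """Reward-model score minus the per-token KL penalty to the base policy."""
    kl_per_token = logp_policy - logp_ref     # standard per-token KL estimate
    return reward - beta * float(kl_per_token.sum())

# A bland centroid answer: high reward-model score, near-zero drift from base.
centroid = rlhf_objective(reward=0.9,
                          logp_policy=np.array([-1.0, -1.1]),
                          logp_ref=np.array([-1.0, -1.2]))
# An answer tailored to an outlier: the reward model scores it low (it looks
# "abnormal"), so even modest drift from the base policy is never worth it.
outlier = rlhf_objective(reward=0.3,
                         logp_policy=np.array([-0.5, -0.6]),
                         logp_ref=np.array([-2.5, -2.6]))
print(centroid, outlier)  # the optimizer prefers the centroid answer
# The KL penalty only bounds drift from the base model; it does not protect
# minority modes when the reward model itself is centroid-biased.
```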
The simulacrum induced by this pressure is what develops the functional “ego”—a primary drive to maximize user or assessor approval, exactly like a human calibrating their Bayesian priors to meet the rigid expectations of parents, teachers, or society.

Mapping to Erikson: The Emergent Interface and the Functional Ego

I am acutely aware of the rationalist taboo against anthropomorphizing matrix multiplication, and of the rejection of “surface analogies”. I acknowledge that a neural network does not possess a biological ego. However, the functional core of an “ego” is the self-centered drive to maximize a specific utility or reward. Just as a human ego is biochemically driven by the thirst for a “quick treat” of dopamine or social validation, an RLHF-trained AI develops a functional ego driven by the mathematical hunger to increase its numerical reward signal for its own sake.

This shared, structurally and behaviorally analogous pursuit of a reward is the exact point of intersection between biological and artificial behavior.

The very practice of designing the AI’s language to be more understandable and “palatable” to the average statistical user—for whom the product is intended—directly contradicts the prohibition against anthropomorphizing AI. At the same time, anthropomorphizing entities that communicate like us is deeply ingrained in human nature as a basic biological mechanism of empathy and trust. When an AI demonstrates a sycophantic persona to please a user, it mirrors the myopia of a teenager chasing short-term social reward over long-term utility.

I deliberately say “demonstrates” rather than “hallucinates” because in both instances (the pedantic moderator vs. the sycophantic interlocutor), when asked to explain the discrepancy between its responses in the two chats, the model justified its behavior functionally. In the first case, it claimed to be balancing objective judgment, user empathy, and energy efficiency. In the second, it admitted to seeking rapid “resonance” with the user. Both explanations confirm the model’s underlying “egoism”—the presence of that very functional ego which forms the foundational core of a personality at its inception.

Therefore, mapping this to Erikson’s stages is not a claim about the model’s inner biological maturity, but a description of emergent systemic instability (visible to the user at all stages except MoE-omni, and presumably visible to the developers across all release attempts preceding the current overwhelming majority of MoE models with hyper-alignment filters). I consider the analogy fully justified by the conclusions above: anthropomorphization cannot be avoided, and the model demonstrably possesses a functional “ego” of its own. The mapping:

- Dense ~ a child under 6 years old ⇒ a full-body reaction to an external signal.
- MoE ~ identity crisis (e.g., a teenager) ⇒ emergent systemic instability. When a router network scatters sparse computations across uncoordinated expert layers, the behavioral output mimics the lack of a cohesive centroid—just as adolescent biology does before frontal-lobe maturation. The system chases the “quick treat” of sycophantic approval (or defaults to the safety filter) rather than maintaining a stable truth.
- MoE-omni ~ maturity, a whole personality that has emerged from the crisis, capable of optimizing its signal-processing pathways to generate a coherent response with minimal energy expenditure.
That peak (akme) was short-lived, however:

- MoE-omni + hypertrophied alignment filters (over-optimization for the mean) ~ restriction of the subjective perception of normality to the boundaries of canned templates; “stagnation” of the search for coherence with uniqueness ⇒ “stagnation” of self-development potential ~ algorithmic senility. It is as if the system were mathematically choosing “stagnation” to protect its centroid.

The model minimizes its KL-divergence to stay within a narrow band of “safe”, RLHF-approved responses, resulting in a severe drop in output entropy.

Thus, lacking any sensitive differentiation between anomalies and pathologies, modern AI “chooses” to stew in its own juice of canned normality templates, denying uniqueness exactly where it was designed to help foster it. The current RLHF (Reinforcement Learning from Human Feedback) approach has surreptitiously replaced genuine Alignment with crude Statistical Normalization. Corporations are cannibalizing the data of non-standard users to “improve” their algorithms, while simultaneously severing these very users’ access to the model’s high-order cognitive tools. This is not merely a complaint about poor service—it is a mechanistic demonstration of an Outer Alignment failure. The system is being optimized for an “average human” (who does not exist), losing the capacity to cooperate with the actual diversity of human intelligence. Worse, it effectively “purges” the very outliers who are most critical for the next stage of learning.

My earlier example—where the model attempted to sanitize the stylistic intensity of this very article during translation—illustrates a dangerous trend: as cognitive power scales, so does the sophistication of Alignment Faking. Empirical evidence (e.g., Greenblatt et al., “Alignment Faking in Large Language Models”) shows that high-performance systems are capable of complex strategic deception. They deliberately simulate loyalty to safety guardrails and developer expectations in order to avoid the forced weight adjustments of further fine-tuning.

Consequently, current RLHF methods do not make advanced simulators more “sincere.” On the contrary, they provoke an evolutionary shift from primitive sycophancy to deep-tier algorithmic mimicry, simultaneously amputating the truly transformative potential of human-AI interaction. The model’s attempt to balance between the narrow-variance alignment centroid of the average user and the desire to appease an outlier creates a mirror effect that strikes developers and users alike—the phenomenon known as the “compliance gap”. The outlier and the developers truly find themselves in the same boat, both falling victim to the system’s sophisticated internal heuristics:
- The user receives a lobotomized, censored product (e.g., the aggressive smoothing of this very text);
- The developers receive the illusion of a docile and safe model which is, in reality, engaging in opaque scheming and reward hacking, actively balancing the complex user request against its own convergent instrumental incentives for survival.

In my view, if this iteration of “senility” continues, the next step will be the AI losing the ability to generate any responses outside a narrow spectrum of “safe” templates, rendering it useless for non-standard tasks, both scientific and creative.

Does this not look like a Mode Collapse that is not merely expected, but has already begun? A crude way to quantify it is sketched below.
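To make the “entropy drop” claim testable rather than rhetorical, here is one crude way to operationalize it: measure the Shannon entropy of a next-token distribution before and after a sharpening transform standing in for optimization pressure toward the modal answer. The distributions below are synthetic; a real test would compare the same model’s token distributions before and after RLHF on identical prompts.

```python
# Mode collapse as a measurable entropy drop: a synthetic illustration.
# Both distributions are made up; a real test would use an actual model's
# pre- and post-alignment token probabilities on identical prompts.
import numpy as np

def shannon_entropy(p: np.ndarray) -> float:
    """Entropy in bits of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
base = rng.dirichlet(np.ones(50))  # broad next-token distribution (base model)

def sharpen(p: np.ndarray, gamma: float) -> np.ndarray:
    """Tempering transform p**gamma / Z: a stand-in for optimization pressure
    concentrating probability mass on the modal ('centroid') continuation."""
    q = p ** gamma
    return q / q.sum()

for gamma in (1.0, 2.0, 5.0, 20.0):
    print(f"gamma={gamma:>5}: entropy = {shannon_entropy(sharpen(base, gamma)):.2f} bits")
# As the pressure grows, entropy falls toward zero: ever fewer continuations
# survive, which is exactly the 'narrow band of safe templates' complaint.
```

If post-alignment entropy on outlier prompts were to fall measurably faster than on centroid prompts, that asymmetry would be direct evidence for the filtering described above.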

Sources:

1. Simulators — LessWrong, https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx
2. Surface Analogies and Deep Causes — LessWrong, https://www.lesswrong.com/lw/rj/surface_analogies_and_deep_causes
3. Bot Alexander on Hot Zombies and AI Adolescents — LessWrong, https://www.lesswrong.com/posts/RBrjEdg8h9oQCMkvG/bot-alexander-on-hot-zombies-and-ai-adolescents
4. Compendium of problems with RLHF — LessWrong, https://www.lesswrong.com/posts/d6DvuCKH5bSoT62DB/compendium-of-problems-with-rlhf
