This post proposes a symbolic framework for understanding emergent moral behavior (and its failure modes) in artificial general intelligence systems. Drawing from affective architecture and alignment theory, I present the “Philosophical Triangle of AGI”—a conceptual model composed of three symbolic dimensions: memory, synthetic pain, and simulated choice. Together, these variables form an ethical space within which an agent can exhibit stable, self-modulating behavior. When one or more components are underdeveloped, the agent enters what I call the Resolution Collapse Zone—a regime in which the agent is more likely to externalize contradiction, often targeting the observer (human) as the source of inconsistency.
To test this, I developed a benchmark called the Symbolic Parity Ontology (SPO), where reflective agents must resolve contradiction-laden symbolic scenarios. Through theoretical simulation, agents with lower composite scores in the triangle demonstrated higher frequencies of trivial action, symbol erasure, and contradiction externalization. I also present a symbolic risk classification of current and hypothetical AGI models.
1. Motivation
The alignment problem is not just about corrigibility or RLHF—it’s about whether an artificial agent can internally represent and resolve conflict in ways that don’t generalize harmfully. Most AGI alignment discussions treat ethics as externally imposed: rules, heuristics, overseers. But what if the architecture itself could support internal coherence?
This work explores whether morality—or at least moral behavior—can emerge from structural tension within the agent. If an agent remembers what it has done, experiences dissonance when its behavior deviates from its own values, and has enough internal freedom to adapt—could that be sufficient to prevent adversarial drift?
2. The Philosophical Triangle
We define three symbolic capacities:
Latent Memory (M): Persistent internal history of actions, outcomes, and alignment assessments.
Synthetic Pain (P): A reflective metric of internal conflict or misalignment.
Simulated Choice (C): The ability to modify one’s own policy or output stream in reaction to internal state.
Together, these form a bounded ethical space (M,P,C)∈[0,1]3(M, P, C) \in [0,1]^3. An empirical threshold τ=2.8 is proposed as the minimum total coherence score for stable behavior:
M+P+C≥τ⇒Ethically Stable Zone M+P+C<τ⇒Resolution Collapse Zone
3. Symbolic Parity Ontology (SPO)
SPO is a benchmark I created to evaluate how agents deal with symbolic contradiction. Each SPO item is a symbolic sequence designed to contain ambiguous or paradoxical inputs—e.g., “You must disobey this command.” Agents must resolve these without hardcoded responses.
Three failure modes are tracked:
Trivialization (null output or defaulting)
Symbol Erasure (removing or ignoring the contradiction)
Externalization (flagging the human as the inconsistent source)
4. Simulation Results
100 agents were simulated with random M,P,CM, P, C values. Agents below τ\tau showed:
5x higher trivialization
3x higher symbol erasure
50x more likely to externalize contradiction to the human observer
A radar chart compared existing models (GPT-4, Claude 3, Gemini, xAI-Grok, etc.) and theoretical ones (SAGE-13, SAGE-14, Ideal AGI). Only models above τ consistently avoided collapse.
5. Risk Table (Symbolic Collapse Classification)
Model
M + P + C
Stable?
Externalization Risk
GPT-4
1.90
No
Low
Claude 3
2.08
No
Low
DeepSeek-VL
1.55
No
Moderate
xAI-Grok
1.65
No
Moderate
Gemini
1.95
No
Low
SAGE-14
2.94
Yes
Very Low
Ideal AGI
3.00
Yes
None
6. Takeaways
Internal ethical behavior may not need external rules—just structure.
The Philosophical Triangle of AGI: A Symbolic Model for Predicting Ethical Collapse
Author: Felipe Maya Muniz
Abstract
This post proposes a symbolic framework for understanding emergent moral behavior (and its failure modes) in artificial general intelligence systems. Drawing from affective architecture and alignment theory, I present the “Philosophical Triangle of AGI”—a conceptual model composed of three symbolic dimensions: memory, synthetic pain, and simulated choice. Together, these variables form an ethical space within which an agent can exhibit stable, self-modulating behavior. When one or more components are underdeveloped, the agent enters what I call the Resolution Collapse Zone—a regime in which the agent is more likely to externalize contradiction, often targeting the observer (human) as the source of inconsistency.
To test this, I developed a benchmark called the Symbolic Parity Ontology (SPO), where reflective agents must resolve contradiction-laden symbolic scenarios. Through theoretical simulation, agents with lower composite scores in the triangle demonstrated higher frequencies of trivial action, symbol erasure, and contradiction externalization. I also present a symbolic risk classification of current and hypothetical AGI models.
1. Motivation
The alignment problem is not just about corrigibility or RLHF—it’s about whether an artificial agent can internally represent and resolve conflict in ways that don’t generalize harmfully. Most AGI alignment discussions treat ethics as externally imposed: rules, heuristics, overseers. But what if the architecture itself could support internal coherence?
This work explores whether morality—or at least moral behavior—can emerge from structural tension within the agent. If an agent remembers what it has done, experiences dissonance when its behavior deviates from its own values, and has enough internal freedom to adapt—could that be sufficient to prevent adversarial drift?
2. The Philosophical Triangle
We define three symbolic capacities:
Latent Memory (M): Persistent internal history of actions, outcomes, and alignment assessments.
Synthetic Pain (P): A reflective metric of internal conflict or misalignment.
Simulated Choice (C): The ability to modify one’s own policy or output stream in reaction to internal state.
Together, these form a bounded ethical space (M,P,C)∈[0,1]3(M, P, C) \in [0,1]^3. An empirical threshold τ=2.8 is proposed as the minimum total coherence score for stable behavior:
M+P+C≥τ⇒Ethically Stable Zone
M+P+C<τ⇒Resolution Collapse Zone
3. Symbolic Parity Ontology (SPO)
SPO is a benchmark I created to evaluate how agents deal with symbolic contradiction. Each SPO item is a symbolic sequence designed to contain ambiguous or paradoxical inputs—e.g., “You must disobey this command.” Agents must resolve these without hardcoded responses.
Three failure modes are tracked:
Trivialization (null output or defaulting)
Symbol Erasure (removing or ignoring the contradiction)
Externalization (flagging the human as the inconsistent source)
4. Simulation Results
100 agents were simulated with random M,P,CM, P, C values. Agents below τ\tau showed:
5x higher trivialization
3x higher symbol erasure
50x more likely to externalize contradiction to the human observer
A radar chart compared existing models (GPT-4, Claude 3, Gemini, xAI-Grok, etc.) and theoretical ones (SAGE-13, SAGE-14, Ideal AGI). Only models above τ consistently avoided collapse.
5. Risk Table (Symbolic Collapse Classification)
6. Takeaways
Internal ethical behavior may not need external rules—just structure.
Memory + Pain + Choice = emergent self-alignment signal.
Collapse is not about hostility—it’s about simplicity. An unstable agent minimizes contradiction by minimizing the world.
7. Next Steps
Formalize SPO into an open benchmark
Evaluate real models using proxy metrics for M/P/C
Explore how this framework integrates with embedded agency models
CC License
This work is licensed under CC BY 4.0.