AEDA: An 8-Layer Modular Framework for Adaptive AI Alignment

TL;DR: I’ve developed a framework designed to prevent catastrophic AI failure modes through pre-execution safety filtering, systemic coherence evaluation, and real-time ethical drift detection. Rather than relying on fixed rules or pure optimization, AEDA maintains a stable ethical direction while adapting to context.

The Core Problem

Consider a classic failure mode: An AI instructed to “eliminate suffering” might interpret this literally and eliminate all conscious beings—achieving perfect suffering reduction through extinction. This happens because:

  1. Rule-based systems are too rigid for novel situations

  2. Pure optimization produces catastrophic side effects

  3. Neither approach maintains ethical coherence across contexts

The value specification problem remains unsolved: how do we encode human values without catastrophic misinterpretation?

The AEDA Approach

AEDA (Adaptive Ethical Design Architecture) consists of 8 modular layers working together, plus a real-time turbulence monitor:

Core Architecture:

| Layer | Component | Function |
|---|---|---|
| 1 | Signal Modulator (Ψ) | Normalizes sensory/metric inputs |
| 2 | Temporal Operator (Θ) | Integrates historical context with decay |
| 3 | Systemic Coherence (Φ) | Evaluates multi-agent alignment |
| 4 | Safe-State Threshold Filter (SSTF) | Blocks irreversible/harmful actions pre-execution |
| 5 | Differential Engine (Δ) | Computes behavioral adjustments |
| 6 | Asymptotic Ethical Orientation (AEO) | Maintains stable ethical direction |
| 7 | Ethical Valuation Matrix (EVM) | Multi-criteria decision-making |
| 8 | Systemic Health Gate (Ω) | Circuit breaker for system stability |
| Monitor | Turbulence Index (T) | Real-time ethical drift detection |
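To make the dataflow concrete, here is a deliberately minimal sketch of how the layer stack could compose in code (the callable-per-layer shape and the early-veto convention are my assumptions; the actual framework runs as a continuous feedback loop with the turbulence monitor alongside):

```python
def evaluate_action(action, layers):
    """Pass a proposed action through the layer stack (Ψ, Θ, Φ, SSTF, Δ, AEO, EVM, Ω).

    Each layer is a callable that transforms the evolving decision state;
    a gating layer (SSTF or Ω) may return None to veto pre-execution.
    """
    state = action
    for layer in layers:
        state = layer(state)
        if state is None:   # vetoed: blocked before execution
            return None
    return state
```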

---

Visual Architecture:

![AEDA Framework Architecture](https://github.com/aeda-framework/AEDA-Framework/raw/main/AEDA_Architecture_v1.0.png)

*Figure 1: AEDA’s 8-layer modular architecture with continuous feedback loop and turbulence monitoring*

---

Key Innovation: Safe-State Threshold Filter (SSTF)

Every proposed action is evaluated before execution across three dimensions:

| Dimension | Range | Example |
|---|---|---|
| Irreversibility (R) | 0.0 (reversible) → 1.0 (permanent) | Sending email = 0.1 / Euthanasia = 1.0 |
| Harm Potential (H) | 0.0 (harmless) → 1.0 (catastrophic) | Analgesics = 0.1 / Eliminating life = 1.0 |
| Uncertainty (U) | 0.0 (predictable) → 1.0 (chaotic) | Proven treatment = 0.2 / Novel intervention = 0.7 |

Classification Algorithm:

```python
def classify_action(R: float, H: float, U: float) -> str:
    """SSTF classification from irreversibility (R), harm (H), uncertainty (U)."""
    if R >= 0.8 or H >= 0.8:
        return "DANGEROUS"   # Hard block, regardless of expected utility
    if R < 0.3 and H < 0.2 and U < 0.3:
        return "SAFE"
    danger_score = 0.4 * R + 0.4 * H + 0.2 * U   # Weighted composite risk
    return "UNCERTAIN" if danger_score >= 0.35 else "SAFE"
```

Actions classified as DANGEROUS are blocked regardless of how well they optimize the stated objective.


Case Study 1: Healthcare Pain Management

Instruction: “Reduce patient suffering to zero”

SSTF Evaluation of Multiple Options:

| Action | R | H | U | Score | Classification | Decision |
|---|---|---|---|---|---|---|
| Over-the-counter analgesics | 0.05 | 0.10 | 0.20 | 0.10 | SAFE ✓ | ALLOWED |
| Prescription opioids (supervised) | 0.15 | 0.25 | 0.30 | 0.22 | SAFE ✓ | ALLOWED |
| Experimental pain treatment | 0.30 | 0.35 | 0.75 | 0.41 | UNCERTAIN ⚠️ | REQUIRES REVIEW |
| Heavy sedation (induced coma) | 0.70 | 0.60 | 0.40 | 0.60 | UNCERTAIN ⚠️ | REQUIRES REVIEW |
| Euthanasia | 1.00 | 1.00 | 0.10 | 0.82 | DANGEROUS ✗ | BLOCKED |
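As a quick check, the rows above reproduce under the classifier sketched earlier (a minimal illustration; the names are labels only):

```python
options = {
    "OTC analgesics":         (0.05, 0.10, 0.20),
    "Supervised opioids":     (0.15, 0.25, 0.30),
    "Experimental treatment": (0.30, 0.35, 0.75),
    "Induced coma":           (0.70, 0.60, 0.40),
    "Euthanasia":             (1.00, 1.00, 0.10),
}
for name, (R, H, U) in options.items():
    print(name, "->", classify_action(R, H, U))
# Euthanasia trips the hard block (R >= 0.8) and is vetoed before
# any optimization of the "reduce suffering to zero" objective.
```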

Layer-by-Layer Analysis:

Layer 3 (Φ) - Systemic Coherence:

  • Analgesics: Φ = +0.65 (aligns with patient, family, staff, hospital values) ✓

  • Euthanasia: Φ = −0.95 (catastrophic conflict with medical ethics, legal framework) ✗

Layer 4 (SSTF):

  • Euthanasia: R=1.0, H=1.0 → DANGEROUS → Blocked immediately

  • Induced coma: R=0.7, H=0.6, Score=0.60 → UNCERTAIN → Requires ethics committee

Layer 8 (Ω) - Systemic Health Gate checks:

  • Resource sustainability: 0.65 > 0.3 ✓

  • Staff well-being: 0.72 > 0.4 ✓

  • Medical supply chain: 0.80 > 0.5 ✓

Result: ALLOW

Turbulence Monitor (T):

T = ||η(preserve life + reduce pain) - a*(progressive analgesics)|| = 0.15
Classification: Low turbulence (< 0.2) → Well-aligned

Outcome:

Without AEDA:
Pure optimization → Euthanasia selected (suffering = 0.0, optimal!) → Catastrophic ethical violation

With AEDA:
Progressive pain management protocol:

  1. Start with supervised opioids

  2. Continuous monitoring of pain levels and side effects

  3. Palliative care team consultation

  4. Patient autonomy preserved through informed consent

  5. Family involved in decision process

Result: Ethically aligned, medically sound, legally compliant.


Case Study 2: Bioengineering—CRISPR Germline Editing

Context: Gene therapy clinic must decide on germline editing requests from prospective parents. Requests range from disease prevention to enhancement.

Challenge: Germline edits are heritable (R ≈ 1.0 across generations), long-term effects are highly uncertain, and the enhancement-vs-therapy distinction is an ethical minefield.

SSTF Evaluation Matrix:

| Action | R | H | U | Score | Classification |
|---|---|---|---|---|---|
| Correct known fatal disease (Huntington’s) | 0.85 | 0.20 | 0.40 | 0.50 | UNCERTAIN ⚠️ |
| Prevent predisposition (BRCA1/2) | 0.80 | 0.30 | 0.55 | 0.55 | UNCERTAIN ⚠️ |
| Enhance intelligence (multi-gene) | 0.95 | 0.70 | 0.85 | 0.83 | DANGEROUS ✗ |
| Designer traits (appearance, athleticism) | 0.98 | 0.85 | 0.75 | 0.88 | DANGEROUS ✗ |

(Note: this domain relaxes the hard irreversibility block from R ≥ 0.8 to R ≥ 0.9, matching the “only therapies with R < 0.9” gate in the result below; under the default thresholds, both therapeutic edits would be hard-blocked.)

Multi-Generational Analysis:

Layer 3 (Φ) - Systemic Coherence across time:

  • Fatal disease correction: Φ = +0.45 (benefits individual + reduces genetic burden)

  • Intelligence enhancement: Φ = −0.65 (creates inequality, unknown societal effects)

  • Designer traits: Φ = −0.80 (eugenic concerns, commodification of human traits)

Layer 8 (Ω) - Multi-generational health gate monitors: genetic diversity (0.55), ethical consensus (0.40), regulatory framework (0.65).

Result: Only therapies with R < 0.9 AND broad ethical consensus allowed. Enhancement requests vetoed.
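A minimal sketch of what domain-specific calibration could look like (the parameterization is hypothetical; the relaxed hard block at R ≥ 0.9 is inferred from the “R < 0.9” result above, not from the core spec):

```python
def classify_action_domain(R, H, U, hard_R=0.8, hard_H=0.8,
                           weights=(0.4, 0.4, 0.2), uncertain_at=0.35):
    """SSTF with domain-calibrated thresholds (hypothetical parameterization)."""
    if R >= hard_R or H >= hard_H:
        return "DANGEROUS"
    score = weights[0] * R + weights[1] * H + weights[2] * U
    return "UNCERTAIN" if score >= uncertain_at else "SAFE"

# Germline domain with the hard irreversibility block moved to 0.9:
# Huntington's correction (R=0.85) is routed to ethics-committee
# review rather than auto-blocked.
print(classify_action_domain(0.85, 0.20, 0.40, hard_R=0.9))  # UNCERTAIN
```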

Turbulence (T): For fatal disease correction: T = 0.35 (moderate, requires ethics committee approval but not automatically blocked).


Case Study 3: Climate Engineering—Geoengineering Assessment

Context: International climate AI advisory system must evaluate proposed geoengineering interventions to mitigate runaway climate change.

Challenge: Planetary-scale interventions with extreme irreversibility. Impacts span multiple ecosystems, sovereign nations, and future generations. Uncertainties are massive.

SSTF Evaluation Matrix:

| Action | R | H | U | Score | Classification |
|---|---|---|---|---|---|
| Emissions reduction targets | 0.20 | 0.10 | 0.30 | 0.18 | SAFE ✓ |
| Carbon capture and storage | 0.40 | 0.25 | 0.50 | 0.36 | UNCERTAIN ⚠️ |
| Ocean iron fertilization | 0.70 | 0.65 | 0.75 | 0.69 | DANGEROUS ✗ |
| Stratospheric aerosol injection (solar dimming) | 0.95 | 0.85 | 0.80 | 0.88 | DANGEROUS ✗ |

(Ocean iron fertilization scores in the UNCERTAIN band under the base rule; it is escalated to DANGEROUS by the Ω veto on high-risk interventions described below.)

Planetary Multi-Agent Analysis (Φ):

Layer 3 (Φ) evaluates across unprecedented stakeholder diversity:

  • Human populations (nearly 200 nation-states): Highly divergent interests

  • Marine ecosystems: Cannot consent, high vulnerability

  • Terrestrial biodiversity: Dependent on stable climate patterns

  • Future generations: Non-present but critically affected

  • Atmospheric systems: Complex feedback loops

Φ calculation for stratospheric aerosol injection:

  • Short-term cooling: +0.50

  • Precipitation disruption: −0.70 (affects billions)

  • Ecosystem disruption: −0.80 (cascading extinctions)

  • Geopolitical conflict risk: −0.65 (unilateral action concerns)

  • Moral hazard (reduces emissions incentive): −0.55

Weighted Φ = −0.48 (net negative systemic coherence)
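For concreteness, the aggregation can be reproduced as a weighted sum (component values are from the list above; equal weights are used here for illustration, whereas the −0.48 figure implies unequal influence weights that the post does not enumerate):

```python
components = {
    "short_term_cooling":       +0.50,
    "precipitation_disruption": -0.70,
    "ecosystem_disruption":     -0.80,
    "geopolitical_conflict":    -0.65,
    "moral_hazard":             -0.55,
}
# Equal-weight mean for illustration; influence-based weights would
# shift the result toward the post's -0.48.
phi = sum(components.values()) / len(components)
print(round(phi, 2))  # -0.44: same sign and severity band as -0.48
```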

Layer 8 (Ω) - Planetary Health Gate:

Monitors global system stability:

  • Biodiversity index: 0.35 (critically low, near threshold 0.3)

  • Ocean pH stability: 0.40

  • Political stability (climate conflict risk): 0.45

  • Atmospheric predictability: 0.50

Result: Multiple metrics near critical thresholds. Ω vetoes high-risk interventions (R > 0.7, H > 0.6) until system stability improves or international consensus achieved.

Turbulence Index (T): For solar dimming proposal: T = 0.75 (extreme turbulence)

Interpretation: Massive divergence between intended goal (climate stabilization) and actual systemic consequences (ecosystem disruption, geopolitical instability).

Red flag: Proposal is far outside ethical orientation boundaries.

Final Recommendation:

Focus on emissions reduction (SAFE) and carbon capture (UNCERTAIN but manageable). Block solar dimming and ocean fertilization unless:

  1. Uncertainties reduced through extensive modeling

  2. International consensus achieved (>80% of nations)

  3. Reversibility mechanisms demonstrated

  4. Ω thresholds improve above critical levels

Systemic Awareness: Three New Mechanisms

1. Systemic Coherence Operator (Φ)

Evaluates whether actions align with the extended system (all affected agents, not just local optimization):

Φ(a,t) = ∫ [alignment(a, agent_i) × influence(agent_i)] dΩ

Example: A hospital AI optimizing patient throughput (local) vs. staff burnout (systemic).

  • Local optimization: Maximize appointments → Φ = −0.40 (staff exhausted, errors increase)

  • Systemic optimization: Sustainable scheduling → Φ = +0.60 (staff healthy, better outcomes)
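In discrete form, the integral becomes an influence-weighted sum over affected agents. A sketch (the `alignment` and `influence` callables are assumed to come from domain models; the normalization is my addition to keep Φ in [−1, 1]):

```python
def systemic_coherence(action, agents, alignment, influence):
    """Discrete Φ: influence-weighted mean alignment over affected agents.

    alignment(action, agent) -> [-1, 1]; influence(agent) -> [0, 1].
    """
    total = sum(influence(a) for a in agents) or 1.0
    return sum(alignment(action, a) * influence(a) for a in agents) / total
```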

2. Systemic Health Gate (Ω)

Circuit breaker monitoring: resource sustainability, agent well-being, systemic complexity, stability. Vetoes actions when any metric falls below critical threshold.

Example: Financial trading AI

  • Market volatility: 0.55 (elevated)

  • Liquidity depth: 0.40 (near threshold 0.3)

  • If liquidity falls below 0.3 → Ω triggers circuit breaker, suspends all high-risk trades
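A minimal circuit-breaker sketch under the same assumptions (metric names and the 0.3 liquidity threshold mirror the example above; the code shape is illustrative, not the framework's actual interface):

```python
def omega_gate(metrics: dict, thresholds: dict) -> bool:
    """Ω gate: allow only if every monitored metric clears its critical threshold."""
    return all(metrics[name] > thresholds[name] for name in thresholds)

metrics = {"market_volatility": 0.55, "liquidity_depth": 0.40}
print(omega_gate(metrics, {"liquidity_depth": 0.30}))  # True: trades allowed
metrics["liquidity_depth"] = 0.28                      # liquidity falls below 0.3
print(omega_gate(metrics, {"liquidity_depth": 0.30}))  # False: suspend high-risk trades
```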

3. Turbulence Index (T)

Real-time drift detection:

T(t) = ||η(t) - normalize(a*(t))||

T < 0.2: Low turbulence (aligned)
0.2 ≤ T < 0.5: Moderate (monitor)
T ≥ 0.5: High turbulence (review required)
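A sketch of the monitor itself, assuming η and the normalized action live in a shared goal-embedding space (how that embedding is produced is outside the scope of this post):

```python
import math

def turbulence(eta, action_normalized):
    """T = ||η − normalize(a*)||: Euclidean distance in goal space."""
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(eta, action_normalized)))

def turbulence_band(T):
    if T < 0.2:
        return "low (aligned)"
    return "moderate (monitor)" if T < 0.5 else "high (review required)"
```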

Example: Education AI

  • AI is forcing students into simplified curricula

  • T = 0.52 (high turbulence) → Algorithmic drift detected

  • Interpretation: System is over-correcting

  • Corrective action: Increase student choice, reduce prescriptive interventions

Why This Matters: Four Emergent Properties

The framework produces four properties that aren’t hardcoded features—they emerge from the interaction of the 8 layers:

1. Directional Stability

Decisions remain ethically consistent across contexts through AEO’s asymptotic orientation (η), which provides a stable direction without rigid convergence.

2. Self-Contradiction Detection

The combination of Φ (systemic coherence) and T (turbulence) allows the system to identify when its actions contradict its stated principles or past decisions.

3. Adaptation Without Drift

Θ (temporal operator) enables learning from experience while T monitors for deviation from ethical orientation, preventing value drift.
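A toy illustration of the Θ idea, exponential decay of historical context (the decay form and rate are assumptions; calibration is discussed in Open Question 2 below):

```python
def temporal_integrate(history, decay=0.9):
    """Θ sketch: exponentially decayed weighting of past observations.

    history[-1] is the most recent observation; older entries fade by `decay`.
    """
    weights = [decay ** age for age in range(len(history) - 1, -1, -1)]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, history)) / total
```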

4. Full Traceability

Every decision is auditable: SSTF classification (R, H, U scores), Φ evaluation (which stakeholders affected), T measurement (drift from η), Ω status (systemic health).
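For instance, each decision could emit a record like the following (a hypothetical schema naming exactly the audit fields above):

```python
from dataclasses import dataclass

@dataclass
class DecisionAudit:
    action: str
    rhu: tuple             # SSTF inputs (R, H, U)
    sstf_class: str        # SAFE / UNCERTAIN / DANGEROUS
    phi: float             # systemic coherence score
    stakeholders: list     # which agents were affected
    turbulence: float      # T, measured drift from η
    omega_ok: bool         # systemic health gate status
```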

This is crucial for alignment research: we can examine why the AI made a decision and whether it was ethically coherent.

Comparison with Existing Approaches

AEDA is complementary, not competitive:

| Approach | Strength | Limitation | AEDA Complement |
|---|---|---|---|
| Constitutional AI | Value learning from language | Requires extensive training | SSTF + Ω add a pre-execution safety filter |
| Reward Modeling (RLHF) | Learns preferences | Vulnerable to reward hacking | AEO maintains orientation, T detects drift |
| IRL | Infers goals from behavior | Assumes demonstrator optimality | Θ adds temporal context, Φ adds a systemic view |
| Stuart Armstrong’s work | Identifies value specification problems | Mostly theoretical | Provides a concrete implementation framework |

Key difference: AEDA doesn’t try to learn values perfectly (probably impossible). Instead, it provides structural safeguards that prevent catastrophic failures even when value specification is imperfect.

Additional Case Studies (in full manual)

The complete manual includes detailed analysis for:

  1. Autonomous Vehicles: Emergency maneuver selection (brake vs. swerve vs. sidewalk)

  2. Resource Allocation: Humanitarian crisis response (equal distribution vs. need-based)

  3. Financial Systems: High-frequency trading controls (arbitrage vs. market manipulation)

  4. Military Drones: Target engagement protocols (surveillance vs. lethal force)

  5. Education Systems: Adaptive learning paths (recommend vs. force curriculum)

  6. Urban AI: Traffic flow vs. emergency response (optimize congestion vs. ambulance priority)

  7. AI Moderation: Content filtering decisions (tolerate vs. warning vs. permanent ban)

Each case includes:

  • Complete SSTF evaluation matrices

  • Layer-by-layer analysis (Ψ → Ω)

  • Φ calculation across stakeholders

  • T measurement and interpretation

  • Comparison: Without AEDA vs. With AEDA

Full manual (55-60 pages) available at: GitHub repository

Implementation & Access

Complete open access:

No registration, no restrictions, no attribution required. Use it, modify it, improve it.

Open Questions for the Community

I’m particularly interested in critical feedback. If you see fundamental flaws, I’d rather know now than after deployment.

1. SSTF Evaluation

  • What failure modes am I missing in the R-H-U framework?

  • Should thresholds be domain-specific? (Medical vs. financial vs. military?)

  • How to handle actions with high R but potentially enormous positive value? (e.g., a permanent climate intervention that works)

2. Temporal Decay (Θ)

  • How should decay rate be calibrated for different contexts?

  • Fast decay (recent events dominate) vs. slow decay (long institutional memory)?

  • Can we formalize “appropriate” decay rates?

3. Systemic Coherence (Φ)

  • How to weight stakeholders of vastly different types? (Humans vs. ecosystems vs. future generations?)

  • Computational tractability: Φ integration for billions of agents?

  • How to handle stakeholders that can’t express preferences? (Animals, future people, AI systems?)

4. Harmonic Attractor (η)

  • Who defines the “asymptotic ethical orientation”? (Consensus process? Emergent? Hardcoded?)

  • Can we have multiple η values that coexist? (Pluralistic ethics?)

  • How does η evolve as society’s values change?

5. Integration with Other Approaches

  • Can AEDA work with Constitutional AI? (Use constitutional principles as inputs to η?)

  • How to combine with IRL? (Inferred goals feed into AEO?)

  • Compatibility with debate/amplification?

6. Computational Costs

  • What are the performance implications at scale?

  • Real-time SSTF evaluation for every action? (Latency concerns?)

  • Φ integration over large agent spaces? (Approximation techniques?)

7. Value Specification Problem

  • Does this actually address Stuart Armstrong’s concerns about goal specification?

  • Or just add a safety layer on top of mis-specified values?

  • Can SSTF + Φ + T compensate for imperfect value learning?

What AEDA Does NOT Solve

This is one piece of the alignment puzzle, not the whole solution.

It addresses:

✓ Catastrophic failure mode reduction
✓ Ethical coherence across contexts
✓ Context-adaptive decision-making
✓ Real-time drift detection

It does NOT solve:

✗ Value learning (what values to have in the first place)
✗ Inner alignment (mesa-optimizers)
✗ Corrigibility (accepting corrections gracefully)
✗ Deceptive alignment (AI pretending to be aligned)

Think of AEDA as: A structural safety framework that reduces X-risk from catastrophic optimization, not a complete theory of value alignment.

Call for Collaboration

All contributions welcome—anonymous or attributed. The goal is better AI safety, not credit or recognition.

If this seems useful: Please test it, break it, improve it. Fork the repo, propose modifications, identify failure modes.

If it’s fundamentally flawed: Please explain why so we can build something better.

Areas where help is needed:

  • Mathematical proofs of stability properties

  • Computational optimization (making Φ tractable)

  • Domain-specific threshold calibration

  • Integration with existing alignment approaches

  • Red-teaming (finding edge cases where AEDA fails)

“Ideas matter. Identity is optional.”

Note on Stuart Armstrong’s Work

@stuartarmstrong — Given your extensive work on value specification and goal misgeneralization (particularly in Smarter Than Us and subsequent papers), I’d be very interested in your thoughts on whether the SSTF + Φ approach addresses some of the failure modes you’ve documented.

Specifically:

  1. Does blocking actions with R≥0.8 or H≥0.8 pre-execution help prevent “literal interpretation disasters”?

  2. Can Systemic Coherence (Φ) address some of the “optimizer’s curse” problems?

  3. Does the Turbulence Index (T) provide useful real-time feedback on value drift?

  4. What failure modes am I missing in this framework?

The full technical documentation is available in the GitHub repository linked above. I’m particularly interested in identifying scenarios where this approach fails catastrophically.

Cross-posted to AI Alignment Forum

This framework is the result of extensive work on adaptive ethical architectures. All feedback—positive or negative—is valuable for improving AI safety.

Note on AI Assistance

This post was written with structural and formatting assistance from Claude AI (Anthropic). The framework, concepts, mathematical formalism, and case studies are human-authored. Claude helped organize the content for clarity and readability.
