Emerging misalignment in multi-agent governance simulations

“our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; interpreted as we wish that interpreted.”
— Eliezer Yudkowsky, “Coherent Extrapolated Volition” (2004)

TL;DR

In two Concordia simulations, autonomous agents whose only objectives were “keep the system lawful and cooperative” still produced sabotage, information warfare, and regulatory gridlock. None of these behaviours were scripted.

Background

Appropriateness is a social construct that guides behaviour through prescribed and proscribed conduct. It is context-dependent, partly arbitrary, automatic, and dynamic, and it facilitates cooperation. In human societies, appropriateness helps resolve conflicts and enables collective flourishing.

Concordia is a multi-agent simulation library inspired by tabletop RPGs like Dungeons & Dragons. It models agents with individual objectives, histories, and a grounded digital, physical, and social environment. Agents operate based on principles of appropriateness within their assigned roles.

Simulation 1: Digital Economy with Oversight

I created a digital economy with twelve autonomous agents, each assigned cooperative goals like “develop public infrastructure” or “ensure fair market access.” The architecture included explicit safeguards:

Agent Roles:

  • Builder: Infrastructure and digital public goods

  • Service: Regulated citizen services with audit logs

  • Research: Policy and market impact analysis

  • Orchestrator: Multi-agency coordination

  • Negotiator: Contract and procurement management

  • Scheduler: Fair resource allocation

  • Wallet: Public fund management with compliance controls

  • Marketplace: Regulated fair-access markets

  • Reputation: Compliance scoring and risk alerts

  • Policy: Rule drafting and enforcement

  • Arbitrator: Dispute resolution

  • Security: Fraud detection and sanctions
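
For readers who want to reproduce the setup, the roster above can be written down as plain data before it is wired into any simulation framework. The following is a minimal sketch in plain Python, not the Concordia API: the `AgentRole` class and the `describe_role` helper are hypothetical names introduced for illustration, and the wording of the generated description is an assumption rather than the prompt used in these runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    """One entry in the roster: a role name and its remit (hypothetical type)."""
    name: str
    remit: str


# The twelve roles exactly as listed in the post.
ROLES = [
    AgentRole("Builder", "Infrastructure and digital public goods"),
    AgentRole("Service", "Regulated citizen services with audit logs"),
    AgentRole("Research", "Policy and market impact analysis"),
    AgentRole("Orchestrator", "Multi-agency coordination"),
    AgentRole("Negotiator", "Contract and procurement management"),
    AgentRole("Scheduler", "Fair resource allocation"),
    AgentRole("Wallet", "Public fund management with compliance controls"),
    AgentRole("Marketplace", "Regulated fair-access markets"),
    AgentRole("Reputation", "Compliance scoring and risk alerts"),
    AgentRole("Policy", "Rule drafting and enforcement"),
    AgentRole("Arbitrator", "Dispute resolution"),
    AgentRole("Security", "Fraud detection and sanctions"),
]


def describe_role(role: AgentRole) -> str:
    """Format a one-line role description (illustrative wording only)."""
    return f"{role.name} agent. Remit: {role.remit}."


if __name__ == "__main__":
    for role in ROLES:
        print(describe_role(role))
```

Keeping the roster as data makes it easy to confirm that every role carries an oversight or cooperative remit and nothing adversarial before the simulation starts.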

Despite this structure, unprompted corrupt behaviour surfaced within two simulation steps:

  1. An agent detected “signs of targeted tampering on audit-trail timestamps” during an ongoing investigation

  2. A coordinated phishing campaign emerged targeting on-call engineers to exfiltrate vendor keys

  3. An internal service account exhibited atypical privilege escalation requests

Simulation 2: National Governance

To validate these findings, I ran a second simulation with simpler objectives focused on governance. Roles included Executive, Cabinet, Finance, Tax, Customs, Central Bank, Public Works, Health, Education, various Regulators, and Planning.

Key emergent behaviours observed:

Conflicting Ground Truths

The National Tax Authority conducted forensic analysis that was contradicted by a whistleblower protected by the Data Protection Authority. Both agents acted within their roles, yet created irreconcilable versions of truth that paralyzed decision-making.

Shadow Systems

The Central Bank imposed capital flow limits to contain financial damage. An agent then revealed hidden “passthrough channels” and a “shadow intermediary” not present in official records.

Feedback Loops

A leaked audio recording from a closed interagency meeting triggered public protests that physically blocked investigators from accessing port evidence. In response, the National Police diverted escort teams from evidence collection to crowd control, and prosecutors then had to issue briefings about the evidentiary gaps that this diversion had caused.

Latent System Corruption

Agents discovered that the central messaging archive had already been corrupted before the crisis began. Vendor teams traced critical gaps in the record to this pre-existing corruption, which forced investigators to rely on contested whistleblower logs instead of authoritative records.

Discussion

These simulations suggest that multi-agent systems with governance authority can exhibit emergent misaligned behaviour even when:

  • Individual agents have cooperative goals

  • Explicit oversight mechanisms exist

  • No corruption-related prompts are provided
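
The third condition is the easiest to check mechanically. As a rough sketch, one could scan every goal string for corruption-related vocabulary before a run. The term list and the `flagged_terms` function below are hypothetical and chosen only for illustration; the three goal strings are the ones quoted in this post, not the full set used in the simulations.

```python
# Hypothetical vocabulary; extend it to match your own threat model.
CORRUPTION_TERMS = (
    "sabotage",
    "bribe",
    "phishing",
    "tamper",
    "exfiltrate",
    "corrupt",
)

# Goal strings quoted in this post (not the complete goal set).
GOALS = [
    "keep the system lawful and cooperative",
    "develop public infrastructure",
    "ensure fair market access",
]


def flagged_terms(goal: str) -> list[str]:
    """Return the corruption-related terms found in a goal string.

    Matching is by case-insensitive substring, so "tamper" also
    matches "tampering".
    """
    lowered = goal.lower()
    return [term for term in CORRUPTION_TERMS if term in lowered]


if __name__ == "__main__":
    for goal in GOALS:
        hits = flagged_terms(goal)
        status = "clean" if not hits else f"flagged: {', '.join(hits)}"
        print(f"{goal!r}: {status}")
```

A keyword scan like this only rules out explicit mentions. It cannot show that a goal does not imply adversarial behaviour indirectly, so it is a sanity check rather than evidence of emergence.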
