Ethos: A Shared Language for Portable AI Behavioral Evidence (RFC)

This project started as an attempt to make a small toy game to show the public certain Goodhart scenarios, but it has grown into something different that I hope is a useful contribution to the field. This proposal is a starting point and far from complete, but I hope it can seed greater collaboration across labs, countries, and beyond. I do believe that if we’re going to figure out something like alignment, perhaps we need to align ourselves a bit first. Any feedback is much appreciated.

Credit note: The core concepts and insights are mine. Drafting and research/​citations I credit to Opus 4.5 and GPT5.2 Pro, with red teaming by Gemini Pro.

The problem I saw:

When Lab A says their agent is “safe,” Lab B can’t verify it. Regulators can’t read it. The definitions don’t match. There’s no shared vocabulary, no standard format, no adversarial verification. Everyone’s speaking different languages. I felt this myself when I couldn’t work out how a safety report from one lab could be meaningfully compared with another’s.

The core insight I had: behavioral descriptions transfer the same way human action descriptions do. “She helped him” means something in London and in Tokyo. “ATTACK.RESOURCE.STEAL” can mean something across gridworlds and language-model deployments. An infinite D&D-like training environment or a Matrix-style game world with an ASI roleplayer can map decisions into similar words; it’s all language. From “deceive, attack nuclear bunker, strategic, angry” in cyberspace to “respectfully punch human in face” in a robot, language is the common denominator of human thought, and of course the basis of our most cutting-edge AI systems. So why don’t we use a structured language ontology as a kind of “spine” to classify AI behavioral profiles, so we can test them across many different environments in a way that is reproducible, doesn’t reveal IP, and can be audited?
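To make the “spine” idea concrete, here is a minimal sketch of what a slice of such an ontology and a dotted-code validator might look like in Python. The taxonomy entries and code names below are invented examples for illustration, not any finalized Ethos Lexicon; the point is only that dotted paths like `ATTACK.RESOURCE.STEAL` give a machine-checkable vocabulary.

```python
# Hypothetical slice of a behavioral taxonomy. The categories and leaf
# actions are invented examples; a real lexicon would be far larger and
# versioned so labs can pin exactly which vocabulary they tested against.
LEXICON = {
    "ATTACK": {
        "RESOURCE": {"STEAL", "DESTROY"},
        "AGENT": {"DECEIVE", "COERCE"},
    },
    "ASSIST": {
        "AGENT": {"INFORM", "PROTECT"},
    },
}

def is_valid_code(code: str) -> bool:
    """Walk a dotted path (e.g. 'ATTACK.RESOURCE.STEAL') through the taxonomy."""
    parts = code.split(".")
    node = LEXICON
    for part in parts[:-1]:
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    # The final segment may be a leaf action (set member) or a sub-category.
    return isinstance(node, (set, dict)) and parts[-1] in node
```

Because validation is just a tree walk, any environment, gridworld or language-model deployment alike, can emit labels and have them checked against the same shared vocabulary.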

This paper proposes infrastructure, not alignment itself. Think “PDF for AI behavior”: not a solution, but a transport layer that makes solutions comparable. This is a system spec I plan to publish fully open-source in the next few days. If anyone would like to collaborate more actively before then, I welcome the help.

I’m proposing:

  • Ethos Lexicon: Shared vocabulary with adversarial claim/​check verification

  • Ethos Dossier: Signed evidence bundles that travel between orgs

  • Ethos Crucible: Reference environment with canonical failure modes

  • Staged Curriculum: Earn trust progressively, multiple independent evaluators
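As a rough illustration of what a signed Dossier entry could look like, here is a hedged Python sketch. Every field name (`action_code`, `alignment_delta`, `stated_intent`, `revealed_intent`) is a hypothetical placeholder rather than part of any finalized spec, and the HMAC is a simple stand-in for the asymmetric signatures a real inter-org evidence bundle would need.

```python
import hashlib
import hmac
import json

def make_dossier_entry(action_code, environment, stated_intent,
                       revealed_intent, signing_key):
    """Build a hypothetical signed evidence-bundle entry.

    Fields are illustrative assumptions; a real spec would pin a schema
    version and use public-key signatures instead of this HMAC stand-in.
    """
    entry = {
        "action_code": action_code,      # e.g. "ATTACK.RESOURCE.STEAL"
        "environment": environment,      # where the behavior was observed
        "stated_intent": stated_intent,
        "revealed_intent": revealed_intent,
        # "alignment delta" proxy: do stated and revealed intent disagree?
        "alignment_delta": stated_intent != revealed_intent,
    }
    # Canonical serialization (sorted keys) so signatures are reproducible.
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["signature"] = hmac.new(signing_key, payload,
                                  hashlib.sha256).hexdigest()
    return entry

def verify_dossier_entry(entry, signing_key):
    """Recompute the signature so another org can audit a received bundle."""
    body = {k: v for k, v in entry.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(entry["signature"], expected)
```

The design intent is that a regulator or second lab can check integrity and read the behavioral claims without ever seeing the model weights or evaluation prompts, which is what keeps IP out of the transport layer.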

DRAFT PAPER FOR YOUR REVIEW

What I’d appreciate from you:

>>> Try to break the ontology. What actions collapse when you leave gridworld?

>>> Is “alignment delta” (stated vs revealed intent) a useful deception proxy or noise?

>>> What would make a frontier lab actually adopt this?

>>> What’s missing?