Master plan spec: needs audit (logic and cooperative AI)

TLDR: I think synthesizing bargaining protocols with pleasing developer experience, high-quality “guarantees” (in the formal verification sense), and low cost will help us a lot in a multipolar takeoff, and I’d like to submit my worldview for a community audit.

Tremendous thanks to the attendees of Alignable Structures; I cultivated the self-confidence I needed to dive into this research agenda as a direct result of the vibes that weekend.

This document does not endeavor to be a thorough contribution in itself; it is written quickly in spare moments between papers and textbooks. It should be clear to you what I’ve read, and I’m seeking your help in prioritizing what else I should read in the prioritization / theory of change / inside view category.

  • First, I motivate: I express some takeoff beliefs. If you already think that multipolar scenarios are important, or you’re not interested in changing my mind about this, skip this section.

  • Then, I pontificate: what products would we like to be on the market at crunch time?

  • Subsequently, I review: what progress has been made, on whose shoulders am I standing? This is the object-level technical section.

  • Finally, I redteam, though I hope to be outperformed by the commenters.

In the interest of keeping the post short, I’m omitting many details about why my worldview is where it is, so I encourage you to prod me about specific things you’d like to see elaboration on in the comments.

Takeoff geometry

Claim: I consider takeoff as a unit cube in three dimensions, with each axis running from −1 to 1. Research agendas ought to locate themselves clearly within it, because theories of change are betting on which world we’ll end up in.

I omit continuity, because I can never think of any actionable insight that would result from a honed forecast of that one. Arguably, agents vs. services is another axis to play with, but perhaps another post. So you may think a unit cube in 4 or 5 dimensions is more appropriate or complete.

-polarity

Takeoff comes in unipolar and multipolar flavors. When I say a scenario is strongly unipolar, I mean it consists mostly of the classical Yudkowsky-Bostrom content, i.e. recursive self-improvement leading to clear dominance by one agent (a singleton), regardless of any alignment properties it lacks. The strongly multipolar scenario is very chaotic, with a ton of AIs flying around.

-geneity

Takeoff comes in homogenous and heterogenous flavors. In the original post, Evan defines the terms:

how similar are the different AIs that get deployed in that scenario likely to be? If there is only one AI, or many copies of the same AI, then you get a very homogenous takeoff, whereas if there are many different AIs trained via very different training regimes, then you get a heterogenous takeoff.

And he forecasts a homogenous takeoff.

speed

Takeoff can be fast or slow. I set the extremely fast end at ramping from wherever we are now (end of 2022) to transformative AI by 2025, and the extremely slow end at ramping to transformative AI by 2055, which is Ajeya2020’s 60% number plus 5 years.

Forecasting by giving a solid

I give a table of the sheets at the extreme end of each scenario type. I.e., the sheet (−1, y, z) corresponds to a high probability of strongly unipolar takeoff. Implicitly, each coordinate blends strength and probability: I presume that a lower probability of a strongly unipolar takeoff corresponds with a higher probability of a weakly unipolar takeoff, and so on.

| Sheet of unit cube | Takeoff scenario type |
| --- | --- |
| (−1, y, z) | unipolar |
| (1, y, z) | multipolar |
| (x, −1, z) | homogenous |
| (x, 1, z) | heterogenous |
| (x, y, −1) | fast |
| (x, y, 1) | slow |


Then, I can eyeball the class of forecasts of scenarios for which I expect my stated research goals to be helpful as a region (a solid) of the cube.

In other words,

  • I think strongly unipolar takeoff isn’t likely at all

  • I think we’re quite likely indeed to have a very multipolar scenario

  • I expect the -geneity axis to avoid extremes, but besides that I’m not terribly opinionated. In the interest of keeping the post short, I’m omitting details about my divergence from Evan.

  • The confidence interval I’d like to work with for TAI is 12-29 years from now (2035 to 2052)
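
One loose way to write that region down as a predicate (a minimal sketch; the thresholds are my own illustrative round numbers rather than precise forecasts, and the linear reading of the speed axis is an assumption):

```python
def in_my_target_region(x: float, y: float, z: float) -> bool:
    """Is a takeoff forecast, read as a point in the cube, one this agenda is aimed at?

    Axes: x runs from -1 (strongly unipolar) to +1 (strongly multipolar),
    y from -1 (homogenous) to +1 (heterogenous),
    z from -1 (fast, TAI ~2025) to +1 (slow, TAI ~2055).
    The thresholds below are illustrative round numbers, not precise forecasts.
    """
    multipolar_enough = x > 0.0            # strongly unipolar mostly ruled out
    geneity_not_extreme = abs(y) < 0.8     # no strong view, just avoid the corners
    timeline_in_window = -0.35 < z < 0.8   # roughly TAI 2035-2052 under a linear reading of the axis
    return multipolar_enough and geneity_not_extreme and timeline_in_window
```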

Multi-multi delegation

I really like Critch & Krueger 2020, in which “alignment” is problematized precisely because “aligning multiple agents to multiple principals” is an incredibly confused sentence. Delegation (which specifically means comprehension, instruction, and control) is an improvement.

Claim: Most of the movement’s research portfolio is in single-single delegation

Claim: Single-single is a critical and difficult warmup task, but multi-multi is clearly the real thing

The research portfolio also needs infrastructure to build out connections between fields. In particular, I’m excited about interp (like searching for search) and agent foundations (like (A → B) → A) finding common ground (specifically a feedback loop of observing terms and forecasting types), and I mainly expect cooperative AI to factor in because I don’t think there’s a reasonable story you can tell about agency that does not account for all the other agents in the world. To pick on agent foundations in particular: the spherical elephant in the room has gotta be that the field focuses on one agent at a time, like it’s alone on a planet playing Factorio. I feel like that’s a critical assumption in embedded agency research and things like that.

TLDR: what I broadly think crunch time looks like

I expect that crunch time looks like 5-50 institutions, competing and trading with one another. I’m uncertain if there will be runaway advantages such that unipolar-like dominance eventually emerges, so it’s plausible that multipolar liftoff is a temporary state, but in that scenario, crucially, our collective capacities for bargaining, social choice, trading, etc. will be an extremely influential variable (one we can wiggle now!) on how that eventual dominance ends up.

What products would we like to be on the market at crunch time?

I’m painting very broadly about scenarios I see emerging from about 2028 to TAI. The obvious hope is that my product ideas are robust to wide error bars!

| Scenario | A nice-to-have product for that scenario | What needs to be in place for that product to be viable |
| --- | --- | --- |
| The boards of corporations A and B are negotiating a merger, and have delegated most of the work to an executive assistant service. | Synthetic bargaining subroutines with free assurances that the services will act in a predictable manner. | Proof-assistant-like tooling honed for properties of agents. |
| The leaders of country X are threatening to invade country Z, and country Y is privately committed to defending Z. Civilian and military leadership of X and Y are each augmented by assistants which are plugged into autonomous weapons systems. | A specification language to describe the stakes of the scenario (very loosely) as a “game”, and a calculator to recommend policies for scenarios that abide by the specification language. | A semantic account of games and strategies that plays nicely with algorithmic game theory. |
| An agent is the custodian of a small civilization of ems or digital minds, which values purple hats. A DAO populated by human principals would like to utilize that civilization’s computronium, which they’re willing to play along with if members of the DAO agree to wear lots of purple hats. The agent is simulating the DAO at some fixed resolution, and the agent’s source code is known to the DAO. | A tooling stack for taking a bargaining subroutine and generating audits, interpretations, and assurances about how that subroutine will behave (this would build up confidence between the two parties). | Meaningful asks for what an audit of a bargaining subroutine should reveal, theoretical foundations for useful interpretation, etc. |

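To make the second row of that table slightly more concrete, here is a minimal sketch of what the “specification plus calculator” loop might feel like. The action names, payoff numbers, and the `pure_nash` helper are all made up for illustration; this is not a proposed design.

```python
from itertools import product

# Toy spec: country X considers invading, country Y considers defending.
# Payoff numbers are invented purely for illustration (higher is better).
spec = {
    "players": ["X", "Y"],
    "actions": {"X": ["invade", "hold_back"], "Y": ["defend", "concede"]},
    "payoffs": {  # (X action, Y action) -> (X payoff, Y payoff)
        ("invade", "defend"): (-5, -5),
        ("invade", "concede"): (3, -3),
        ("hold_back", "defend"): (0, 1),
        ("hold_back", "concede"): (0, 0),
    },
}

def pure_nash(spec):
    """Enumerate pure-strategy Nash equilibria of a two-player spec by brute force."""
    ax, ay = spec["actions"]["X"], spec["actions"]["Y"]
    pay = spec["payoffs"]
    equilibria = []
    for a, b in product(ax, ay):
        ux, uy = pay[(a, b)]
        x_happy = all(ux >= pay[(a2, b)][0] for a2 in ax)  # X has no better deviation
        y_happy = all(uy >= pay[(a, b2)][1] for b2 in ay)  # Y has no better deviation
        if x_happy and y_happy:
            equilibria.append((a, b))
    return equilibria

print(pure_nash(spec))  # [('invade', 'concede'), ('hold_back', 'defend')] with these made-up numbers
```

Even this toy surfaces the classic deterrence structure, with both a capitulation equilibrium and a deterrence equilibrium, which hints that “recommend a policy” has to mean more than “enumerate equilibria.”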

At least two of these scenarios seem explicitly CAIS-like, so you may be wondering if I even believe in agency. My answer is that I think CAIS-like scenarios necessarily precede agents-running-amok scenarios, and furthermore, we don’t have actionable information right now about what to do once agents start running amok. In other words, I’m targeting my interventions toward crunch time, not toward glorious transhuman future or death. The prologue to an agent-based takeoff, if one is to occur, will provide us with better information about how to deal.

TLDR: it seems like a path forward for interventions in this class of scenarios is to build expertise around every logical approach to game theory. Questions like “what programming language has mixed strategies as its interpretation?” or “what would it feel like to write programs in a language whose terms are mixed strategies?” lead me to tentatively call this “semantic game theory”, though I suspect the open games literature (see below) may be a few steps ahead of me.

Object-level technical content

On whose shoulders are we standing?

Program equilibrium and modal combat

Open-source game theory (OSGT), that is, game theory where players can read each other’s source code, turns out to be an application of modal logic.
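
To gesture at the genre with a runnable toy: the sketch below is the syntactic-equality “CliqueBot” style of program equilibrium (à la Tennenholtz), not the Löbian proof search of modal combat, and the function names are mine.

```python
import inspect

# Open-source prisoner's dilemma: a player is a function that reads the
# opponent's source code and returns "C" (cooperate) or "D" (defect).

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent's source is character-for-character my own."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect."""
    return "D"

def play(p1, p2):
    """Hand each program the other's source and collect the moves."""
    s1, s2 = inspect.getsource(p1), inspect.getsource(p2)
    return p1(s2), p2(s1)

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation against an exact copy
print(play(clique_bot, defect_bot))  # ('D', 'D')
```

The modal results replace the brittle equality check with provability (“cooperate if I can prove you cooperate with me”), which is where the modal logic earns its keep.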

Compositional game theory /​ open games

The applied category theory community provides a fully categorical story of arbitrary scenarios from classical game theory. Game theorists in the alignment community don’t appear to have pursued an understanding of this story, unless I’ve missed something.
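
For a flavor of the compositional idea, here is a drastically simplified sketch: a real open game also carries a best-response relation and lives in a category built from lenses, none of which is modeled below, and all the names are mine.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class ToyOpenGame:
    """A stripped-down 'open game': a strategy set, a play map, and a coutility map."""
    strategies: List[Any]
    play: Callable[[Any, Any], Any]            # (strategy, observation) -> choice
    coutility: Callable[[Any, Any, Any], Any]  # (strategy, observation, utility coming back) -> utility passed further back

def then(g: ToyOpenGame, h: ToyOpenGame) -> ToyOpenGame:
    """Sequential composition: g's choice becomes h's observation, and utility
    flows backwards through h and then through g."""
    return ToyOpenGame(
        strategies=[(sg, sh) for sg in g.strategies for sh in h.strategies],
        play=lambda s, x: h.play(s[1], g.play(s[0], x)),
        coutility=lambda s, x, r: g.coutility(
            s[0], x, h.coutility(s[1], g.play(s[0], x), r)
        ),
    )
```

The selling point is that the composite is again an open game, so large interactions can be assembled from small pieces and reasoned about piecewise; that is exactly the property a graphical proof assistant for string diagrams (mentioned in the projects list below) would want to exploit.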

Domain theory

Nisan identified domain theory as a direction for getting results in OSGT with recursive beliefs.

Plausibly helpful directions not explicitly a part of my master plan right now

  • Add open source to open games

  • Give a formalization of “open source to a matter of degree” or partially translucent source code. Can we prepare for scenarios where an AI is simulating their opponent to only 60% of the resolution needed for the open source results to kick in? Can we preserve any neat bargaining properties under sensitive/​private codebases?

  • Beyond game theory to social choice theory

  • Deploy the zero-knowledge literature to enable stakeholders to credibly signal the game-theoretic properties of their software even if the actual implementation is completely sensitive/​private.

Projects I’m in early stages of

  • Syntax, operational semantics, and denotational semantics of a programming language for specifying or reasoning about games, players, and strategies. This would build the foundation for a proof assistant that game theorists could use for propositions that sound like “do mixed strategies (x, y) form a Nash equilibrium for game G?” (see probabilistic semantics); a toy numerical version of that check is sketched just after this list. That would in turn be a strong foundation for calculators that can answer questions like “against player x on game G, what strategy ought you play?” (see algorithmic game theory), which, if built on a type-theoretic foundation, would lend at least a little bit to player synthesis that could write programs of guaranteed correctness with respect to cooperative properties you’re interested in. (Along the way, I would also try OSGT proof assistants.)

  • Probing my curiosity about combinations of modal logic, bisimulation, and simplicial complexes, just to verify that the modal combat people didn’t miss anything useful in that confluence.

  • Checking if anything is useful in the story of open games, and if it is, writing a graphical proof assistant for string diagrams. This would go hand in hand with some of the benefits of the tooling in the first bullet point.

  • Executable textbook for game theory.
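
As promised in the first bullet above, here is a minimal numerical sketch of the kind of proposition such a proof assistant would discharge. It leans on the standard fact that a mixed profile is a Nash equilibrium iff neither player has a profitable pure-strategy deviation; the function name and the numpy implementation are mine, purely illustrative, not a proposed design.

```python
import numpy as np

def is_nash(A, B, x, y, tol=1e-9):
    """Do mixed strategies (x, y) form a Nash equilibrium of the bimatrix game (A, B)?

    A is the row player's payoff matrix, B is the column player's. A profile is
    an equilibrium iff neither player has a pure-strategy deviation that beats
    their current expected payoff.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    row_payoff = x @ A @ y
    col_payoff = x @ B @ y
    best_row_deviation = (A @ y).max()  # row player's best pure response to y
    best_col_deviation = (x @ B).max()  # column player's best pure response to x
    return row_payoff >= best_row_deviation - tol and col_payoff >= best_col_deviation - tol

# Matching pennies: the unique equilibrium mixes 50/50 for both players.
A = np.array([[1, -1], [-1, 1]])
B = -A
print(is_nash(A, B, [0.5, 0.5], [0.5, 0.5]))  # True
print(is_nash(A, B, [1.0, 0.0], [0.5, 0.5]))  # False
```

A proof-assistant version would replace the floating-point check with exact arithmetic and emit a certificate; the calculator version would search for (x, y) rather than merely check a candidate.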

I’m also putting a couple of days into stuff that’s more Wentworth or Garrabrant style, and I plan to take a week off sometime after Christmas to see how many preliminary sketches I can make for a heuristic argument assistant tool stack.

Redteam

  1. My follow-through: I’m a bad employee in terms of consistency and dependability, and much of that would apply to independent research. I kick ass for stretches then crash (3 to 5 months of asskicking per one month burned out).

  2. I’m sniped by the areas of math I’m most aesthetically attracted to, and I may be constructing a 300 IQ plan with a bajillion 4D chess moves to rationalize working on them.

  3. My motivating takeoff scenarios are too specific, and my research outputs won’t be robust to the actual error bars. TODO: think more about what cooperative AI research’s value prop looks like in strongly unipolar scenarios.

  4. I couldn’t actually get users to deploy synthesized bargaining protocols to prod.

  5. There is an easier way than what I suggest to get auditable, interpretable, and reliable interaction mechanisms between AIs.