«Boundaries» for formalizing an MVP morality

Update: For a better exposition of the same core idea, see Agent membranes and formalizing “safety”.


Here’s one specific way that «boundaries» could directly apply to AI safety.

For context on what “«boundaries»” are, see «Boundaries» and AI safety compilation (also see the tag for this concept).

In this post, I will focus on the way Davidad conceives of using «boundaries» within his Open Agency Architecture (OAA) safety paradigm. Essentially, Davidad’s hope is that «boundaries» can be used to formalize a sort of MVP morality for the first AI systems.

Update: Davidad left a comment endorsing this post. He also later tweeted about it in a Twitter reply.[1]

Why «boundaries»?

So, in an ideal future, we would get CEV alignment in the first AGI.

However, this seems really hard, and it might be easier to get AI x-risk off the table first (thus ending the “acute risk period”), and then figure out how to do the rest of alignment later.[2]

In which case, we don’t actually need the first AGI to understand all of human values/ethics; we only need it to understand a minimal subset that ensures safety.

But which subset? And how could it be formalized in a consistent manner?

This is where the concept of «boundaries» comes in, because the concept has two nice properties:

  1. «boundaries» seem to explain what’s bad about a range of actions whose badness is otherwise hard to pin down.

  2. «boundaries» seem possible to formalize algorithmically.

The hope, then, is that the «boundaries» concept could be formalized into a sort of MVP morality for the first AI system(s).

Concretely, one way Davidad envisions implementing «boundaries» is by tasking an AI system to minimize the occurrence of ~objective «boundary» violations for its citizens.

That said, I disagree with such an implementation and I will propose an alternative in another post.

Also related: Acausal normalcy

Quotes from Davidad that support this view

(All bolding below is mine.)

Davidad tweeted in 2022 Aug:

Post-acute-risk-period, I think there ought to be a “night watchman Singleton”: an AGI which technically satisfies Bostrom’s definition of a Singleton, but which does no more and no less than ensuring a baseline level of security for its citizens (which may include humans & AIs).

next tweet:

If and only if a night-watchman singleton is in place, then everyone can have their own AI if they want. The night-watchman will ensure they can’t go to war. The price of this is that if the night-watchman ever suffers a robustness failure it’s game [over].

later in the thread:

The utility function of a night-watchman singleton is the minimum over all citizens of the extent to which their «boundaries» are violated (with violations being negative and no violations being zero) and the extent to which they fall short of baseline access to natural resources
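
To make that verbal description concrete, here is one way to write it down. This is a sketch in my own notation, not Davidad’s; I’m assuming both quantities are expressed as nonpositive numbers, with 0 meaning “no violation” / “no shortfall”:

$$U_{\text{watchman}} \;=\; \min_{i \in \text{citizens}} \min\big(B_i,\; R_i\big), \qquad B_i \le 0,\ \ R_i \le 0$$

where $B_i$ is the extent to which citizen $i$’s «boundaries» are violated and $R_i$ is the extent to which citizen $i$ falls short of baseline access to natural resources. Maximizing $U_{\text{watchman}}$ then means raising the worst-off citizen’s worst term toward 0, rather than trading one citizen’s violations off against another citizen’s gains.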

Davidad in AI Neorealism: a threat model & success criterion for existential safety (2022 Dec):

For me the core question of existential safety is this:

It is not, for example, “how can we build an AI that is aligned with human values, including all that is good and beautiful?” or “how can we build an AI that optimises the world for whatever the operators actually specified?” Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).

Davidad in An Open Agency Architecture for Safe Transformative AI (2022 Dec):

  • Deontic Sufficiency Hypothesis: There exists a human-understandable set of features of finite trajectories in such a world-model, taking values in , such that we can be reasonably confident that all these features being near 0 implies high probability of existential safety, and such that saturating them at 0 is feasible[2] with high probability, using scientifically-accessible technologies.
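
A rough paraphrase of that hypothesis in symbols (my own rendering, not the original post’s; I’m assuming the features take nonpositive values with 0 meaning “no violation”, matching the night-watchman convention above):

$$\exists\ \text{human-understandable } \phi_1, \dots, \phi_k : \text{finite trajectories} \to \mathbb{R}_{\le 0} \ \text{ such that:}$$

  1. we can be reasonably confident that $\phi_j(\tau) \approx 0$ for all $j$ implies a high probability of existential safety, and

  2. keeping every $\phi_j(\tau)$ at 0 is feasible with high probability, using scientifically accessible technologies.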

Also see this tweet from Davidad in 2023 Feb:

In the situation where new powerful AIs with alien minds may arise (if not just between humans), I believe that a “night watchman” which can credibly threaten force is necessary, although perhaps all it should do is to defend such boundaries (including those of aggressors).

Further explanation of the OAA’s Deontic Sufficiency Hypothesis in Davidad’s Bold Plan for Alignment: An In-Depth Explanation (2023 Apr) by Charbel-Raphaël and Gabin:

Getting traction on the deontic feasibility hypothesis

Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don’t die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don’t die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.
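
For readers who haven’t seen the formalism: a Markov blanket $b$ for a set of internal variables $\mu$ (e.g. the states inside an organism) is a set of interface variables (roughly, its sensors and actuators) that screens $\mu$ off from the external variables $\eta$. In my notation, the standard condition is:

$$p(\mu, \eta \mid b) \;=\; p(\mu \mid b)\, p(\eta \mid b)$$

On this reading, “not crossing boundary lines” would cash out as: the AI’s plans may only influence a person’s internal states via that person’s blanket (their senses and voluntary actions), never by bypassing it. That is also why the quote concedes that purely internal processes like cancer fall outside what this desideratum can protect against.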

Also:

  • (*) Elicitors: Language models assist humans in expressing their desires using the formal language of the world model. […] Davidad proposes to represent most of these desiderata as violations of Markov blankets. Most of those desiderata are formulated as negative constraints because we just want to avoid a catastrophe, not solve the full value problem. But some of the desiderata will represent the pivotal process that we want the model to accomplish.

(The post also explains that the “(*)” prefix means “Important”, as distinct from “not essential”.)

This comment by Davidad (2023 Jan):

Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.

From Reframing inner alignment by Davidad (2022 Dec):

I’m also excited about Boundaries as a tool for specifying a core safety property to model-check policies against—one which would imply (at least) nonfatality—relative to alien and shifting predictive ontologies.
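
To give a feel for what “model-checking policies against a boundary property” could look like, here is a toy sketch in Python. It is entirely my own construction (none of these names come from OAA); it just shows the shape of the idea: given a world-model with nondeterministic dynamics, exhaustively check whether a candidate policy can ever reach a state that violates a boundary predicate.

```python
from collections import deque

# Toy world-model: the agent sits on a 1-D track of cells 0..5.
# Cells 4 and 5 belong to a human; entering them counts as a «boundary» violation.
# Nondeterminism: after each agent move, "wind" may push the agent a further +0 or +1 cells.

BOUNDARY_CELLS = {4, 5}

def successors(pos, action):
    """All possible next positions after the agent moves by `action` (-1, 0, or +1) and wind acts."""
    moved = max(0, min(5, pos + action))
    return {max(0, min(5, moved + wind)) for wind in (0, 1)}

def boundary_ok(pos):
    return pos not in BOUNDARY_CELLS

def cautious_policy(pos):
    """Policy under test: advance while far from the boundary, retreat once wind could matter."""
    return 1 if pos < 2 else -1

def model_check(policy, initial_pos=0, horizon=20):
    """Exhaustively explore the states reachable under `policy`; return a counterexample or None."""
    frontier = deque([(initial_pos, 0)])
    seen = {initial_pos}
    while frontier:
        pos, t = frontier.popleft()
        if not boundary_ok(pos):
            return pos                      # a reachable boundary violation
        if t == horizon:
            continue
        for nxt in successors(pos, policy(pos)):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, t + 1))
    return None                             # the boundary property holds on every reachable state

if __name__ == "__main__":
    print(model_check(cautious_policy))                     # None: verified safe
    print(model_check(lambda pos: 1 if pos < 3 else -1))    # 4: a greedier policy can be blown in
```

Note that the check treats the policy as a black box: the verdict depends only on the world-model, the policy’s outputs, and the boundary predicate.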

From A list of core AI safety problems and how I hope to solve them (2023 Aug):

9. Humans cannot be first-class parties to a superintelligence values handshake.

[…]

OAA Solution: (9.1) Instead of becoming parties to a values handshake, keep superintelligent capabilities in a box and only extract plans that solve bounded tasks for finite time horizons and verifiably satisfy safety criteria that include not violating the natural boundaries of humans. This can all work without humans ever being terminally valued by AI systems as ends in themselves.
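
The shape of that solution, reduced to a toy sketch (the names and types here are hypothetical, not OAA’s actual interface, and “verification” is caricatured as evaluating scoring functions rather than formally checking a plan against a world-model): the superintelligent planner is only ever asked for plans, and a separate, simple gate releases a plan only if it is bounded in horizon and passes every boundary-style safety criterion.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Plan:
    actions: Sequence[str]   # finite sequence of actions for a bounded task
    horizon: int             # number of time steps the plan runs for

# A safety criterion scores a plan with a nonpositive number:
# 0 means "no violation detected", below 0 is the estimated severity of a boundary violation.
SafetyCriterion = Callable[[Plan], float]

def extract_plan(boxed_planner: Callable[[str], Plan],
                 task: str,
                 criteria: Sequence[SafetyCriterion],
                 max_horizon: int) -> Optional[Plan]:
    """Query the boxed planner, but release its plan only if it is bounded and passes every check."""
    plan = boxed_planner(task)                       # the only output channel from the box is a plan
    if plan.horizon > max_horizon:
        return None                                  # reject plans that are not bounded, finite-horizon tasks
    if any(criterion(plan) < 0 for criterion in criteria):
        return None                                  # reject any plan with a detected boundary violation
    return plan                                      # only verified plans ever leave the pipeline
```

Nothing in this loop requires the planner to value humans; the safety burden sits entirely on the criteria and the gate, which mirrors the quote’s point that “this can all work without humans ever being terminally valued by AI systems as ends in themselves.”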

  1. ^

    FWIW he left this comment before I simplified this post a lot on 2023 Sept 15.

  2. ^