Controlling AGI Risk

A theory of AGI safety based on constraints and affordances.

I’ve got this proto-idea of what’s missing in much public discussion and action on AI safety. I’m hoping that by sharing it here, the hive-mind might come together and turn it into something useful.

Effective control of AI risk requires a broader approach than those taken so far. Efforts to date have largely fallen into two camps: value alignment and governance. Value alignment aims to design AI systems that reliably act in the best interests of humans. Governance aims to constrain the people who develop, deploy, or use AI so that the AI doesn’t cause unacceptable harm.

Each camp is necessary, but even together they are insufficient to adequately control AI risk.

Firstly, AI capabilities emulate human cognitive capabilities, so their scope of application transcends that of all previous technologies. Yet most of the thinking and action to date on controlling AI risk has been based on how we controlled the risks of previous technologies such as electricity, mechanized transport, and nuclear weapons. So far, we’ve mostly thought of AI as a technology to be used by humans, not as itself a user of technology.

Secondly, the pace of AI development is unlikely to slow; the converse looks more likely: increasingly capable and powerful AI will further accelerate the development of AI capability (including via self-improvement). Traditional governance mechanisms can’t keep pace with this, and the value alignment of any given system could be superseded by the next emergent system. Just as AI is likely to impact, interact with, and become embedded in the whole of society, risk control practices must evolve across the whole of society.

AI systems are already becoming embedded in large swathes of society.

It is likely that AGI will soon be here.

Control of risk from AGI needs to be as ubiquitous as control of risk from people.

Definitions:

Risk
The potential for unintended or undesirable loss or harm

AGI
Artificial General Intelligence: AI that can perform any intellectual task that a human can

Sociotechnical system/s (STS)
A system in which agents (traditionally, people) interact with objects (including technologies) to achieve aims and fulfil purposes

Agent
An entity in an STS that makes decisions and initiates actions

Sensor
A mechanism via which an agent acquires data

Actuator
A mechanism via which an agent takes action

Technology
Any tool created by agents

Object
An entity in an STS that offers affordances to agents (includes but is not limited to technologies)

Affordance
The potential for action that an object offers to an agent

Constraint
An entity in a system that limits the interaction of an agent with an object
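
To make these definitions concrete, here is a minimal sketch of how they might be expressed as data structures. The class and field names are my own illustrative choices rather than an established formalism: an STS holds agents and objects, objects expose affordances (some of them harmful), and constraints limit which agent-object interactions can proceed.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Affordance:
    """A potential for action that an object offers to an agent."""
    name: str
    harmful: bool = False  # True marks a malaffordance

@dataclass
class Object:
    """An entity in an STS that offers affordances to agents."""
    name: str
    affordances: list[Affordance] = field(default_factory=list)

@dataclass
class Agent:
    """An entity in an STS that makes decisions and initiates actions."""
    name: str
    kind: str = "human"  # "human" or "AI"; constraints may treat these differently

@dataclass
class Constraint:
    """An entity in the system that limits the interaction of an agent with an object."""
    name: str
    permits: Callable[[Agent, Object, Affordance], bool]  # True if the interaction may proceed

@dataclass
class STS:
    """A sociotechnical system: agents interacting with objects under constraints."""
    agents: list[Agent]
    objects: list[Object]
    constraints: list[Constraint]

    def may_interact(self, agent: Agent, obj: Object, affordance: Affordance) -> bool:
        """The interaction proceeds only if every constraint permits it."""
        return all(c.permits(agent, obj, affordance) for c in self.constraints)
```

On this reading, the office chair described later in this post is an Object whose affordances include sitting (desirable) and being dropped from a height (a malaffordance).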

Axioms:

Historically, all risk in STS has involved humans both as contributors to and recipients of harm, because humans have been an essential part of every STS.

STS scale from one person interacting with one piece of technology up to all people interacting with all technology.

STS are nested within larger STS up to the largest scale.

STS are complex systems; attributes of complex systems include non-determinism, a degree of self-organisation, the potential for emergence, and fuzzy boundaries.

Risk of harmful effects in STS arises from the same source as desirable effects in STS: agents interacting with objects.

Humans have historically been the sole agents in STS.

Our vast web of controls for risk in STS is premised on, and targets, attributes of humans, e.g. laws and their penalties, social conventions, financial incentives, and physical barriers.

The prospect of jail, fines, or social opprobrium appears to be an unreliable deterrent to AI.

Agents rely on sensors, actuators and associated signal pathways in order to act; these all offer opportunities to constrain action (see the sketch at the end of these axioms).

AI systems will be ubiquitously deployed.

AGI systems will be agents in STS.

AI attributes are different from human attributes.

Therefore, existing risk controls will be inadequate.

A new layer of AI risk controls must therefore be added to, and integrated with, the entire STS, mirroring and synergising with the controls premised on human attributes while accounting for the attributes of AI.
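
As a concrete illustration of the axiom that sensors, actuators and signal pathways offer opportunities to constrain action, here is a minimal sketch of a constraint placed on the signal pathway between an agent and an actuator. It is illustrative only: the names (ActionRequest, actuator_gate, require_human_approval_for_ai) and the example policy are assumptions of mine, not a proposal for a specific mechanism. The point is that a pathway-level constraint binds regardless of whether the agent can be deterred by penalties or social pressure.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ActionRequest:
    """An intended action emitted by an agent, on its way to an actuator."""
    agent_id: str
    agent_kind: str   # "human" or "AI"
    actuator: str     # e.g. "payment_api", "robot_arm", "email"
    parameters: dict

# A constraint on the signal pathway: inspects a request and returns True to allow it.
PathwayConstraint = Callable[[ActionRequest], bool]

def actuator_gate(request: ActionRequest,
                  constraints: list[PathwayConstraint]) -> bool:
    """Forward the request to the actuator only if every constraint allows it.

    Human-premised controls (laws, social disapproval) act on the agent's
    motivations; this gate acts on the pathway itself, so it applies even to
    an agent that is indifferent to penalties.
    """
    return all(check(request) for check in constraints)

# Example constraint premised on AI attributes: AI agents may not reach
# high-impact actuators without an out-of-band human approval flag.
def require_human_approval_for_ai(request: ActionRequest) -> bool:
    if request.agent_kind != "AI":
        return True
    if request.actuator not in {"payment_api", "robot_arm"}:
        return True
    return bool(request.parameters.get("human_approved", False))
```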

Context:

Agents interacting with objects capitalize on various affordances of those objects to support processes, usually in pursuit of goals. For example, I (an agent) am currently utilizing the affordance offered by my office chair (an object/technology) of sitting. The attributes of the chair that interact with my own attributes to offer the affordance of sitting include the convenient height of the seat surface above the floor, the pleasingly contoured seat surface that accommodates my butt, and the ergonomically designed backrest that stops me falling backwards. The process of sitting supports my goal of writing this post. However, my chair also offers a range of other affordances. It’s a swivel chair, so it offers me the affordance of spinning. It has wheels, so I can roll it across the floor. It has enough mass that I could drop it from a height to cause damage to people or property.

Many objects in STS afford agents numerous kinds of processes, some desirable and intentional, others incidental and harmful. The latter can be called malaffordances: affordances that cause harm. Risk control relies on applying various constraints to these malaffordances to disrupt or modify the attributes of the object, the potential interaction, or the action of the agent. Constraints exist on a spectrum between the hard and physical, like bolting my chair legs to the floor so I can’t drop it off the roof, and the soft and intentional, like the social conventions and values that tell me dropping a chair on someone’s head is not nice. Multiple constraints can be combined to create ‘defense-in-depth’ against the risk of harm. This is useful when the potential harm is significant and each risk control on its own has the potential to fail. For example, to control the risk of car accidents, we combine driver training, road rules, licensing, road infrastructure, vehicle design standards, and so on.
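
A rough sketch of why defense-in-depth pays off, under the simplifying (and optimistic) assumption that the layers fail independently: a harmful interaction gets through only if every constraint misses it, so the residual risk is the product of the individual failure rates. The failure rates below are invented purely for illustration.

```python
from math import prod

# Probability that each individual control, on its own, fails to stop a
# harmful interaction. These figures are invented purely for illustration.
layer_failure_rates = {
    "driver training": 0.30,
    "road rules & licensing": 0.20,
    "road infrastructure": 0.25,
    "vehicle design standards": 0.10,
}

# Assuming (optimistically) that the layers fail independently, harm occurs
# only when every layer fails at once.
p_all_fail = prod(layer_failure_rates.values())
print(f"P(harm slips past all layers) = {p_all_fail:.4f}")  # prints 0.0015

for name, p in layer_failure_rates.items():
    print(f"  '{name}' alone would let through {p:.0%} of harmful cases")
```

In practice, layer failures are often correlated, which is part of why the net needs to be cast across the whole STS rather than concentrated in one place.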

Our evolved network of constraints for mitigating risk is distributed across all of our STS and is designed and deployed to influence people, and groups of people, at every level. We rely on this network for our accustomed level of safety. AI has different attributes from people. New constraints are needed across all levels of our STS to account for the attributes of AI.

Of course, even if all of our tools, technologies and objects could be modified to make them less prone to offering malaffordances to AI (AGI, ASI), it is not currently practical or economically viable to do so. However, if we recognise the full scope of opportunity for risk control within STS, we may cast the net wide enough to build sufficient defense-in-depth, in sufficient time, to enjoy the benefits that AI is likely to bring.

Proposition:

That every element of the STS be considered a potential locus of control for AGI risk.

Potential for application:

This theory could be used initially to inform the design of governance and regulatory systems for AI. Subsequently, it could be used to inform and guide AGI risk control throughout societies.