AI alignment concepts: philosophical breakers, stoppers, and distorters

Meta: This is one of my abstract existential risk strategy concept posts, which are designed to offer different perspectives or foundations upon which to build further.


When thinking about philosophy, one may encounter philosophical breakers, philosophical stoppers, and philosophical distorters: thoughts or ideas that cause an agent (such as an AI) to break, get stuck, or take a random action, respectively. They are philosophical crises for that agent (and can in theory sometimes be information hazards). For some less severe human examples, see this recent post on reality masking puzzles. In AI, examples of breakers, stoppers, and distorters are logical contradictions (in some symbolic AIs), inability to generalize from examples, and mesa optimizers, respectively.

Philosophical breakers, stoppers, and distorters pose both possible problems and opportunities for building safe and aligned AGI and for preventing unaligned AGI from becoming dangerous. They may be encountered or solved through explicit philosophy, implicitly as part of developing another field (like mathematics or AI), by accident, or by trial and error. An awareness of the idea of philosophical breakers, stoppers, and distorters provides another complementary perspective for solving AGI safety and may prompt the generation of new safety strategies and AGI designs (see also this complementary strategy post on safety regulators).

Concept definitions

Philosophical breakers:

  • Philosophical thoughts and questions that cause an agent to break or otherwise take a lot of damage, and that are hard for that agent to anticipate beforehand.

Philosophical stoppers:

  • Philosophical thoughts and questions that cause an agent to get stuck in an important way, and that are hard for that agent to anticipate beforehand.

Philosophical distorters:

  • Philosophical thoughts and questions that cause an agent to replace the philosophical answer it was using earlier (possibly implicitly) with a random or changed one. An example in the field of AGI alignment would be something that causes an aligned AGI to, in some sense, randomly choose its utility function to be paperclip maximizing because of an ontological crisis.

Concepts providing context, generalization, and contrast

Thought breakers, stoppers, and distorters:

  • Generalizations of the philosophical versions that cover thoughts and questions in general, such as a thought that would cause an agent to halt, implementing algorithms in buggy ways, deep meditative realizations, self-reprogramming that causes unexpected failures, getting stuck in a thought loop, and so on, all of which are hard for that agent to anticipate beforehand.

System breakers, stoppers, and distorters:

  • A further generalization that also includes system environment and architecture problems. For instance, a system's environment could be noisy or full of hackers or adversarial examples, and its architecture could involve genetic algorithms.

Threats vs breakers, stoppers, and distorters:

  • Generalizations of breakers, stoppers, and distorters that also include things that are easy for that agent to anticipate beforehand.

Viewpoints: The agent's viewpoint and an external viewpoint.
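As a rough illustration, the distinctions above can be written down as a small data model. This is only a sketch under my own assumptions: the names `Effect`, `Scope`, `Hazard`, and `label` are hypothetical, and treating "easy to anticipate" as a single boolean judged from the agent's viewpoint is a simplification.

```python
from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    BREAKER = "breaker"      # the agent breaks or takes a lot of damage
    STOPPER = "stopper"      # the agent gets stuck in an important way
    DISTORTER = "distorter"  # the agent's answer becomes random or changed


class Scope(Enum):
    # Ordered from narrowest to broadest, following the generalizations above.
    PHILOSOPHICAL = 1
    THOUGHT = 2
    SYSTEM = 3


@dataclass(frozen=True)
class Hazard:
    name: str
    effects: frozenset          # one or more Effect values
    scope: Scope
    anticipated_by_agent: bool  # assumption: judged from the agent's viewpoint

    def label(self) -> str:
        kinds = "/".join(sorted(e.value for e in self.effects))
        if self.anticipated_by_agent:
            # Easy to anticipate, so only the broader "threat" category applies.
            return f"threat ({kinds})"
        return f"{self.scope.name.lower()} {kinds}"


# Two examples taken from the text above.
crisis = Hazard("ontological crisis", frozenset({Effect.DISTORTER}),
                Scope.PHILOSOPHICAL, anticipated_by_agent=False)
loop = Hazard("thought loop", frozenset({Effect.STOPPER}),
              Scope.THOUGHT, anticipated_by_agent=True)
print(crisis.label())  # philosophical distorter
print(loop.label())    # threat (stopper)
```

An external viewpoint could be modeled by evaluating the same hazard with a different value of `anticipated_by_agent`, since an outside observer may foresee what the agent cannot.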

Application domains

The natural places to use these concepts are philosophical inquiry, the philosophical parts of mathematics or physics, and AGI alignment.

Concept consequences

If there is a philosophical breaker or stopper for an AGI undergoing self-improvement into a superintelligence, and it isn't a problem for humans or it's one that we've already passed through, then by not disarming it for that AGI we are leaving a barrier to its development in place (a trivial example is general intelligence, which isn't a problem for humans). This can be thought of as a safety method. Such problems can either arise naturally as consequences of an AGI design, or an AGI may be designed to encounter them if it undergoes autonomous self-improvement.

If there is a philosophical distorter in front of a safe and aligned AGI, we'll need to disarm it either by changing the AGI's code/architecture or by making the AGI aware of it in such a way that it can avoid it. We could, for instance, hard-code an answer, or we could point out some philosophical investigations as things to avoid until it is more sophisticated.

How capable an agent may become, and how fast it reaches that capability, will partially depend on the philosophical breakers and stoppers it encounters. If the agent has a better ability to search for and disarm them, then it can go further without breaking or stopping.

How safe and aligned an agent is will partially be a function of the philosophical distorters it encounters (which in turn partially depends on its ability to search for them and disarm them).

Many philosophical breakers and stoppers are also philosophical distorters. For instance, if a system gets stuck when generalizing beyond a certain point, it may rely on evolution instead. In this case we must think more carefully about disarming philosophical breakers and stoppers. If a safe and aligned AGI encounters a philosophical distorter, it is probably not safe and aligned anymore. But if an unaligned AGI encounters a philosophical stopper or breaker, it may be prevented from going further. In some sense, an AGI cannot ever be fully safe and aligned if it will, upon autonomous self-improvement, encounter a philosophical distorter.

A proposed general AGI safety strategy with respect to philosophical breakers, stoppers, and distorters:

  1. First, design and implement a safe and aligned AGI (safe up to residual philosophical distorters). If the AGI isn't safe and aligned, then proceed no further until you have one that is.

  2. Then, remove philosophical distorters that are not philosophical breakers or stoppers.

  3. Then, remove philosophical distorters that are philosophical breakers or stoppers.

  4. And finally, remove philosophical breakers and stoppers.
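The ordering in this strategy can be sketched as a small prioritization function. This is a minimal illustration under my own assumptions, not an implementation of anything: `Hazard`, `removal_order`, and the two boolean flags are hypothetical names, and deciding whether an AGI is "safe and aligned" is reduced to a single boolean that the caller must supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    name: str
    distorter: bool
    breaker_or_stopper: bool


def removal_order(agi_is_safe_and_aligned, hazards):
    """Return (step, name) pairs in the order the strategy removes hazards."""
    # Step 1: proceed no further until the AGI is safe and aligned.
    if not agi_is_safe_and_aligned:
        return []

    def step(h):
        if h.distorter and not h.breaker_or_stopper:
            return 2  # distorters that are not breakers or stoppers
        if h.distorter and h.breaker_or_stopper:
            return 3  # distorters that are also breakers or stoppers
        return 4      # remaining breakers and stoppers

    # Ignore entries that are neither distorters nor breakers/stoppers.
    relevant = [h for h in hazards if h.distorter or h.breaker_or_stopper]
    # sorted() is stable, so hazards within a step keep their input order.
    return [(step(h), h.name) for h in sorted(relevant, key=step)]


hazards = [
    Hazard("B", distorter=False, breaker_or_stopper=True),
    Hazard("DB", distorter=True, breaker_or_stopper=True),
    Hazard("D", distorter=True, breaker_or_stopper=False),
]
print(removal_order(True, hazards))   # [(2, 'D'), (3, 'DB'), (4, 'B')]
print(removal_order(False, hazards))  # []
```

The point of the ordering is that pure breakers and stoppers are removed last, since for an AGI that is not yet known to be safe they may be acting as barriers.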