Conceptual Analysis for AI Alignment

TL;DR: Conceptual analysis is highly relevant for AI alignment, and is also a way in which someone with less technical skill can contribute to alignment research. This suggests there should be at least one person working full-time on reviewing the existing philosophy literature for relevant insights, and summarizing and synthesizing these results for the safety community.

There are certain “primitive concepts” that we are able to express in mathematics, and it is relatively straightforward to program AIs to deal with those things. Naively, alignment requires understanding *all* morally significant human concepts, which seems daunting. However, the “argument from corrigibility” suggests that there may be small sets of human concepts which, if properly understood, are sufficient for “benignment”. We should seek to identify what these concepts are, and make a best effort to perform thorough and reductive conceptual analyses on them. But we should also look at what has already been done!
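To make the contrast concrete, here is a minimal sketch (the example and names, e.g. `expected_return` and `is_corrigible`, are my own hypothetical illustration, not anything established): a concept with a precise mathematical core can be coded directly, while an unanalyzed human concept cannot.

```python
# Illustrative toy example: concepts with a clean mathematical core are easy to
# hand to an AI system; most safety-relevant human concepts are not.

from typing import Sequence


def expected_return(rewards: Sequence[float], discount: float = 0.99) -> float:
    """'Discounted return' has a precise mathematical core, so coding it is trivial."""
    return sum(r * discount**t for t, r in enumerate(rewards))


def is_corrigible(policy) -> bool:
    """'Corrigibility' is a human concept with no agreed-upon formalization;
    we cannot fill in this body without first doing the conceptual analysis
    (or falling back on machine learning)."""
    raise NotImplementedError("no formal definition of corrigibility available")
```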

On the coherence of human concepts

For human concepts which *haven’t* been formalized, it’s unclear whether there is a simple “coherent core” to the concept. Careful analysis may also reveal that there are several coherent concepts worth distinguishing, e.g. cardinal vs. ordinal numbers. If we find there is a coherent core, we can attempt to build algorithms around it.
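As a toy illustration of one informal concept splitting into several coherent ones (my own example, nothing deep): the cardinal and ordinal readings of “number” agree on finite collections, but they are distinct concepts supporting different operations.

```python
# Toy illustration: the informal concept "number" splits, on analysis, into
# (at least) two coherent formal concepts.

def cardinal_size(xs) -> int:
    """Cardinal reading: how many distinct elements there are (order irrelevant)."""
    return len(set(xs))


def ordinal_position(xs, x) -> int:
    """Ordinal reading: where x falls in a given ordering (order essential)."""
    return sorted(xs).index(x) + 1


items = [30, 10, 20]
assert cardinal_size(items) == 3         # "there are three of them"
assert ordinal_position(items, 30) == 3  # "30 comes third"
```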

If there isn’t a simple coherent core, there may be a more complex one, or it may be that the concept just isn’t coherent (i.e. that it’s the product of a confused way of thinking). Either way, in the near term we’d probably have to use machine learning if we wanted to include these concepts in our AI’s lexicon.
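Here is a minimal sketch of what the machine-learning route might look like in practice (my own toy setup; the features, the labels, and the concept being learned are stand-ins, and scikit-learn is just a convenient choice):

```python
# Minimal sketch of the "learn it instead" route: if we cannot formalize a
# concept, approximate it with a classifier trained on human judgments.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: feature vectors describing situations, with labels standing
# in for human judgments about whether the concept applies.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                   # situation features (stand-in)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in for human labels

concept_model = LogisticRegression().fit(X, y)


def concept_applies(situation_features: np.ndarray) -> bool:
    """Learned, approximate stand-in for the unformalized concept."""
    return bool(concept_model.predict(situation_features.reshape(1, -1))[0])
```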

A serious attempt at conceptual analysis could help us decide whether we should attempt to learn or formalize a concept.

Concretely, I imagine a project around this with the following stages (each yielding at least one publication):

1) A “brainstormy” document which attempts to enumerate all the concepts that are relevant to safety and presents the arguments for their specific relevance and their relation to other relevant concepts. This should also specifically indicate how a combination of concepts, if rigorously analyzed, could be sufficient for “benignment”, along the lines of the argument from corrigibility. Besides corrigibility, two examples that jump to mind are “reduced impact” (or “side effects”; a toy sketch of one possible formalization appears after this list) and interpretability.

2) A deep dive into the relevant literature (I imagine mostly in analytic philosophy) on each of these concepts (or sets of concepts). These should summarize the state of research on these problems in the relevant fields, and potentially inspire safety researchers, or at least help them frame their work for these audiences and find potential collaborators within these fields. They *might* also do some “legwork” by formalizing logically rigorous notions in terms of mathematics or machine learning.

3) Attempting to transfer insights or ideas from these fields into technical AI safety or machine learning papers, if applicable.
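As promised above, here is a hedged sketch of what one formalization attempt for “reduced impact” might look like (one common style of proposal, offered only as an assumption on my part about what the analysis might produce; the distance-based penalty and all names are mine):

```python
# Hedged sketch: penalize the agent for moving the world's state away from what
# would have happened under inaction. A rigorous analysis of "impact" would tell
# us whether a distance like this actually captures the intended concept.

import numpy as np


def impact_penalty(state_after_action: np.ndarray,
                   state_after_noop: np.ndarray,
                   weight: float = 1.0) -> float:
    """Crude impact measure: distance between the actual and 'do nothing' outcomes."""
    return weight * float(np.linalg.norm(state_after_action - state_after_noop))


def shaped_reward(task_reward: float,
                  state_after_action: np.ndarray,
                  state_after_noop: np.ndarray,
                  weight: float = 1.0) -> float:
    """Task reward minus the impact penalty."""
    return task_reward - impact_penalty(state_after_action, state_after_noop, weight)
```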