A general model of safety-oriented AI development

This may be trivial or obvious to a lot of people, but no one seems to have bothered to write it down (or I haven't looked hard enough). It started out as a generalization of Paul Christiano's Iterated Distillation and Amplification (IDA), but it also covers things like safe recursive self-improvement.

Start with a team of one or more humans (researchers, programmers, trainers, and/or overseers) with access to zero or more AIs (initially as assistants). In each round, the human/AI team develops a new AI and adds it to the team, repeating until maturity in AI technology is achieved. Safety/alignment is ensured by having some set of safety/alignment properties on the team that is inductively maintained by the development process.
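
The loop above can be sketched as an induction: the invariant holds for the initial team (base case), and each round only adds a new AI if the enlarged team still satisfies it (inductive step). Here is a minimal toy sketch in Python; all names here (`develop_next_ai`, `safety_invariant_holds`, `is_mature`) are hypothetical placeholders for whatever concrete techniques and invariants a real development process would use, not part of any actual framework.

```python
def develop_next_ai(team):
    # Toy stand-in: each round the current team produces a new AI,
    # labelled here by generation number.
    return f"AI-{len(team)}"

def safety_invariant_holds(team):
    # Toy stand-in for whatever safety/alignment property the team
    # is supposed to maintain across rounds.
    return "human" in team

def is_mature(team):
    # Toy stopping condition: stop after three AI generations.
    return len(team) >= 4

def iterated_development(team):
    assert safety_invariant_holds(team)  # base case of the induction
    while not is_mature(team):
        new_ai = develop_next_ai(team)
        candidate = team + [new_ai]
        # Inductive step: adding the new AI must preserve the invariant.
        assert safety_invariant_holds(candidate)
        team = candidate
    return team

final_team = iterated_development(["human"])
print(final_team)  # ['human', 'AI-1', 'AI-2', 'AI-3']
```

The point of the sketch is only the shape of the argument: if the base case and the inductive step both hold, the final, mature team satisfies the invariant by induction, whatever the invariant and the development technique actually are.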

The reason I started thinking in this direction is that Paul's approach seemed very hard to knock down: whenever a flaw or difficulty is pointed out, or someone expresses skepticism about some technique it uses or about the overall safety invariant, there is always a list of other techniques or invariants that could be substituted for that part (sometimes supplied by my own brain as I tried to criticize it). Eventually I realized this shouldn't be surprising, because IDA is an instance of this more general model of safety-oriented AI development, so there are bound to be many points near it in the space of possible safety-oriented AI development practices. (Again, this may already be obvious to others, including Paul; in their minds IDA is perhaps already a cluster of possible development practices consisting of the most promising safety techniques and invariants, rather than a single point.)

If this model turns out not to have been written down before, perhaps it should be assigned a name, like Iterated Safety-Invariant AI-Assisted AI Development, or something pithier?