A general model of safety-oriented AI development

This may be trivial or obvious to a lot of people, but it doesn't seem like anyone has bothered to write it down (or I haven't looked hard enough). It started out as a generalization of Paul Christiano's IDA, but it also covers things like safe recursive self-improvement.

Start with a team of one or more humans (researchers, programmers, trainers, and/or overseers), with access to zero or more AIs (initially as assistants). In each round, the human/AI team develops a new AI and adds it to the team, repeating until maturity in AI technology is reached. Safety/alignment is ensured by choosing some set of safety/alignment properties of the team that is inductively maintained by the development process.
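
To make the inductive structure explicit, here is a minimal Python sketch of the loop described above. Every name in it (`Team`, `develop_next_ai`, `invariant_holds`, `is_mature`) is a hypothetical placeholder introduced purely for illustration, not part of any real proposal or API; the content of the invariant and of the development step are exactly the parts that a specific practice like IDA would have to fill in.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Team:
    """A development team: humans plus the AIs built so far."""
    humans: List[str]                             # researchers, programmers, trainers, overseers
    ais: List[str] = field(default_factory=list)  # AI assistants developed in earlier rounds


def develop_safely(
    team: Team,
    develop_next_ai: Callable[[Team], str],   # the human/AI team builds the next AI
    invariant_holds: Callable[[Team], bool],  # the chosen safety/alignment properties
    is_mature: Callable[[Team], bool],        # has AI technology reached maturity?
) -> Team:
    """Iterate AI-assisted AI development while inductively maintaining the invariant."""
    assert invariant_holds(team)              # base case: the initial (mostly human) team is safe/aligned
    while not is_mature(team):
        new_ai = develop_next_ai(team)
        candidate = Team(team.humans, team.ais + [new_ai])
        # Inductive step: adopt the new AI only if the safety/alignment
        # properties still hold for the enlarged team.
        assert invariant_holds(candidate)
        team = candidate
    return team
```

The asserts stand in for whatever argument or verification process actually establishes the base case and the inductive step; different choices of invariant and verification method give different points in the space of safety-oriented development practices.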

The reason I started thinking in this direction is that Paul's approach seemed very hard to knock down: any time a flaw or difficulty is pointed out, or someone expresses skepticism about some technique it uses or about the overall safety invariant, there's always a list of other techniques or invariants that could be substituted in for that part (sometimes in my own brain as I tried to criticize some part of it). Eventually I realized this shouldn't be surprising, because IDA is an instance of this more general model of safety-oriented AI development, so there are bound to be many points near it in the space of possible safety-oriented AI development practices. (Again, this may already be obvious to others including Paul, and in their minds IDA is perhaps already a cluster of possible development practices consisting of the most promising safety techniques and invariants, rather than a single point.)

If this model turns out not to have been written down before, perhaps it should be assigned a name, like Iterated Safety-Invariant AI-Assisted AI Development, or something pithier?