A general model of safety-oriented AI development

This may be trivial or obvious to a lot of people, but it doesn’t seem like anyone has bothered to write it down (or I haven’t looked hard enough). It started out as a generalization of Paul Christiano’s IDA, but it also covers things like safe recursive self-improvement.

Start with a team of one or more humans (researchers, programmers, trainers, and/or overseers), with access to zero or more AIs (initially as assistants). In each round, the human/AI team develops a new AI and adds it to the team, repeating this process until maturity in AI technology is reached. Safety/alignment is ensured by identifying some set of safety/alignment properties of the team that is inductively maintained by the development process.
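
To make the structure of the induction concrete, here is a minimal Python sketch of the loop. Every name in it (`Team`, `develop_new_ai`, `satisfies_safety_invariant`, `ai_technology_is_mature`) is a hypothetical placeholder rather than a proposal for how any step would actually be carried out; the only point is to show where the base case and the inductive step sit.

```python
# Illustrative sketch only: the functions below are placeholders, not
# proposals for how development, invariant-checking, or the stopping
# condition would actually work.

from dataclasses import dataclass, field
from typing import List


class Agent:
    """Placeholder for an AI developed by the team."""


@dataclass
class Team:
    humans: List[str]                                   # researchers, programmers, trainers, overseers
    ais: List[Agent] = field(default_factory=list)      # starts with zero or more AI assistants


def develop_new_ai(team: Team) -> Agent:
    """One round of AI-assisted AI development (details unspecified)."""
    raise NotImplementedError


def satisfies_safety_invariant(team: Team) -> bool:
    """Check the chosen set of safety/alignment properties of the whole team."""
    raise NotImplementedError


def ai_technology_is_mature(team: Team) -> bool:
    """Stopping condition: maturity in AI technology has been reached."""
    raise NotImplementedError


def run(team: Team) -> Team:
    # Base case: the initial (mostly human) team satisfies the invariant.
    assert satisfies_safety_invariant(team)
    while not ai_technology_is_mature(team):
        new_ai = develop_new_ai(team)      # the current team builds the next AI
        team.ais.append(new_ai)            # and adds it to the team
        # Inductive step: the development process preserves the invariant.
        assert satisfies_safety_invariant(team)
    return team
```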

The reason I started thinking in this direction is that Paul’s approach seemed very hard to knock down: any time a flaw or difficulty is pointed out, or someone expresses skepticism about some technique that it uses or about the overall safety invariant, there’s always a list of other techniques or invariants that could be substituted in for that part (sometimes supplied by my own brain as I tried to criticize some part of it). Eventually I realized this shouldn’t be surprising, because IDA is an instance of this more general model of safety-oriented AI development, so there are bound to be many points near it in the space of possible safety-oriented AI development practices. (Again, this may already be obvious to others, including Paul, and in their minds IDA is perhaps already a cluster of possible development practices consisting of the most promising safety techniques and invariants, rather than a single point.)

If this model turns out not to have been written down before, perhaps it should be assigned a name, like Iterated Safety-Invariant AI-Assisted AI Development, or something pithier?