the Insulated Goal-Program idea

(this post has been written for the first Refine blog post day, at the end of the week of readings, discussions, and exercises about epistemology for doing good conceptual research)

the Insulated Goal-Program idea is a framework for AI alignment which feels potentially more tractable than most other ideas i’ve seen.

it splits the task of building aligned AI into two parts:

  1. building a very intelligent AI which, when running, will have the axiomatic goal of running a program, which we’ll call the goal-program

  2. building said goal-program, such that when run, it hopefully creates valuable outcomes (a toy sketch of this split follows right after this list)
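
to make the split concrete, here’s a minimal toy sketch in python. every name in it is made up for illustration, and the stand-in goal-program obviously doesn’t compute anything valuable; it just shows where each of the two parts sits:

```python
# toy sketch of the two-part split; every name here is hypothetical and the
# stand-in goal-program is a placeholder, not an actual valuable computation.

def goal_program() -> str:
    # part 2: a fixed program designed in advance, insulated from the AI.
    # a real goal-program would, when run, produce valuable outcomes
    # (e.g. instantiate a simulated utopia or a utopia-finding process).
    return "stand-in for some valuable computation"

class InsulatedGoalAgent:
    # part 1: a very intelligent AI whose axiomatic goal is simply to run
    # the stored goal-program, faithfully and without altering it.
    def __init__(self, program):
        self.program = program  # kept verbatim; the agent never rewrites it

    def pursue_goal(self) -> str:
        # whatever plans the agent makes (acquiring compute, defending the
        # program, ...) bottom out in faithfully executing it:
        return self.program()

agent = InsulatedGoalAgent(goal_program)
print(agent.pursue_goal())
```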

the fact that the AI’s goal is to run a program, which it is motivated to execute faithfully and without altering, lets us design a goal-program that doesn’t have to deal with an adversarial optimizing superintelligence: it is insulated from the AI’s choices.

(or at least, there’s supposedly no reason for the AI to run long stretches of variants of that program, since doing so would incur computational cost for no gain)

one way to insulate the goal-program is to make it fully deterministic. ideally, however, we would want it to receive as input the state of the world from before the AI modifies it, which the AI will pretty much inevitably do, destroying everything and tiling the universe with computronium dedicated to running the goal-program.
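
as a small, hypothetical illustration of insulation-by-determinism: if the goal-program is a pure function of its own fixed source and a one-time snapshot of the pre-takeover world, then any faithful run of it produces the same result, no matter what the AI later does to the physical world.

```python
# hypothetical illustration: a fully deterministic goal-program is a pure
# function of its fixed source and the one-time world snapshot it receives,
# so its output can't be steered by anything the AI does afterwards.

import hashlib

def goal_program(snapshot: bytes) -> str:
    # stand-in for the real goal-program: deterministic, no live inputs
    return hashlib.sha256(b"fixed goal-program source" + snapshot).hexdigest()

# snapshot captured once, before the AI starts modifying the world
snapshot = b"state of the world before the AI acts"

# any faithful run, whenever and on whatever computronium it happens, agrees
assert goal_program(snapshot) == goal_program(snapshot)
```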

this is how this idea solves the “facebook AI destroys the world six months later” problem: the AI will run the goal-program at any cost, including turning everything that exists into computronium.

but that’s okay: the point here is for us, or at least our values, to survive inside the goal-program. that is the bullet i bite to allow this idea to function: i give up on the literal physical world around us, in the hopes that we’re satisfied enough with getting to determine what it is that runs on the computronium that everything is turned into.

making the goal-program runnable on quantum compute might allow us to resimulate earth, as well as generally gain a lot more compute from the universe, especially if BQP ≠ P.

this whole framework splits the problem of aligned AI cleanly into two parts: the design of the AI-insulated goal-program, and the design of the AI whose goal will be to run said program. the goal-program’s insulation lets us design utopias or utopia-finding programs which don’t have to deal with adversariality from the AI, such as vaguely-friendly NNs evaluating the quality of simulated worlds, or simulated researchers figuring out alignment with as much time as they need. i write more about goal-program design here.
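
as a very rough, hypothetical sketch of what a utopia-finding goal-program could look like structurally (none of these names are from the post): a search over candidate simulated worlds scored by an evaluator which, thanks to the insulation, never has to face inputs adversarially chosen by the AI.

```python
# hypothetical sketch of one kind of goal-program: a search over candidate
# simulated worlds, scored by an evaluator that doesn't need to be robust to
# an adversarial superintelligence, since the AI never gets to pick its inputs.

from typing import Callable, Iterable

def utopia_finding_program(candidates: Iterable[str],
                           score: Callable[[str], float]) -> str:
    # 'score' stands in for e.g. a vaguely-friendly NN judging simulated
    # worlds; inside the insulated goal-program it is just an ordinary function.
    return max(candidates, key=score)

# toy usage with made-up candidate worlds and a made-up scoring rule
worlds = ["world A", "world B", "world C"]
print(utopia_finding_program(worlds, score=len))
```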

it also resolves some questions of embedded agency: the goal-program is indeed smaller than the agent, so the agent can hold its goal in full, and might only need notions of embedded agency resolved for how it thinks about the outside world it’s turning into computronium.