a narrative explanation of the QACI alignment plan


a bunch of people are having a hard time understanding my question-answer counterfactual interval (QACI) alignment proposal, so i’m writing out this post which hopefully explains it better. in this scenario, cindy the human user uses an AI implementing QACI to save the world.

this is not the only way i’m thinking of QACI going, but merely one example of how it could go, if it is to go well. it has many assumptions that aren’t explained; it’s meant to give people an idea of what QACI is aiming for, not as much to be a justification of its feasibility. that said, while i don’t expect a successful QACI to go exactly like this, i think this narrative captures the essential aspects of it.

ideally, i’d want to collect various objections people have to this scheme (such as disagreements about assumptions it seems to rely on), give my answers and/​or adjust the plan accordingly, and make a new post about those objections and updates.

first, cindy is “in on this”: she’s aware of how the entire scheme is meant to function. that is required for it to actually work.

  1. cindy is in a room, in front of a computer, with cameras filming her. the cameras’ footage is being recorded on said computer.

  2. cindy walks to the computer and launches aligned-AI-part-1.exe.

  3. aligned-AI-part-1.exe uses her webcam and maybe other sources to generate 1 gigabyte of random data. we call this blob of data the question. the data is stored on her computer, but also displayed to her — eg opened as plaintext in a text editor.

  4. cindy is now tasked with interpreting this data as a prompt, and notices that it looks like random garbage — and she knows that, when the data looks like random garbage, she should type out data that uniquely identifies her answer and that depends both on her and on the question, so she does just this. for example, she might type out whatever things she’s talking about, various hashes of the input data, various hashes of data that is unique to her (such as the contents of her hard drive), stuff like that. this blob of data is called the answer. the uniqueness is important so that the blob of data actually uniquely identifies the answer typed by cindy, which would be different if cindy had gotten a different question. whereas, if the answer were for example 1GB of zeros, it would probly match many empty files that exist in many places on earth; or, if it’s some simple pattern, maybe it can be guessed by alien superintelligences in acausal attacks in some way — and then, our AI would consider these to be valid candidates for which part of the world is the answer. maybe there’s some clever algorithmic way to “entangle” the answer with the question, or something.
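to make the “entanglement” idea in step 4 concrete, here’s a toy sketch in python. the blob size and function names are mine, not part of the actual scheme, and a tiny size stands in for the post’s 1 gigabyte:

```python
import hashlib
import os

BLOB_SIZE = 1024  # toy stand-in for 1 gigabyte

def generate_question(size: int = BLOB_SIZE) -> bytes:
    """step 3: generate a blob of random data (the 'question')."""
    return os.urandom(size)

def entangle_answer(question: bytes, personal_data: bytes) -> bytes:
    """step 4, one possible scheme: an answer that uniquely identifies
    itself by hashing both the question and data unique to cindy."""
    q_hash = hashlib.sha256(question).hexdigest()
    p_hash = hashlib.sha256(personal_data).hexdigest()
    both = hashlib.sha256(question + personal_data).hexdigest()
    return (
        f"hash of the question blob: {q_hash}\n"
        f"hash of my hard drive contents: {p_hash}\n"
        f"hash of question plus my data: {both}\n"
    ).encode()
```

because the answer contains a hash of the question, a different question would have produced a different answer, which is the property the uniqueness argument above is after.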

  5. once the question-answer interval (say, 24 hours) is over, she launches aligned-AI-part-2.exe.

  6. aligned-AI-part-2.exe is the meat of the project. it launches a recursively self-improving AI which we’ll call AI₀ that eventually reaches superintelligence, and executes whichever action is its best guess as to what maximizes its formal goal: to maximize whichever utility function (as a piece of math) would be returned by the (possibly computationally exponentially expensive) mathematical expression E which the world would’ve contained instead of the answer, if in the world, instances of question were replaced with just the string “what should the utility function be?” followed by spaces to pad to 1 gigabyte. we’ll shorten this to QACI("what should the utility function be?"). this is where a lot of the complexity of QACI is, so don’t worry if you don’t get it — hopefully the rest of this narrative is gonna explain it.
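here’s a toy model of the structure of that formal goal — not the actual math. the “world” here is just a function from question blob to answer, which is of course exactly the hard part the real scheme leaves up to the AI to locate and model:

```python
from typing import Callable

QUESTION_SIZE = 64  # toy stand-in for the post's 1 gigabyte

def pad_payload(payload: str, size: int = QUESTION_SIZE) -> bytes:
    """build the counterfactual question: the payload string followed
    by spaces, padded to the size of the original random blob."""
    data = payload.encode()
    assert len(data) <= size, "payload must fit in the question blob"
    return data + b" " * (size - len(data))

def qaci(world: Callable[[bytes], str], payload: str) -> str:
    """toy QACI(payload): what answer would the world have produced,
    had the question blob been replaced by the padded payload?"""
    return world(pad_payload(payload))
```

the real thing asks this counterfactual about the actual physical world, not about a handed-over function; the sketch only shows the substitute-then-read-off-the-answer shape.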

  7. AI₀ eventually emits a best guess: a different AI program, AI₁, in which AI₀ has implemented embedded agency and things like that, because AI₀ can see that its output is intended to be run inside a world. AI₀ will have made sure AI₁ is aligned with itself, of course: AI₁ is just an extra step towards the formal goal mentioned above.

  8. AI₁ starts thinking more seriously about its formal goal. clearly, it’s gonna need to learn a lot more about the world to locate instances of question and answer in it; so it starts accessing the internet and learning about the world.

  9. AI₁ comes to the (true) conclusion that this world seems to contain what we’d call computers, that it’s running on one such thing, and that this computer is basically the thing that generated question, emitted it into the world and received answer. so AI₁ thinks to itself “okay, let’s say question was replaced with “what should the utility function be?”. what would happen next?”

  10. AI₁ finds camera footage of the room, and thinks “aha! it looks like these things my data has talked about, a “human”, was a pretty important part of what turned question into answer. i wonder what other answer this “human” would’ve typed into the computer if instead of the question it did get, it instead got “what should the utility function be?” as a question.” (note how we never need to tell any AI the true name of “human” or “computer” or anything like that; we set up QACI such that it indirectly points to what we want, and then figuring out those complex concepts in the world is up to the AI to model in whatever way it wants)

  11. AI₁ starts trying to guess using its limited compute and data, but clearly that data isn’t enough. nevertheless, AI₁ figures out some stuff:

    • these “humans” have things they “want”

    • this “human”, who used the computer, seems to be intending to use this whole process AI₁ is part of to do things it “wants”

    • AI₁ should probly be a bit careful about affecting this world, because this “human”’s “want”s seem fragile and its world seems brittle. so, if the utility function that E would eventually return does correspond to what this “human” would “want”, which does seem like a reasonable possibility at this point, then it should try to act in ways that don’t destroy its expected future utility.

  12. in order to get more compute and data, AI₁ very carefully hacks the internet, takes over the world, maybe prints nanobots and turns large uninhabited parts of the world into compute, and starts using its newfound access to real-world data and computing power to make better guesses as to what utility function E would eventually return.

  13. finally! it looks like AI₁ now has the compute to make some real good informed guesses about E. in order to get more information about this “human” called “cindy” that was in the room, AI₁ also brainscans cindy.

  14. (cindy cooperates because she knows this was a reasonable possibility and the fact that the AI wants to do this is a sign that things are working well so far)

  15. it’s time for a simulation! AI₁ imagines replacing all instances of question in the world, prior to its launch, with “what should the utility function be?”, looking for what blob of data will take the place of answer. so, it starts running a (rough) simulation of the room cindy was in, with the computer and the cameras and everything, because that’s where most of what mattered to the answer seemed to be located.

  16. a cindy inside the simulation reads the question blob on the computer. instead of being 1 gigabyte of random garbage, it’s a plain text file that starts with “what should the utility function be?” followed by many spaces. she cheers; clearly, this alignment scheme is working as intended, indeed the AI has gotten to the point of running a simulation of her! well, it’s time to get to work.

  17. cindy starts thinking about solutions to alignment. maybe she chats with other alignment researchers who are on board with this project — whom AI₁ now also needs to brainscan in order to simulate accurately. after 10 hours or so, she writes into the computer the mathematical expression QACI("okay, i know this is only the first iteration, but here are my very first guesses as to what utility function to maximize to create utopia…" followed by a bunch of ideas). the crucial point here is that, because this first cindy-simulation is allowed to return not just a literal utility function, but a mathematical expression which evaluates to a utility function, she gets to make that expression depend on whatever computation she wants, including another simulation of her question-answer interval. in fact, these can call each other in an arbitrarily deep stack.

  18. AI₁ now has its counterfactual answer. “god damn, the mathematical expression this “human” returned requires me to run another simulation of it to get the result? well, i guess it can’t be helped, i’ll have to repeat this process until one of the iterations of the human returns an actual utility function, instead of passing the buck to further simulations”. it runs this second simulation.

  19. inside the second simulation, cindy reads the message on the computer, and it reads “okay, i know this is only the first iteration, but here are my very first guesses as to what utility function to maximize to create utopia…” followed by a bunch of ideas. again, she cheers; this instance of her has no memory of the first iteration of her simulation, so the fact that the alignment scheme works is news to her. she works, again for let’s say about 10 hours, and then writes as her answer QACI("2nd iteration cindy here, here's the tiny bit of progress i made on alignment…" followed by more ideas).

  20. in this way, AI₁ simulates (or maybe, in some clever way, skips past simulating and gets straight to the result) cindy-and-her-world for many iterations.
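the iteration structure of steps 17–20 can be sketched as a loop. this is a toy model where each simulated answer is just a tagged string, either another QACI call or a final utility function; the tags and names are illustrative, not part of the scheme:

```python
from typing import Callable

def run_qaci_stack(world: Callable[[str], str], payload: str,
                   max_iters: int = 100) -> str:
    """toy model of the iteration: keep re-running the counterfactual
    question-answer interval until some iteration returns an actual
    utility function ('UTILITY:...') instead of passing the buck to
    another simulation ('QACI:...')."""
    for _ in range(max_iters):
        answer = world(payload)
        if answer.startswith("UTILITY:"):
            return answer[len("UTILITY:"):]  # a cindy finally returned one
        assert answer.startswith("QACI:"), "answer must recurse or finish"
        payload = answer[len("QACI:"):]
    raise RuntimeError("hit iteration cap without a utility function")
```

each pass through the loop is one simulated cindy doing her ~10 hours of work; the loop only terminates when an iteration returns a concrete utility function rather than another QACI expression.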

  21. eventually, one of the cindys has gotten to a solution that she’s pretty confident about: it’s not a full explicit utility function of what is good, but it’s a much better CEV-like process, inspired by QACI but different in many ways. instead of returning QACI(some string), she returns this new mathematical object.

  22. AI₁ goes “huh, that’s new! welp, time to run that”.

  23. on the outside, AI₁ has a strong enough hold on the world to ensure its continuation no matter what; but also, AI₁ can see that whatever this sequence of simulations will eventually end up in, it will probly entail humans not being murdered or suffering needlessly, so it avoids causing those in the meantime. in particular, it makes damn sure to stop anyone else from launching superintelligent AI.

  24. eventually, after a bunch more such macro-iterations, a utility function that creates utopias is returned, and AI₁ finally maximizes it in the world, creating utopia. in the meantime, perhaps it has been implementing increasingly accurate approximations of that utility function, and already launched into space many copies of itself tasked with running the same sequence of simulations and maximizing their utility functions in the rest of the lightcone.