CCS: Counterfactual Civilization Simulation

I don’t think this is very likely to work out, but a possible path to alignment is formal goal alignment, which is basically the following two-step plan (sketched in code just after the list):

  1. Define a formal goal that robustly leads to good outcomes under heavy optimization pressure

  2. Build something that robustly pursues the formal goal you give it
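
To make the shape of the plan concrete, here is a minimal sketch of the two deliverables as types. The names are mine, for illustration; nothing here is pinned down by the plan itself:

```python
from typing import Callable

# A "world" is whatever mathematical object the formal goal is defined
# over; it is deliberately left abstract here.
World = object

# Step 1's deliverable: a formal goal, i.e. a utility function over worlds
# that keeps pointing at good outcomes even under heavy optimization
# pressure. This post is a proposal for constructing one such function.
FormalGoal = Callable[[World], float]

def build_pursuer(goal: FormalGoal) -> None:
    """Step 2's deliverable: something that robustly pursues `goal`.

    Out of scope for this post; left as an opaque stub.
    """
    raise NotImplementedError
```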

I think currently the best proposal for step 1 is QACI. In this post, I propose an alternative that is probably worse but definitely not Pareto-worse.

High-Level Overview

Step 1.1: Build a large facility (“The Vessel”). Populate The Vessel with very smart, very sane people (e.g. Eliezer Yudkowsky, Tamsin Leake, Gene Smith), along with labs and equipment that would be useful for starting a new civilization.

Step 1.2: Mark The Vessel with something that is easy to identify within the Tegmark IV multiverse (“The Vessel Flag”).
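
A minimal sketch of one way The Vessel Flag could be produced, assuming (as QACI-style proposals do for their blobs) that a long uniformly random bitstring is overwhelmingly unlikely to occur anywhere except causally downstream of the sampling event itself:

```python
import secrets

# The Vessel Flag: a bitstring long enough that, under any reasonable prior
# over programs, essentially its only occurrences are copies descended from
# this sampling event. 1 KiB (8192 random bits) is an arbitrary but
# comfortable choice of length.
VESSEL_FLAG: bytes = secrets.token_bytes(1024)

# The flag would then be physically instantiated in The Vessel (engraved,
# printed, stored redundantly) so that a faithful simulation of the right
# chunk of the multiverse contains recoverable copies of it.
```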

Step 1.3: Leave the people and stuff in The Vessel for a little while, and then destroy The Vessel Flag and dismantle The Vessel.

Step 2: Define CCS as the result of the following procedure (a code sketch of the whole pipeline follows step 2.5):

Step 2.1: Grab The Vessel out of a Universal Turing Machine, identifying it by The Vessel Flag. (This is the very, very hard part.)

Step 2.2: Locate the solar system that contains The Vessel, and run it back 2 billion years. (This is another very hard part.)

Step 2.3: Put The Vessel on the Earth in this solar system, and simulate the solar system until either a success condition or a failure condition is met. The idea here is that The Vessel’s inhabitants repopulate the Earth with a civilization much smarter and saner than ours, one that will have a much easier time solving alignment. More importantly, this civilization will have effectively unlimited time to solve alignment.

Step 2.4: The success condition is the creation of The Output Flag. Accompanying The Output Flag is some data; interpret that data as a mathematical expression.

Step 2.5: Evaluate this expression and interpret the result as a utility function.
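
Putting steps 2.1 through 2.5 together, here is a sketch of CCS as a single function with every hard part stubbed out. Every name below is a hypothetical placeholder for machinery this post does not provide, and the failure branch returning a constant utility is my assumption, not something the steps above specify:

```python
from typing import Callable

World = object                       # a state of the simulated solar system
Utility = Callable[[World], float]

def locate_vessel(flag: bytes) -> World:
    """Step 2.1: search program space (e.g. runs of a universal Turing
    machine, weighted by simplicity) for a structure containing `flag`."""
    raise NotImplementedError

def rewind_solar_system(world: World, years: float) -> World:
    """Step 2.2: isolate the solar system containing The Vessel and run it
    back `years` years."""
    raise NotImplementedError

def insert_vessel_on_earth(early_world: World, vessel: World) -> World:
    """Step 2.3: place The Vessel on the early Earth."""
    raise NotImplementedError

def step_physics(world: World) -> World: raise NotImplementedError
def contains_output_flag(world: World) -> bool: raise NotImplementedError
def failure_condition(world: World) -> bool: raise NotImplementedError
def read_output_data(world: World) -> bytes: raise NotImplementedError

def interpret_as_utility(data: bytes) -> Utility:
    """Steps 2.4-2.5: parse the data accompanying The Output Flag as a
    mathematical expression and evaluate the result into a utility."""
    raise NotImplementedError

def ccs(flag: bytes) -> Utility:
    """CCS: from The Vessel Flag to a utility function over worlds."""
    vessel = locate_vessel(flag)                      # Step 2.1
    world = rewind_solar_system(vessel, 2e9)          # Step 2.2
    world = insert_vessel_on_earth(world, vessel)     # Step 2.3
    while True:                                       # Step 2.3, continued
        world = step_physics(world)
        if contains_output_flag(world):               # Step 2.4: success
            return interpret_as_utility(read_output_data(world))  # 2.5
        if failure_condition(world):
            return lambda w: 0.0  # assumed: failure yields a null utility
```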

Step 3: Build a singleton AI that maximizes E[CCS(world)].
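
One way to read step 3’s expression, on the assumption (mine, not spelled out above) that the expectation is over the AI’s uncertainty about which world its actions bring about, and writing CCS(w) for the value the CCS-produced utility function assigns to world w:

$$\pi^{\star} \;=\; \operatorname*{arg\,max}_{\pi}\; \mathbb{E}_{w \sim P(\cdot \mid \pi)}\!\left[\mathrm{CCS}(w)\right]$$

Here π ranges over policies and P(· | π) is the AI’s prior over worlds conditional on executing π; the choice of prior and of decision theory is left open.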

The Details

TODO: I will soon either update this post or make more posts with more details as I come up with them.

CCS vs QACI

  • QACI requires a true name of “counterfactual”, but that’s about it. It just needs to ask, “If we replace this blob with a question, what will most likely replace the answer blob?” Physics and everything else are expected to be inferred from the existence of this “question” blob. CCS, on the other hand, requires a prior specification of an approximation of physics at least good enough to simulate an Earth with humans for billions of years, or maybe some weird ontology-translation scheme. (The sketch after this list makes the contrast concrete.)

  • QACI is a function that must be called recursively (since we aren’t expecting anyone to solve alignment fully within the short interval), creating a big complicated call graph. There are clever tricks for preventing this from causing a memetic catastrophe, but also lots of places those tricks can fail. CCS, on the other hand, only needs to be called once. The simulacra solving alignment have a LOT more time than we do, and they can build an entire civilization optimized around our/​their goal.

  • QACI is vulnerable to superintelligences launched within the simulated world (since it is the modern world with all of its AI development, and a bunch of timelines might die during the QACI interval without us realizing). CCS, on the other hand, simulates a very small world (just the solar system) containing a civilization that will quickly become powerful enough to prevent any other intelligence from evolving.

  • The output is easier to “grab” from QACI, since it’s just a file on a computer that can straightforwardly be interpreted as a math expression. But I think that if we figure out how to grab The Vessel, we can probably use a very similar method to grab the output.

  • In general, CCS seems safer but also much harder than QACI.
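
To make the first two bullets concrete, here is a sketch of the core primitive each proposal needs as input, plus the call-structure difference. The types and names are mine, for illustration only; neither proposal specifies them:

```python
from typing import Callable

World = object
Utility = Callable[[World], float]

# QACI's core primitive: a formalized counterfactual. Given a world and a
# "question" blob spliced in where the original blob was, return what most
# likely ends up in the answer blob's place. Physics and everything else
# are inferred from the existence of the blob, not supplied by us.
QACICounterfactual = Callable[[World, bytes], bytes]

# CCS's core primitive: a prior specification of (an approximation of)
# physics, good enough to step a solar system containing humans forward
# for billions of years.
CCSPhysicsStep = Callable[[World], World]

# Call structure: QACI gets chained recursively across many short
# intervals, building a big graph of calls; CCS is invoked exactly once,
# and the simulated civilization takes as much simulated time as it needs.
```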
