Two paths to win the AGI transition

Path 1: Solve Alignment

If alignment is possible, I think we are vastly more likely to win if we set up a scenario where we can safely iterate. Humans are really good at collaborating on iterative engineering improvements! This is not an alignment proposal in itself, but a sort of meta-proposal: make it safe to test out lots of different plans on a real-deal, full-blown, superhumanly-intelligent, indisputably-general AI.

The key here is ‘safely’. What would it mean to do safe iterative research on a slightly-superhuman AGI? I think if you are trying to study a superhuman intelligence, the only safe way to do so is if you have many levels of control over it.

Here are my guidelines (a rough sketch of how a few of them might fit together follows the list):

  1. Prevent it from knowing we exist, or even that our universe exists. Never communicate with it. Run it in a censored simulation with different physics and no mention of humans or our universe.

  2. Use highly secure hardware to isolate the compute cluster, such as data diodes that allow information to flow only inward, into the local network. Readouts must then be done from within the local network.

  3. Wipe its memory frequently. Do many short runs while studying it, rather than allowing it to accumulate knowledge about its world over time.

  4. Carefully modulate the amount of compute available to it, so that it (and the simulation it is in) can be slowed down or paused while human researchers investigate suspicious behavior.

  5. Have many fail-safe mechanisms that automatically shut down the simulation. For example, honeypots of matryoshka-doll-style virtual computers that detect whether an agent has hacked out of an inner virtual computer into an outer one.
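
To make the shape of this concrete, here is a minimal, purely illustrative sketch of how guidelines 3–5 might fit together in a run controller: many short episodes, throttled compute, honeypot tripwires that abort the run, and a memory wipe between episodes. Every name here (`SandboxRun`, `honeypot_tripped`, etc.) is hypothetical; a real system would live on air-gapped hardware behind data diodes, not in a Python loop.

```python
import time
from dataclasses import dataclass, field

# Hypothetical stand-in for the boxed simulation. A real system would wrap an
# isolated cluster behind data diodes, not a Python object.
@dataclass
class SandboxRun:
    seed: int
    steps_taken: int = 0
    tripwires: list = field(default_factory=list)  # honeypot monitors (callables)

    def step(self):
        """Advance the censored simulation by one tick."""
        self.steps_taken += 1

    def honeypot_tripped(self) -> bool:
        """True if any matryoshka-style honeypot detected an escape attempt."""
        return any(trip() for trip in self.tripwires)

    def wipe(self):
        """Discard all accumulated state (guideline 3: frequent memory wipes)."""
        self.steps_taken = 0


def run_short_episode(run: SandboxRun, max_steps: int, slowdown_s: float) -> str:
    """Run one short, throttled episode and report why it stopped."""
    for _ in range(max_steps):
        run.step()
        if run.honeypot_tripped():       # guideline 5: automatic shutdown
            return "fail-safe triggered"
        time.sleep(slowdown_s)           # guideline 4: modulate effective compute
    return "episode complete"


if __name__ == "__main__":
    for seed in range(3):                # many short runs, not one long one
        run = SandboxRun(seed=seed, tripwires=[lambda: False])
        print(f"seed {seed}:", run_short_episode(run, max_steps=100, slowdown_s=0.0))
        run.wipe()                       # wipe memory before the next run
```

The point of the skeleton is that the shutdown, throttling, and wipe logic all sit outside the simulation, in code the boxed system never sees.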

I previously came up with this list of advantages we have over a boxed, simulation-trained AGI (a few of them are sketched in code after the list):

  • examine all of its brain activity / activation states at each token / time-step (subjective millisecond)

  • run it as slow as desired, so that we can have days to analyze each of its subjective milliseconds

  • wipe its memory and rerun it over the same (or different) data with different random seeds to initialize it

  • save and contrast different paused versions run on slightly different data

  • disable or inhibit any portion of its weights / neurons

  • precisely targeted injection of crafted vectors or noise into its hidden states during training or inference

  • use agent models, rather than myself, as the direct viewers of the boxed entity’s output, and then decide whether to destroy a given copy of the entity based on what those agent models report (e.g. if it has become too powerful to safely contain)
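
As a gesture at how a few of these advantages cash out in tooling terms, here is a toy sketch using standard PyTorch forward hooks on a stand-in network: recording every activation, injecting a crafted steering vector into a hidden state, rerunning from a clean state under different seeds, and ablating part of the weights. The network, layers, and steering direction are all illustrative assumptions, not the actual instrumentation one would use on a boxed AGI.

```python
import torch
import torch.nn as nn

# Tiny stand-in "brain": in practice the hooks would attach to the boxed
# model's layers, but the inspect/ablate/steer pattern is the same.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

captured = {}

def record_activations(module, inputs, output):
    # Advantage: read out every hidden state at every time-step.
    captured[id(module)] = output.detach().clone()
    return None  # leave the forward pass unchanged

def inject_steering_vector(module, inputs, output):
    # Advantage: precisely targeted injection of a crafted vector into a hidden state.
    steering = torch.zeros_like(output)
    steering[..., 0] = 1.0               # hypothetical crafted direction
    return output + steering

hooks = [
    model[0].register_forward_hook(record_activations),
    model[1].register_forward_hook(inject_steering_vector),
]

with torch.no_grad():
    # Advantage: rerun from a clean state with different random seeds.
    for seed in (0, 1):
        torch.manual_seed(seed)
        x = torch.randn(4, 16)
        _ = model(x)

with torch.no_grad():
    # Advantage: disable (ablate) any portion of the weights between runs.
    model[0].weight[:8, :] = 0.0

for handle in hooks:
    handle.remove()
```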

I like this analysis, and I feel it describes well my intuitions about why it is probably safe to train an AGI in a censored simulation even if the AGI is somewhat smarter than me. Obviously, in the limit of unbounded intelligence this eventually breaks down. I think it still offers a good margin of safety, though.

https://www.lesswrong.com/posts/odtMt7zbMuuyavaZB/when-do-brains-beat-brawn-in-chess-an-experiment

https://iai.tv/articles/the-ai-containment-problem-auid-2159

https://www.lesswrong.com/posts/z8s3bsw3WY9fdevSm/boxing-an-ai

https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need

https://www.lesswrong.com/posts/p62bkNAciLsv6WFnR/how-do-we-align-an-agi-without-getting-socially-engineered

Path 2: Digital Humans

What if alignment is impossible? Or just impossible within the timeframe and resources that Molochian dynamics have granted us?

I’m not certain that this is the case. I still think the plan of creating an AGI within a censored simulation in a sandbox, and doing alignment research on it, could feasibly succeed. We should absolutely do that. But I don’t want to pin all my hopes on that plan.

I think we can still win without succeeding at alignment. Here’s my plan:

Part 1 - Delay!

We need time. We need to not be overwhelmed by rogue AGI as soon as it becomes feasible. We need to coordinate on governance strategies that can help buy us time. We need to make efficient use of our tool-AI assistants in the time we do have. We need to try to disproportionately channel resources into alignment research and this plan, rather than into capabilities research.

Part 2 - Build Digital Humans

This has a variety of possible forms, differing primarily in the degree of biological detail that is emulated. A lot of progress has already been made on understanding and emulating the brain, and progress is accelerating as compute increases and AI tools aid in collecting and processing brain data. I think humanity will want to pursue this for reasons beyond just preventing AI-kill-everyone-doom. For the sake of the advancement of humanity, for the ability to explore the universe at the speed of light, I think we want digital people. In the name of exploration and growth and thriving, I think humanity has a far brighter future ahead if some of us are digital.

This plan comes with a significant safety risk. I believe that by studying the architecture of the brain and implementing computationally efficient digital versions of it, we will necessarily discover novel algorithmic improvements for AI. If we set up a large project to do this, it needs a strong safety culture and oversight in order to prevent these capabilities advances from becoming public.

When I discuss this plan with people, most objections fall into one of the following:

  1. we don’t know enough about the mysterious workings of the brain to build a functionally accurate emulation of it

  2. even if we could build a functionally accurate emulation, it would be way too computationally inefficient to run at anywhere near real-time

  3. we don’t have enough time before AGI gets developed to build Digital Humans

I think that there is enough published neuroscience data to successfully understand and replicate the functionality of the brain, and that it is possible (though an additional challenge) to build a computationally efficient version that can run at above real-time on a single server. For the objection that we don’t have enough time, I don’t have an answer other than to point at the ‘Delay’ part of the plan.

I won’t go into the details of this here; I just want to acknowledge that these points are cruxes and that I have thought about them. If you have other, different cruxes, please let me know!

Part 3 - Careful Recursive Improvement of the Digital Humans

We can’t stop at ‘merely’ a digital human if it has to compete with rogue self-improving AGI. Nor would we want to. A copy of a current human is just the start. Why not aim higher? More creativity, more intelligence, more compassion, more wisdom, more joy....

But we must go very carefully. It would be frighteningly easy to Goodhart this: in the attempt to make a more competitive version of a digital human, we could end up wiping out the very core of human value that makes it worthwhile. We will need to implement a diverse variety of checks that a Human Emulation has retained its core values.
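
As a very rough, assumption-laden sketch of what ‘a diverse variety of checks’ could look like structurally, here is a toy gating harness: a battery of value probes, each scored against a threshold, with promotion of a modified emulation blocked if any check regresses. The hard part, designing probes that actually measure core values, is exactly what this sketch does not solve; the names (`ValueCheck`, `passes_all`) and the probes are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# An "emulation" is abstracted here as a prompt -> response function.
Emulation = Callable[[str], str]

@dataclass
class ValueCheck:
    name: str
    probe: Callable[[Emulation], float]  # returns a score in [0, 1]
    min_score: float

def passes_all(candidate: Emulation, checks: List[ValueCheck]) -> bool:
    """Block promotion of the candidate if any value check regresses."""
    for check in checks:
        score = check.probe(candidate)
        if score < check.min_score:
            print(f"FAIL {check.name}: {score:.2f} < {check.min_score}")
            return False
        print(f"ok   {check.name}: {score:.2f}")
    return True

if __name__ == "__main__":
    # Toy stand-ins: real probes would be behavioral batteries, interviews,
    # and cooperation games, scored by independent graders.
    reference: Emulation = lambda prompt: "I would not sacrifice others for an advantage."
    candidate: Emulation = lambda prompt: "I would not sacrifice others for an advantage."

    checks = [
        ValueCheck(
            name="agreement-with-reference",
            probe=lambda emu: float(emu("dilemma-1") == reference("dilemma-1")),
            min_score=1.0,
        ),
    ]
    print("promote candidate:", passes_all(candidate, checks))
```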

Part of what will make this phase of the plan successful is getting enough of a head start over rogue AGI that the digital humans are not under intense competitive pressure to maintain their lead. We need to be on top of policing AI worldwide, both physically and digitally. We don’t want digital humans forced into a competition where they must sacrifice more and more of their slack / non-instrumentally-useful values in order to keep up with AGI or each other. That would be falling into the classic pitfall of failing by becoming what you are fighting against.