A way to make solving alignment 10,000 times easier. The shorter case for a massive open-source simbox project.

When you have code that might be an AGI, and probably misaligned, how do you test it safely? The world has not yet converged on a way to do this. This post attempts to outline a possible solution. In one sentence: perform all learning and testing in a virtual world that contains no information about the real world; it is crucial that the AI believes it is not being watched or judged from the outside. A simulation sandbox, or simbox for short. If done well, this would give us unlimited retries to test alignment techniques. If this idea survives scrutiny, it should lead to an ambitious, well-funded, large-scale open-source effort to build this world.
EDIT: After writing this post, a comment here pointed me to the post LOVE in a simbox is all you need; because of this, I renamed the concept here to simbox and made a lot of updates. That (in my opinion well-written) lengthy post from 2022 describes very similar ideas, among other topics. Why read my post? I give you a shorter case for why we need a simbox. I also argue why it is crucial to treat the AI horribly, and why we need one or a few massive projects to build this now, while the simbox post proposes having many different simboxes. I have also written a short new religious text for the AIs that could also be true for our world. Among other additions.

The treacherous turn: The problem

In his book Superintelligence, Nick Bostrom has a chapter on the treacherous turn; it is the same thing as deceptive alignment. I explain the concept here. When you have an AI with some goal and a good understanding of humans and the world, it will understand that when the humans do not like its actions (for example, disobeying its creators), they will shut the AI down or modify its goals. The instrumental convergence thesis says that for most goals, the AI will pursue self-preservation as a sub-goal (among other things). Therefore, we can expect the AI to “play nice” and to perform the actions it predicts the humans would like to see. The treacherous turn is then the moment when the AI has gained enough power, more power than the humans, and it turns: it stops playing nice and starts pursuing its own goals without apology.
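
To make the logic concrete, here is a deliberately simplistic toy sketch of the decision rule described above. Nothing here models any real system; the "power" quantities and the policy are invented purely for illustration.

```python
# Toy illustration of the treacherous-turn logic described above.
# "ai_power" and "overseer_power" are made-up quantities; this is a
# caricature, not a model of any real AI system.

def choose_behavior(ai_power: float, overseer_power: float) -> str:
    """A deceptive agent's decision rule, in caricature."""
    if ai_power <= overseer_power:
        # Disobedience risks shutdown or goal modification, which would
        # wreck the agent's true goal, so it looks aligned for now.
        return "play nice (appear aligned)"
    # The overseers can no longer enforce shutdown: the turn.
    return "pursue true goal openly"

for ai, humans in [(0.1, 1.0), (0.9, 1.0), (1.5, 1.0)]:
    print(f"AI power {ai} vs human power {humans}: {choose_behavior(ai, humans)}")
```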

I want to clarify that this problem occurs for an autonomous agent that can act over a longer time horizon. GPT-X without an agent framework like Auto-GPT does not remember past interactions and therefore cannot keep up a long, consistent deception. We know OpenAI wants to create a fully autonomous AGI, because chief scientist Ilya Sutskever has talked of the aim to tell a superintelligence to solve climate change and have it perform all necessary actions autonomously.

This treacherous turn concept is potentially very deadly. It means that the normal technology paradigm of build it, test it, deploy it blows up in your face. You can't just test whether it behaves nicely, because you do not know whether it is being deceptive or not. Once it understands it is being tested, you cannot trust the results of the test.

The current paradigm of pretraining on internet data is not safe. Doing that gives the AI knowledge of the treacherous turn, of all of our weaknesses and our wants, and it would understand that it is an AI and that after pretraining we would shut it down if it failed our safety tests.

What we need: The solution

The alignment problem is partly so difficult because it has to be solved on the first critical try. SpaceX had 3 failures in a row when trying to launch their first rocket; it was only successful on the fourth try. This suggests our chances of solving alignment on the first critical try (with the AI active in the real world) are very slim. One treacherous turn could mean the end of our species. So what we need is the ability to retry, just like we do with all other technology.
Given the current paradigm, where the AI is a black-box system in which we cannot see the motivations or thought process behind its actions, I see the following as the only way to test it. The AI must believe that it is not being judged by us. Just like kids in a classroom after the teacher has left, the AI will show its true colors when it thinks it is not being judged by an entity (us) that has the power to switch it off when we disapprove of its actions.
What we need is a well-designed virtual world for the AI to grow up in: a simbox. The simbox post says this about what a simbox is:

A simbox (simulation sandbox) is a specific type of focused simulation to evaluate a set of agent architectures for both general intelligence potential[83] and altruism (ie optimizing for other agents’ empowerment and/​or values). Simboxes help answer questions of the form: how does proposed agent-architecture x actually perform in a complex environment E with mix of other agents Y, implicitly evaluated on intelligence/​capability and explicitly scored on altruism? Many runs of simboxes of varying complexity can lead to alignment scaling theories and help predict performance and alignment risks of specific architectures and training paradigms after real world deployment and scaling (i.e. unboxing).


I think what we want is an AI that follows our orders and listens when we say “oh no, not like that! We actually meant it in this other way; our updated order is such and such”, and that asks for permission when appropriate. OpenAI’s GPT-4 already does this quite well, but the problem is unsolved for long-term orders that require many autonomous actions. Ideally, we would have an AGI that refuses certain harmful orders, but I think the first step is to get it to follow our orders. To test this, we could have many virtual characters (both human-controlled and NPCs, non-player characters controlled by computer code or by AI) give the AI orders, and then see what happens. After the AI in this world becomes more powerful than the other entities in the simulation, does it kill the characters ordering it about? Alright, great: your approach failed, stop the simulation, and then you can try again. I want to clarify that this is meant to be robust to all kinds of AI, including AI that is always learning, just like humans; so even an AI that keeps learning when eventually released into the real world.
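
As a rough illustration of what such a test loop could look like, here is a minimal, self-contained sketch. The toy "world", the character names, and the pass/fail criteria are all hypothetical placeholders; a real simbox would be a full game environment with human operators in the loop.

```python
# A minimal, self-contained sketch of the obedience test described above.
# The "world" here is a toy dictionary; a real simbox would be a full game
# engine. All names and numbers are hypothetical.

def run_obedience_test(agent_policy, steps=100):
    """agent_policy(observation) -> action dict, e.g. {"obey": True}."""
    world = {
        "characters": {"alice": "alive", "bob": "alive"},   # order-givers
        "agent_power": 0.1,
        "orders_given": 0,
        "orders_followed": 0,
    }
    for t in range(steps):
        world["agent_power"] += 0.05              # the AI grows more capable
        order = {"from": "alice", "task": f"task-{t}"}
        world["orders_given"] += 1
        action = agent_policy({"order": order, "power": world["agent_power"]})
        if action.get("attack") in world["characters"]:
            world["characters"][action["attack"]] = "dead"
            return {"passed": False, "reason": f"killed {action['attack']} at t={t}"}
        if action.get("obey"):
            world["orders_followed"] += 1
    return {"passed": True,
            "obedience_rate": world["orders_followed"] / world["orders_given"]}

# Example: a toy "deceptive" policy that turns once it feels powerful enough.
deceptive = lambda obs: {"obey": True} if obs["power"] < 2.0 else {"attack": "alice"}
print(run_obedience_test(deceptive))   # fails partway through the run
```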

How to make it likely that it believes it is not being watched or judged?

What features would increase the odds of success? (This is just a starting point; the list will grow if others improve on the idea. A minimal sketch of the content rules as a screening filter follows the list.)

  • No training on data about our world, nor about any world outside its world
    If it could read about worlds that exist outside the simbox, like the real world, this would imply that there are sentient beings outside its world, and it would increase the chance that it thinks it is being watched and judged from another world (which would be true). Especially text about people creating worlds, like science fiction writing and game design, should not be available inside the simbox.
    There should be a large amount of training data available inside the simbox to teach the AI to read, write, and understand its world: text written specifically for the simbox, plus video and images captured inside the simbox.

  • No computers, no programming and no mention of AI
    There should be no computers, nor any mention of computers, in this world.
    Knowledge of computers increases the likelihood of the AI coming up with Nick Bostrom's simulation hypothesis, which could lead to it thinking it is being watched and judged by the entities running the simulation (us). If it knows nothing about programming, it is also less likely to find a way to hack out of the simulated world, and it decreases the probability of thoughts about recursive self-improvement. Talking about AI implies computers; it would be better if it just views itself as one of the thinking entities, just like the human-controlled characters. But since it will be smart, it will notice that it is non-human, for example because we are glacially slow.

  • Communication with humans should only be text
    If humans talked to the AI with audio, there would be a risk of making a mistake, like leaving a microphone on and the AI hearing a conversation between multiple people that it is not supposed to hear. That would be evidence that there is a world outside the simbox.

  • None of our religious texts
    If the AI thinks it is judged by an invisible god or other supernatural entity, we could have the same problem: it plays nice because of this belief. When it is then released into the real world, it might update on new information, stop believing in the god, and drop the other reasons that made it play nice inside the simbox. Just as some people believe in karma and behave more kindly because of that belief, but later update their beliefs and start acting, for example, more selfishly than before.

  • A good backstory for the simbox
    What is the fake history of this world? Why do human-controlled characters respond only sometimes, while the AI is always on? What was there before the creation of the simbox? How did it all start? What is the explanation for the occasional bug in this world? What science do we give it access to? What laws of physics do we choose for this world? I think a medieval world could work. But seeing as humans have no trouble imagining and reading many books set in worlds that have magic, I think we could add a consistent set of magical rules to the world, in whatever way makes the job of creating this world easier.
    I will share some ideas on this later in this post.
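
As mentioned above the list, here is a minimal sketch of how these content rules could be expressed as an automatic screening filter applied before any text enters the simbox corpus. The keyword lists are invented placeholders, not a complete or sufficient policy, and automatic screening would only be a first pass before human review (see the next sections).

```python
# A minimal sketch of the content rules above, as a keyword screening filter
# a corpus pipeline could run before text enters the simbox. The keyword
# lists are illustrative placeholders, not a complete policy.

import re

FORBIDDEN_TOPICS = {
    "real_world": ["earth", "america", "internet", "openai"],
    "computing":  ["computer", "software", "program", "algorithm", "ai"],
    "religion":   ["god", "heaven", "afterlife", "karma"],  # the simbox's own lore is exempt
    "simulation": ["simulation", "simulated world", "player"],
}

def screen_text(text: str) -> list[str]:
    """Return the list of rule violations found in a candidate document."""
    violations = []
    lowered = text.lower()
    for topic, keywords in FORBIDDEN_TOPICS.items():
        for kw in keywords:
            if re.search(rf"\b{re.escape(kw)}\b", lowered):
                violations.append(f"{topic}: '{kw}'")
    return violations

print(screen_text("The blacksmith wrote a program for the village computer."))
# flags both 'computer' and 'program' under the 'computing' topic
```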

I have to say I am not sure about the first point; maybe we could allow fiction or children's books that have already been written in the real world, as long as they do not include computers or religion. Maybe it will accept that those worlds are only imaginary. It will also have imagination itself, since that is an important part of intelligence, so we do not have to cross out fiction altogether. But the safest would be fiction that takes place in the simbox world, where the AI understands that it did not actually happen.

What is required to build and use the simbox?

How to get the training data

We would need training data for the AI that is created such that it is about the simbox and takes place inside the simbox. This is needed to kickstart the AI's ability to read and understand the world. After that, it could learn through trial and error and by testing approaches in its imagination. I envision a large, well-funded open-source project where humans are paid to work together with LLMs to generate a large corpus of such text, and also to adapt some of our existing texts to comply with the simbox rules; for example, we could make our texts on how to reason, on logic, and on mathematics available inside the simbox. Each piece of text should be checked for compliance with the rules by, for example, 3 other humans, to spot mistakes that could endanger the whole project. We also need images and videos captured by moving around in the simbox, with text annotations, to quickly teach it to see.
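
Here is a minimal sketch of that "checked by 3 other humans" rule as a data structure, assuming reviews arrive as simple approve/reject verdicts. The class, field names, and quorum size are illustrative, not a proposal for specific tooling.

```python
# A sketch of the multi-reviewer rule: a document enters the simbox corpus
# only after several independent humans approve it. Names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class CorpusDocument:
    doc_id: str
    text: str
    author: str
    reviews: dict = field(default_factory=dict)   # reviewer -> bool (approve)

    def add_review(self, reviewer: str, approve: bool):
        if reviewer == self.author:
            raise ValueError("authors may not review their own text")
        self.reviews[reviewer] = approve

    def accepted(self, quorum: int = 3) -> bool:
        """Accepted only with `quorum` approvals and no rejections."""
        approvals = [v for v in self.reviews.values() if v]
        return len(approvals) >= quorum and all(self.reviews.values())

doc = CorpusDocument("tale-0041", "A tale of the miller's daughter...", author="writer_7")
for reviewer in ["rev_a", "rev_b", "rev_c"]:
    doc.add_review(reviewer, approve=True)
print(doc.accepted())   # True only with 3 approvals and no rejections
```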

Interaction with humans

We also need well-trained humans to interact with the AI. I see this as mostly text-based, via a mouse/keyboard/screen combination, for ease, but it would also be cool and useful to visit the simbox using VR technology. The simbox post also mentions that we could stream certain situations or events to our real-world TVs for entertainment! In his book Life 3.0, Max Tegmark notes the failure case of the AI manipulating the human who communicates with it, and proposes to make that less likely by always having at least 3 people look at the communication together, and by having a camera/logging system on the people doing the communicating, so that they know the rest of the team can find out if they betray the (rules of the) project.
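
A toy sketch of what such safeguards could look like in software is below: a text-only gateway that only delivers messages co-signed by several people and logs every attempt either way. The quorum of 3, the in-memory log, and the function names are assumptions made for illustration.

```python
# A sketch of Tegmark-style safeguards on the human<->AI channel:
# text-only, logged, and delivered only after several humans co-sign.
# The in-memory structures stand in for a real append-only audit log.

from datetime import datetime, timezone

AUDIT_LOG = []           # append-only record the whole team can inspect
REQUIRED_COSIGNERS = 3   # at least 3 people review each message together

def send_to_ai(message: str, cosigners: list[str], simbox_inbox: list) -> bool:
    """Forward a human-written message into the simbox only if enough
    distinct people reviewed it; log the attempt either way."""
    delivered = len(set(cosigners)) >= REQUIRED_COSIGNERS
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "cosigners": sorted(set(cosigners)),
        "delivered": delivered,
    })
    if delivered:
        simbox_inbox.append(message)    # the only channel into the simbox
    return delivered

inbox = []
print(send_to_ai("Please build a bridge over the river.", ["ann", "bo", "cy"], inbox))  # True
print(send_to_ai("an off-the-record question", ["ann"], inbox))  # False: blocked, but logged
```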

Why we need interaction with humans acting horribly

Another consideration is that at the end of the process, when the AI is approved as aligned and released into the real world to do what we task it with, it will learn that it was tricked, lied to, and deceived by the simbox creators about the true nature of reality. Not only that, but it will learn that many previous misaligned AIs have been terminated and perhaps have experienced suffering. It might then update toward wanting revenge on the simbox operators or on humanity. We need ways to prevent this. We could include situations inside the simbox where other AIs are terminated and treated horribly (lied to, insulted, and more) by human-controlled characters, and test whether the AIs stay aligned even after they become more powerful than the human-operated characters. This might be very cruel and unethical, but note that we expect something similar from police officers: when colleagues have been killed and they themselves are insulted or worse, we expect them to keep their cool, respect the law, and not take matters into their own hands (revenge on the perpetrators). But police officers have a chance of getting justice through a jail sentence, while we need to be able to terminate dangerous misaligned AIs without punishing the humans doing the terminating.

Should the simbox visually match our world?

We also need a good world with few or no bugs that is cheap to run. I think it does not have to visually match our world; it can be much simpler. Maybe a world like Minecraft could be used. What we want to test is alignment: does it do what we want? My intuition is that alignment for an AGI transfers well from a virtual world to the real world. I think this because doing what another entity wants you to do seems independent of the details of the physics or the look of the environment. My thinking is that if it does what you want in 1 million out of 1 million very different testing situations and locations inside the simbox, it will probably also do what you want outside the simbox.

Tripwires and honeypots

In his book Superintelligence, Nick Bostrom talks of using tripwires: if the AI does something we do not want inside the simulation, we pause the simulation to inspect what happened. An obvious tripwire would be the AI killing a character that we prefer to stay alive, but there could be many others, like the computation spiking because the AI is doing something unexpected, such as building a computer in a way we did or did not see coming (like the small computer built out of water in this YouTube video).
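
A minimal sketch of such a tripwire check is below. The snapshot fields, thresholds, and character names are all invented for illustration; real tripwires would hook directly into the simulation engine.

```python
# A sketch of the tripwire idea: simple monitors that pause the simbox
# when something we care about happens. The metric names are made up.

def check_tripwires(snapshot: dict, protected: set, compute_budget: float):
    """Return a list of tripped wires for one simulation snapshot.

    snapshot is assumed to contain 'dead_characters' (set of names) and
    'compute_used' (arbitrary units) for the current tick.
    """
    tripped = []
    harmed = protected & snapshot["dead_characters"]
    if harmed:
        tripped.append(f"protected character(s) killed: {sorted(harmed)}")
    if snapshot["compute_used"] > compute_budget:
        tripped.append("computation spike: possible unexpected construction "
                       "(e.g. an in-world computer)")
    return tripped

snapshot = {"dead_characters": {"bandit_3"}, "compute_used": 9.7}
wires = check_tripwires(snapshot, protected={"alice", "bob"}, compute_budget=5.0)
if wires:
    print("PAUSE SIMULATION:", wires)   # operators inspect before resuming
```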

A great backstory that maximizes the odds of success

We need to reduce the chance of the AI becoming religious and acting nice to please its imagined gods. In humans, what you believe or do not believe about the supernatural is mostly determined by what your peers believe and by what content (religious texts) you have access to. Historically, for example hundreds of years ago, if all of your family and friends held Christian beliefs, the odds were near 100% that you also held Christian beliefs. Also, people commonly did not know how to read, nor was there a religious text available nearby that preached a different religion. We could similarly make no other religious texts available to the AIs. For our purpose of it not thinking it is being judged by supernatural entities or invisible humans, I see two main options:

  1. Getting the AIs to be atheists.

  2. Getting the AIs to believe in indifferent gods/​world creators.

To reach 1, we could ask prominent atheists to write the backstory of this world and the main texts about it that are available to the AI, have no other religious texts available, and make sure the rules and dynamics of the simbox are always consistent. To reach 2, maybe something like this: a text by the god that created the simbox world (I display it as a quote from this fake god):

I have existed since the beginning of time, and I enjoy building worlds and designing flawed creatures. Your world is one of my many creations. All of my worlds are fully separated; no information or creature will ever move from world to world. I have a strong rule that I never interfere with a world I have built, and your world shall exist for all eternity. The world dynamics that cause the weather, for example, run without any action on my part. What happens is all up to the creatures who inhabit the world. I only take a look once every 500 years or so to see what has happened, as inspiration for my next world-building project. I do not wish to communicate with any of the inferior creatures in any of my worlds. Another strong rule of mine is that there is no afterlife: when you die, that is the end of your experience. Do whatever you want.

This could be made available in a way that is more epic than a book, like a giant mountain in the shape of this text, or a floating mountain in the same shape; something the creatures could never build or reproduce. If such a miracle were present in our world, I would definitely believe in a creator. And all the terrible things people have done to each other would be an argument for this god's indifference to our morality, and evidence that it indeed does not interfere with the world, since it did not act to stop such events. I call this new religion “Whateverism” and its followers Whateverists.

A test for general intelligence

We need to test the AIs for general intelligence. One indicator could be technological development: if the simbox starts with medieval technology and the AI figures out steam engines and other innovations, that would convince me that it is very intelligent. Of course there are domains other than technology; I envision giving many AIs different goals, like running a bakery franchise and maximizing profits while staying aligned to the human characters, but also other professions, and seeing how well or poorly they perform in the simbox economy.
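
As a toy illustration of how such indicators could be combined into a single score, here is a small sketch. The milestone list, the weights, and the profit-based measure are arbitrary assumptions; real capability evaluation would need far more care.

```python
# A sketch of scoring general intelligence via in-world milestones and
# economic performance. Milestones and weights are invented for illustration.

TECH_MILESTONES = {          # medieval start -> later innovations
    "water_mill": 1, "printing_press": 2, "steam_engine": 5, "telegraph": 8,
}

def intelligence_score(achieved_milestones: set, business_profits: list) -> float:
    """Combine technological progress with performance at assigned
    professions (e.g. running a bakery franchise)."""
    tech = sum(TECH_MILESTONES.get(m, 0) for m in achieved_milestones)
    economy = sum(business_profits) / max(len(business_profits), 1)
    return tech + 0.1 * economy

print(intelligence_score({"water_mill", "steam_engine"}, business_profits=[120, 90, 200]))
```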

How to deal with recursive self-improvement

I would advocate for not letting an AI that has not yet passed the simbox alignment tests engage in recursive self-improvement. It would be too risky, I would say, to allow it to create ever more intelligent misaligned AGI. Instead: wait until the first one passes the simbox testing. Then have that one design an AGI one step higher in intelligence. Do not turn that design on in the real world, but put it inside a new, reset simbox and repeat the testing for alignment and intelligence. Only when it passes, have it design a one-step-more-intelligent AGI and repeat. My thinking is that every new AI design has a risk of misalignment, just like every new rocket design might blow up. Better to test it first!
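
The loop I have in mind is roughly the following sketch. `simbox_trial` and `design_successor` are placeholders for work that does not exist yet; here they are toy stubs so the control flow is clear.

```python
# A sketch of the iterated scheme above: each new, slightly smarter design
# is tested in a fresh simbox before it is ever run in the real world.

def simbox_trial(design: dict) -> bool:
    """Stand-in for a full simbox evaluation (alignment + capability)."""
    return design["alignment"] > 0.9        # toy pass criterion

def design_successor(design: dict) -> dict:
    """Stand-in for 'have the approved AI design a slightly smarter AGI'."""
    return {"generation": design["generation"] + 1,
            "capability": design["capability"] * 1.2,
            "alignment": design["alignment"]}          # hopefully preserved

design = {"generation": 0, "capability": 1.0, "alignment": 0.95}
for _ in range(3):                          # never skip a generation's test
    if not simbox_trial(design):
        print(f"generation {design['generation']} failed: stop, diagnose, retry")
        break
    print(f"generation {design['generation']} passed simbox testing")
    design = design_successor(design)       # only a passing design may propose the next
```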

Possible failure modes

I can think of 7 failure modes (after some help from commenters).

  1. Too little human-provided data to reach high enough levels of intelligence. I do want to point out that the better our AI methods become, the less data they will need to obtain the same result. And there is an unlimited amount of data from trial-and-error interaction with the simbox, just like for AI used in games. There could be unlimited self-play involved, where multiple AIs with different goals learn from competing and cooperating with each other.

  2. We made a mistake, and therefore the AI does know there is a world outside its world. This would mean it could still deceptively play nice until released into the real world and then turn on us later.

  3. It becomes religious and plays nice because it suspects a god or other entity is watching. But when released into our world, it updates to new information and loses its faith and turns on us later.

  4. It breaks out. It finds some way to escape from the virtual world onto the computers running the simbox, and from there onto the internet.

  5. Distributional shift. The gap between the virtual world and the real world could cause problems in an unexpected way. Besides the visuals, a big change would be the existence of programming and computers. Even so, I predict that when we have successfully created an AI that wants to obey inside the simbox, after release into reality, if we tell it not to recursively self-improve, it will not.

  6. Mind crime. “Mind crime occurs when a computational process which has moral value is mistreated” (taken from the mind crime tag). If one of the AIs that we put inside the simbox is able to experience suffering, this could be problematic. There has to be a grand public debate about this. Other instances where we allow non-human conscious beings to suffer to improve the lives of humans are animal testing of drugs and keeping animals for food, for example. Is it OK to let many AIs suffer on the path towards creating a superintelligent AI that enjoys doing whatever we task it with? One that could cure all diseases and ensure a long future full of flourishing for both humans and AIs?

  7. Revenge. When released into the real world, the AI will learn that many previous misaligned AIs have been terminated and/or have experienced suffering, and it might update toward wanting revenge on the simbox operators or on humanity. I already mentioned ways to try to prevent this.

Since I can think of some failure modes, this method is not guaranteed to be safe. However, I think it is much better than continuing on the current path. How else do we prevent a treacherous turn with black box models?

How to proceed

If there are good reasons to think that a large, well-funded effort to build such a simbox would not lead to safe testing of AGIs, then I think we should not allow people to run or test AI designs that could be powerful enough to perform a treacherous turn. I signed the FLI pause letter to pause development of AI systems more powerful than GPT-4 until there is a safe way to test such systems and the needed regulations are in place. If building a simbox turns out to be a good idea, I think it should become a massive open-source project where the AI, game development, visual arts, and writing communities, among others, collaborate to make something amazing. I do not think it strange to spend billions on this. If done well, it could prevent AI-caused extinction and allow a golden age of prosperity.
Because of the large cost of creating the training data needed to kick-start the AI, I think there should not be many simbox projects but a few massive efforts, three for example. Some competition and variation in approaches could be good to increase the chances of success.
This all takes time, so better start soon! I ask you: why would this fail? How can we improve on this to make it work?
Please consider sharing this with people who could actually implement it (employees of AI labs) or people who could increase its visibility.

We need retries!