Yet another safe oracle AI proposal

Previously I posted a proposal for a safe self-improving limited oracle AI but I’ve fleshed out the idea a bit more now.

Disclaimer: don’t try this at home. I don’t see any catastrophic flaws in this but that doesn’t mean that none exist.

This framework is meant to safely create an AI that solves verifiable optimization problems; that is, problems whose answers can be checked efficiently. This set mainly consists of NP-like problems such as protein folding, automated proof search, writing hardware or software to specifications, etc.

This is NOT like many other oracle AI proposals that involve “boxing” an already-created possibly unfriendly AI in a sandboxed environment. Instead, this framework is meant to grow a self-improving seed AI safely.


  1. Have a bunch of sample optimization problems.

  2. Have some code that, given an optimization problem (stated in some standardized format), finds a good solution. This can be seeded by a human-created program.

  3. When considering an improvement to program (2), allow the improvement if it makes it do better on average on the sample optimization problems without being significantly more complex (to prevent overfitting). That is, the fitness function would be something like (average performance—k * bits of optimizer program).

  4. Run (2) to optimize its own code using criterion (3). This can be done concurrently with human improvements to (2), also using criterion (3).


First, let’s say we’re writing this all in Python. In real life we’d use a language like Lisp because we’re doing a lot of treatment of code as data, but Python should be sufficient to demonstrate the basic ideas behind the system.

We have a function called steps_bounded_eval_function. This function takes 3 arguments: the source code of the function to call, the argument to the function, and the time limit (in steps). The function will eval the given source code and call the defined function with the given argument in a protected, sandboxed environment, with the given steps limit. It will return either: 1. None, if the program does not terminate within the steps limit. 2. A tuple (output, steps_taken): the program’s output (as a string) and the steps the program took.


  def function(x):
    return x + 5
""", 4, 1000)

evaluates to (9, 3), assuming that evaluating the function took 3 ticks, because function(4) = 9.

  def function(x):
    while True: # infinite loop
""", 5, 1000

evaluates to None, because the defined function doesn’t return in time. We can write steps_bounded_eval_function as a meta-circular interpreter with a bit of extra logic to count how many steps the program uses.

Now I would like to introduce the notion of a problem. A problem consists of the following:

  1. An answer scorer. The scorer should be the Python source code for a function. This function takes in an answer string and scores it, returning a number from 0 to 1. If an error is encountered in the function it is equivalent to returning 0.

  2. A steps penalty rate, which should be a positive real number.

Let’s consider a simple problem (subset sum):

{'answer_scorer': """
  def function(answer):
    nums = [4, 5, -3, -5, -6, 9]
    # convert "1,2,3" to [1, 2, 3]
    indexes = map(int, answer.split(','))
    assert len(indexes) >= 1
    sum = 0
    for i in indexes:
      sum += nums[i]
    if sum == 0:
      return 1
      return 0
 'steps_penalty_rate': 0.000001}

We can see that the scorer function returns 1 if and only if the answer specifies the indexes of numbers in the list nums that sum to 0 (for example, ‘0,1,3,4’ because 4+5-3-6=0).

An optimizer is a program that is given a problem and attempts to solve the problem, returning an answer.

The score of an optimizer on a problem is equal to the score according to the answer-scorer, minus the steps penalty rate multiplied by the number of steps used by the optimizer. That is, the optimizer is rewarded for returning a better answer in less time. We can define the following function to get the score of an optimizer (Python source code) for a given problem:

def problem_score(problem, optimizer_source):
  # run the optimizer on the problem
  result = steps_bounded_eval_function(
    optimizer_source, problem, 1 / problem['steps_penalty_rate'])
  if result == None: # used all available steps, or got an error
    return 0.0
  answer, steps_taken = result # optimizer returned a result in time
  # get the score using the problem's answer_scorer
  answer_score = eval_function(problem['answer_scorer'], answer)
  assert 0 <= answer_score <= 1
  # penalize for taking up time and make sure the result is non-negative
  return max(0.0, answer_score - problem['steps_penalty_rate'] * steps_taken)

A simple optimizer that detects subset-sum problems (by inspecting the source code) and uses a brute-force method to solve it will get a problem_score close to 1 for small subset-sum problems. This optimizer would not do well on other problems; other techniques (such as evaluating the problem source code with different answers and choosing a good one) are needed to solve a variety of different optimization problems. Writing a good optimizer is very hard work, just like any seed AI.


The framework has 4 modules:

  1. A set of training problems. These should cover a wide range of problems that we would like the AI to solve.

  2. An optimizer, written in Python. This should be seeded with a very good human-written optimizer. This is deliberately unspecified by me because it’s a really hard problem (as is any seed AI).

  3. A scorer for optimizer source code defined as follows:

    def optimizer_score(candidate_optimizer_source):
      training_problems = [ problems here...]
      # a parameter of the system; more on this later
      complexity_penalty = 0.1
      # total up the program's score on all training problems
      total_score = 0.0
      for problem in training_problems:
        total_score += problem_score(candidate_optimizer_source, problem)
      # penalize for complexity, to prevent overfitting
      total_score -= complexity_penalty * len(compress_binary(candidate_optimizer_source))
      # return average score
      return max(0, total_score / len(training_problems))

    This gives a candidate optimizer a score in the range [0, 1] based on both its average performance on the sample set and its inherent complexity. Presumably optimizers with a higher optimizer_score will do better on future optimization problems.

  4. A self-optimization thread. This thread continuously runs program 2 on a problem formed using 3′s answer_scorer and an ever-decreasing steps_penalty_rate. Whenever program 2 outputs source code (optimizer_source) that is better than the current source code for 2, the source code for 2 is replaced with this new value. Also, humans can make improvements to program 2 if it increases its score according to 3′s answer. Source code:

    # assume we have access to an optimizer_source variable (program 2)
    def self_optimization_thread():
      start_steps_penalty_rate = 0.000001
      steps_penalty_rate = start_steps_penalty_rate
      while True: # loop forever
        self_optimization_problem = {
          # just use program 3 to score the optimizer
          'answer_scorer': """
            def function(candidate_optimizer_source):
              ... put the source code from program 3's optimizer_score here
          'steps_penalty_rate': steps_penalty_rate
        # call the optimizer (program 2) to optimize itself, giving it limited time
        result = steps_bounded_eval_function(
          optimizer_source, self_optimization_problem, 1 / steps_penalty_rate)
        changed = False
        if result is not None: # optimizer returned in time
          candidate_optimizer = result[0] # 2 returned a possible replacement for itself
          if optimizer_score(candidate_optimizer) > optimizer_score(optimizer_source):
            # 2's replacement is better than 2
            optimizer_source = candidate_optimizer
            steps_penalty_rate = start_steps_penalty_rate
            changed = True
        if not changed:
          # give the optimizer more time to optimize itself on the next iteration
          steps_penalty_rate *= 0.5

So, what does this framework get us?

  1. An super-optimizer, program 2. We can run it on new optimization problems and it should do very well on them.

  2. Self-improvement. Program 4 will continuously use program 2 to improve itself. This improvement should make program 2 even better at bettering itself, in addition to doing better on other optimization problems. Also, the training set will guide human improvements to the optimizer.

  3. Safety. I don’t see why this setup has any significant probability of destroying the world. That doesn’t mean we should disregard safety, but I think this is quite an accomplishment given how many other proposed AI designs would go catastrophically wrong if they recursively self-improved.

I will now evaluate the system according to these 3 factors.

Optimization ability

Assume we have a program for 2 that has a very very high score according to optimizer_score (program 3). I think we can be assured that this optimizer will do very very well on completely new optimization problems. By a principle similar to Occam’s Razor, a simple optimizer that performs well on a variety of different problems should do well on new problems. The complexity penalty is meant to prevent overfitting to the sample problems. If we didn’t have the penalty, then the best optimizer would just return the best-known human-created solutions to all the sample optimization problems.

What’s the right value for complexity_penalty? I’m not sure. Increasing it too much makes the optimizer overly simple and stupid; decreasing it too much causes overfitting. Perhaps the optimal value can be found by some pilot trials, testing optimizers against withheld problem sets. I’m not entirely sure that a good way of balancing complexity with performance exists; more research is needed here.

Assuming we’ve conquered overfitting, the optimizer should perform very well on new optimization problems, especially after self-improvement. What does this get us? Here are some useful optimization problems that fit in this framework:

  1. Writing self-proving code to a specification. After writing a specification of the code in a system such as Coq, we simply ask the optimizer to optimize according to the specification. This would be very useful once we have a specification for friendly AI.

  2. Trying to prove arbitrary mathematical statements. Proofs are verifiable in a relatively short amount of time.

  3. Automated invention/​design, if we have a model of physics to verify the invention against.

  4. General induction/​Occam’s razor. Find a generative model for the data so far that optimizes P(model)P(data|model), with some limits on the time taken for the model program to run. Then we can run the model to predict the future.

  5. Bioinformatics, e.g. protein folding.

These are all problems whose solutions can be efficiently evaluated and that a computer could plausibly solve, so I think this framework should provide good solutions to them. If the optimizer this framework produces performs well on all these problems, I think it deserves to be called an oracle AGI.


It seems that, if my arguments about optimization ability are convincing, an optimizer that does very well on the sample problems should also do well on the specific optimization problem of improving itself. This is especially true if the training problems include some program-optimization problems (i.e. finding a program that does well on some task). Once the optimizer improves itself, it will be better at improving itself, and so on. If the AI is software-limited, the framework should allow it to recursively improve itself up to a very high limit. Then we can use its hardware design power to design better chips for it and allow it to optimize even further.

The optimizer’s ability to self-improve is limited by the fact that the other parts of the program (1, 3, and 4) are fixed. The system is greedy in the sense that it just looks for even the slightest improvement, then changes the code and looks for more improvements, without any foresight. I don’t know how much of a limitation this is. I don’t see any reason that greedy improvements to the optimizer are insufficient for super-optimization from recursive self-improvement.

Currently, the optimizer must completely derive a new optimizer from the specification (program 3), without looking at the current best optimizer (its own source code). In practice we might want to extend the framework to include incremental improvements (improvements on top of existing good answers). I don’t know if a good theory for scoring incremental optimizers exists.


Probably the most important property of the system. What use is optimization if the world has been converted to computron in order to power a super-powerful protein folder? Here are a few different possible threats:

Underhanded question answers

The optimizer could give answers that are very good according to the problem description but are specifically crafted to cause things unintended by the human operators. For example, it could create a software design that, while correct according to specifications, has a “bug” that causes it to hack into computers on the internet and hook them up to the optimizer system. I don’t think this is at all likely because the system is extremely short-sighted. Any self-modification must improve its immediate ability to do better on optimization problems. Furthermore, all self-modifications are found using methods that are judged by only immediate improvements. So any kind of long-term strategy (sacrificing some optimization power so it will have resources in the future) is out of the question. In other words, optimizer_score should not be seen as a utility function because the system only tries to improve it using greedy methods, not long-term planning.

Bugs in the system

What if the system gives the optimizer an incorrect score under some circumstances (e.g. if it performs a certain pattern of memory accesses)? Say that, by chance, the optimizer’s improvement to itself causes it to get an incorrect score. It might internalize the rule “perform memory accesses in this pattern” to get a higher score. This itself is not especially dangerous; the optimizer will rewrite itself to just do a bunch of weird memory accesses that give it a high score.

What might be more dangerous is if the optimizer discovers an underlying pattern behind the system’s hackability. Since the optimizer is penalized for complexity, a program like “do things that, when executed on a certain virtual machine, cause this variable in the machine to be a high number” might have a higher score than “do this certain complex pattern of memory accesses”. Then the optimizer might discover the best way to increase the score variable. In the absolute worst case, perhaps the only way to increase the score variable is by manipulating the VM to go on the internet and do unethical things. This possibility seems unlikely (if you can connect to the internet, you can probably just overwrite the score variable) but should be considered.

I think the solution is straightforward: have the system be isolated while the optimizer is running. Completely disconnect it from the internet (possibly through physical means) until the optimizer produces its answer. Now, I think I’ve already established that the answer will not be specifically crafted to improve future optimization power (e.g. by manipulating human operators), since the system is extremely short-sighted. So this approach should be safe. At worst you’ll just get a bad answer to your question, not an underhanded one.

Malicious misuse

I think this is the biggest danger of the system, one that all AGI systems have. At high levels of optimization ability, the system will be able to solve problems that would help people do unethical things. For example it could optimize for cheap, destructive nuclear/​biological/​nanotech weapons. This is a danger of technological progress in general, but the dangers are magnified by the potential speed at which the system could self-improve.

I don’t know the best way to prevent this. It seems like the project has to be undertaken in private; if the seed optimizer source were released, criminals would run it on their computers/​botnets and possibly have it self-improve even faster than the ethical version of the system. If the ethical project has more human and computer resources than the unethical project, this danger will be minimized.

It will be very tempting to crowdsource the project by putting it online. People could submit improvements to the optimizer and even get paid for finding them. This is probably the fastest way to increase optimization progress before the system can self-improve. Unfortunately I don’t see how to do this safely; there would need to be some way to foresee the system becoming extremely powerful before criminals have the chance to do this. Perhaps there can be a public base of the project that a dedicated ethical team works off of, while contributing only some improvements they make back to the public project.

Towards actual friendly AI

Perhaps this system can be used to create actual friendly AI. Once we have a specification for friendly AI, it should be straightforward to feed it into the optimizer and get a satisfactory program back. What if we don’t have a specification? Maybe we can have the system perform induction on friendly AI designs and their ratings (by humans), and then write friendly AI designs that it predicts will have a high rating. This approach to friendly AI will reflect present humans’ biases and might cause the system to resort to manipulative tactics to make its design more convincing to humans. Unfortunately I don’t see a way to fix this problem without something like CEV.


If this design works, it is a practical way to create a safe, self-improving oracle AI. There are numerous potential issues that might make the system weak or dangerous. On the other hand it will have short-term benefits because it will be able to solve practical problems even before it can self-improve, and it might be easier to get corporations and governments on board. This system might be very useful for solving hard problems before figuring out friendliness theory, and its optimization power might be useful for creating friendly AI. I have not encountered any other self-improving oracle AI designs for which we can be confident that its answers are not underhanded attempts to get us to let it out.

Since I’ve probably overlooked some significant problems/​solutions to problems in this analysis I’d like to hear some more discussion of this design and alternatives to it.