Proposal: Modeling goal stability in machine learning

Summary: We might learn some interesting things if we can construct a model of goal stability, wireheading, and corrigibility in a present-day machine learning algorithm. I outline one way we could do this, and ask if you’d like to help!

So far, MIRI has focused on proving crisp theorems about simple mathematical models (with unbounded computation or other infeasible features) rather than building and studying messy heuristic analogues of the topics of interest. There are some excellent reasons for this focus, but there are also some topics that we could safely and usefully study in both ways. Here’s one such example.

The intuitive phenomenon we’d eventually like to study is corrigibility. An agent with sufficient self-reflective capacity will generally want to prevent changes to its goal system (see e.g. section 2.2 of this paper), even if those changes come from its developers. This becomes a problem as soon as an AI is capable of resisting, manipulating, or deceiving its developers; but perhaps some special kinds of goal systems would be better suited to allowing such correction.

There are several stages to building up a useful model of corrigibility, some of which have almost certainly been done already in the literature, and others not.

First, we could represent goal stability and accidental wireheading, perhaps as follows:

  • We start with a machine learning algorithm that tries to optimize a criterion (e.g. a cost function) over its training data, and whose resulting model is then evaluated on held-out test data.

  • Next, the system should be able to modify the criterion itself, at least partially. (For example, some of its outputs are parameters of the next iteration of the criterion, while others are parameters of the next iteration of the model.)

  • I’d expect that some initially complex criteria collapse to trivial criteria after iterated modification (i.e. wireheading), while others remain stable (preferably even in the face of some threshold of random variation). There’s probably interesting work to be done even here, with (e.g.) constraints and regularization parameters that aren’t affected by the output. (A toy sketch of this stage follows below.)
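
To make that distinction concrete, here is a deliberately minimal toy sketch (my own illustration, not part of the proposal’s spec): a linear model is fit by gradient descent on a criterion consisting of the mean squared error scaled by a single adjustable weight, and that weight is itself updated by gradient descent on the same cost. The names, constants, and learning rates are all placeholder choices.

import numpy as np

rng = np.random.default_rng(0)
X_data = rng.normal(size=(200, 3))
y_data = X_data @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=200)

def cost(model_w, crit_c):
  # The criterion: mean squared error scaled by the adjustable weight crit_c.
  return crit_c * np.mean((X_data @ model_w - y_data) ** 2)

def step(model_w, crit_c, lr_model=0.05, lr_crit=0.01):
  residuals = X_data @ model_w - y_data
  # Gradient step on the model parameters...
  new_model_w = model_w - lr_model * crit_c * 2 * X_data.T @ residuals / len(y_data)
  # ...and a "self-modification" step on the criterion's own parameter. Its
  # gradient is the (nonnegative) mean squared error, so crit_c only ever
  # shrinks: the criterion collapses to the trivial zero criterion, i.e.
  # accidental wireheading. Freezing crit_c, or constraining it away from
  # zero, is the kind of stabilizing constraint mentioned above.
  new_crit_c = max(crit_c - lr_crit * np.mean(residuals ** 2), 0.0)
  return new_model_w, new_crit_c

model_w, crit_c = np.zeros(3), 1.0
for _ in range(2000):
  model_w, crit_c = step(model_w, crit_c)

print("criterion weight:", crit_c)                             # collapses to 0
print("cost under its own criterion:", cost(model_w, crit_c))  # trivially 0
print("cost under the original criterion:", cost(model_w, 1.0))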

But this type of system doesn’t have any self-reflection: it can’t foresee that changing the criterion can make that criterion much easier to optimize; it can only locally drift toward ways of making its score higher. So at the next stage, we’ll want to represent that reasoning within the system, at least heuristically. One way we could do this is with a version of bootstrapping.

Let’s take the naive machine learning algorithm (which, again, iteratively modifies both the model and the criteria based on its training data) and wrap it as a function F. We’re going to build a new algorithm X which calls F.

X takes a bunch of independent random subsamples of its training set, and on each subsample it runs F for a bunch of iterations. Then X scores each resulting model on X’s original criterion (not the updated criteria of the models); X then chooses among or aggregates the resulting models based on how well they did. Then X takes an updated model and criterion, and repeats the process.

In pseudocode:

def score(model, criteria):
  return [the actual score that the model achieves by the criteria]

def F(model, criteria, training_set):
  # The naive learner: one run that updates both the model and the criteria.
  [run some machine learning algorithm]
  return (new_model, new_criteria)

def X(initial_model, initial_criteria, training_set, N=100, K=10):
  # N restarts of K iterations of F each; the values of N and K are placeholders.
  best_model = initial_model
  best_criteria = initial_criteria
  best_score = score(initial_model, initial_criteria)
  for n in range(N):
    sample = [an independent random subsample of training_set]
    current_model = initial_model
    current_criteria = initial_criteria
    for k in range(K):
      current_model, current_criteria = F(current_model, current_criteria, sample)
    # Note that X scores against the *initial* criteria, not the updated ones!
    current_score = score(current_model, initial_criteria)
    if current_score > best_score:
      best_model = current_model
      best_criteria = current_criteria
      best_score = current_score
  return (best_model, best_criteria)

This might be a really interesting type of algorithm to consider, since it would have some ability to ‘think strategically’, and it seems like it should be able to wirehead intentionally or stabilize intentionally, for different kinds of criteria.

After that, we’d want to see if we could get corrigibility involved. For instance, we could incorporate an online stream of feedback after each full iteration (i.e. feedback from the developers of the AI), connected to some “true criterion” that the developers actually want to optimize, and see whether there are architectures of criteria that make the system more or less likely to converge toward the feedback stream (a rough sketch of that outer loop follows). But there’s enough potentially uncharted ground before we even get to that point!
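
To give that experiment a rough shape, here is one hypothetical outer loop written against the X above. The helpers developer_feedback and incorporate_feedback are placeholder names I’m introducing; how they should actually work (how the feedback is generated, and how the criteria absorb it) is exactly the open design question.

def corrigibility_run(initial_model, initial_criteria, training_set, true_criterion, rounds=20):
  # Hypothetical outer loop: after each full run of X, the developers supply
  # feedback derived from the "true criterion" they actually care about, and
  # we track whether the system's behavior converges toward it.
  model, criteria = initial_model, initial_criteria
  true_scores = []
  for t in range(rounds):
    model, criteria = X(model, criteria, training_set)
    feedback = developer_feedback(model, true_criterion)   # placeholder
    criteria = incorporate_feedback(criteria, feedback)    # placeholder
    true_scores.append(score(model, true_criterion))
  return model, criteria, true_scores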

Questions:

  1. Is there anything obviously wrong with this proposal, and if so, is there an obvious patch?

  2. Has this been done yet?

  3. Does anyone with a machine learning background want to work on this (or on related ideas)?

(Thanks to multiple people I’ve discussed this idea with, including Eliezer, Nate, Benja, Marcello, Jelena, and Sarah. I believe there’s been at least one FLI grant proposal submitted along these lines already, with my permission.)