Potential Alignment mental tool: Keeping track of the types

Epistemic status: Came up with what I was thinking of as a notation. Realised that there was a potentially useful conceptual tool behind it.

Consider this diagram.

1) A diagram showing a useless alignment scheme (ignore the computer, make the human do everything.)

Here the green circle on the left represents humans, with V being human values. W is the external world. C is the inside of the computer. W’ is a world model, and V’ is the human values within that world model.

Here the computer has been programmed to form a model of the environment, for example by Solomonoff induction or by a model like GPT-3.

Anything in C is stored on the hard drive. Anything in W’ is mirroring a section of the real world and is part of an arbitrary Turing machine or incomprehensible neural network.

The black bars are bridges: ways that something can get from one domain to another. (Yes, this is basically a graph structure.)

The leftmost bridge is between the human mind and the world. It is made of eyes and muscles.

The next bridge is between the world and the computer. It is made of keyboards and screens.

The next bridge is the interface between the programmed and the learned: the piece of code that forms one-hot vectors from a string of text, and the piece of code that turns the final activations into a choice of next word. The final bridge is the mirror of the first bridge within the simulation: a virtual representation of keyboards and screens, hidden in the weights of a network.
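To make the programmed-to-learned interface concrete, here is a minimal sketch of its two halves. The four-word vocabulary and the function names `encode` and `decode` are invented for illustration, and the activations passed to `decode` are a stand-in for whatever a real network would output.

```python
# Toy vocabulary (an assumption; a real model has tens of thousands of tokens).
VOCAB = ["the", "cat", "sat", "mat"]

def encode(text):
    """Programmed side to learned side: one one-hot vector per word.

    A word that is not in VOCAB becomes an all-zero vector in this sketch.
    """
    return [[1.0 if word == entry else 0.0 for entry in VOCAB]
            for word in text.split()]

def decode(activations):
    """Learned side to programmed side: pick the highest-scoring word."""
    best_index = max(range(len(VOCAB)), key=lambda i: activations[i])
    return VOCAB[best_index]

vectors = encode("the cat")
# Stand-in for a network's final activations, one score per vocabulary entry.
next_word = decode([0.1, 0.2, 2.5, 0.3])
```

Everything in `encode` and `decode` is ordinary, readable code in C. Whatever happens between them is the learned part, and these two functions are the only way anything crosses that bridge.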

The dotted arrow represents the path that human values take to get into V’.
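Since the regions and bridges form a graph, the route that human values take can be found by an ordinary graph search. The sketch below assumes the regions sit in a single chain, V, W, C, W’, V’, which is one reading of the diagram; the bridge labels paraphrase the descriptions above, and `find_path` is a hypothetical helper.

```python
from collections import deque

# Each bridge joins two adjacent regions (chain layout is an assumption).
BRIDGES = {
    ("V", "W"): "eyes and muscles",
    ("W", "C"): "keyboards and screens",
    ("C", "W'"): "encoding and decoding code",
    ("W'", "V'"): "virtual keyboards and screens",
}

def neighbours(region):
    """Yield every region that shares a bridge with the given one."""
    for a, b in BRIDGES:
        if a == region:
            yield b
        elif b == region:
            yield a

def find_path(start, goal):
    """Breadth-first search; returns a list of regions, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path("V", "V'"))
```

Under this layout the only route from V to V’ passes through W, C and W’ in turn, crossing all four bridges, which is the path the dotted arrow traces.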

We want human values to get out into the world. The easiest way to do that is the solid line. It shows a human acting on the environment directly.

2) Another diagram showing a trivial and useless alignment scheme.

In this setup, the data goes into the computer and straight back out. It could represent the computer trivially echoing the input. It is easy to add numbers, sort them, loop, take the maximum of a list, etc. in the computer section C, so diagram 2 could also represent a calculator, or any other program not involving AI.

3) Finally, something useful. (Human imitation)

Here is a diagram that shows results flowing all the way from V’ back to the world. It represents something like GPT-3, or other alignment approaches based on human imitation.

And finally, HCH and IDA can be represented as follows.

This diagram shows such a setup, with values flowing down the dotted lines, and actions flowing down the solid lines. The $⨁$ represents the combination process.
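As a very loose illustration of the recursive shape of such a setup, here is a toy in which a simulated human answers "what is the total of this list?" by splitting the question and delegating to further copies. The task, the function names and the depth budget are all assumptions made for this sketch; it is not a definition of HCH or IDA, and `combine` merely stands in for the $⨁$ step.

```python
def combine(answers):
    """Stand-in for the combination step (the circled plus in the diagram)."""
    return sum(answers)

def simulated_human(numbers, depth):
    """Hypothetical imitation of a human asked for the total of a list.

    With budget left, the simulated human splits the question in two and
    hands each half to another copy; otherwise it answers directly.
    """
    if depth == 0 or len(numbers) <= 1:
        return sum(numbers)
    mid = len(numbers) // 2
    halves = [numbers[:mid], numbers[mid:]]
    return combine(simulated_human(half, depth - 1) for half in halves)

total = simulated_human([3, 1, 4, 1, 5, 9], depth=2)
```

Each call to `simulated_human` plays the role of one copy of V’ acting inside the model, and the only thing passed between copies is the sub-question and its answer.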

Now to categorize alignment approaches. There are approaches like IDA, and possibly debate, that try to build an aligned AI out of these available components.

There are interpretability approaches trying to build extra bridges from the red to the blue or green. (Amongst other potential places a bridge could go.)

One approach could be to investigate what other colours of circles can be made, and their bridges.

There may be approaches to AI that can’t be productively considered in terms of these components and interfaces. For linear regression, the parameters are so simple that anything they represent is a very simplistic model of reality indeed.

A robot that uses hard-coded A* search doesn’t really fit this structure. Nor does a hard-coded minimax chess algorithm.

For AIXI and similar algorithms, you can use the “consider all possible X to maximize Y” pattern, so long as X and Y are both in the red compute circle.

I think this technique is a useful conceptual tool because several of the poorly considered AI ideas I have seen have a stage where some piece of information magically jumps from one region to another. I don’t know of any serious alignment researchers making this mistake. I have seen criticisms that can be described as “how does this piece of info get between these regions?”, so presumably this thought pattern is already being used implicitly by some people. I hope that making the structure explicit helps improve this cognitive step.
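The check described here can be written down mechanically. The sketch below takes a proposal's claimed route for a piece of information and reports every step that does not cross a bridge. It assumes the same chain of regions as the first diagram (V, W, C, W’, V’), and `magic_jumps` is a hypothetical name.

```python
# Bridges between adjacent regions (chain layout is an assumption).
BRIDGES = {("V", "W"), ("W", "C"), ("C", "W'"), ("W'", "V'")}

def magic_jumps(route):
    """Return the steps in a claimed route that cross no bridge."""
    jumps = []
    for src, dst in zip(route, route[1:]):
        if (src, dst) not in BRIDGES and (dst, src) not in BRIDGES:
            jumps.append((src, dst))
    return jumps

# A route that crosses every bridge in order has no magic jumps.
honest = magic_jumps(["V", "W", "C", "W'", "V'"])

# "Just put human values in the computer" skips the world entirely.
suspicious = magic_jumps(["V", "C"])
```

An empty result does not make a proposal good; it only means every hop uses an interface that actually exists, which is the minimum one would ask for.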

With thanks to Miranda and Evan R Murphy for proofreading.