Tests make creating AI hard

In my post on incentive structures I gave a potted summary of how to design better incentive structures when what you are trying to achieve is ill-defined:

Improve Models using Measures, use the Model to update Targets.

I would add:

Try to hit Targets. Avoid Tests.

In this post I will recap what I mean by the above, give an abridged history of how the field of AI has failed to follow this maxim well, and sketch what following it might have looked like.

If we want to create safe AI we need good incentive structures for the people researching it, so we need to get good at this.

Recap

A Model is your causal view of the thing you are trying to achieve. A Measure is something you can apply to your system to let you know if you are going in the right direction. Targets are things you are trying to do in the world. A Test, in this formalism, is a Measure and a Target all in one: you don’t need a Model, because if you do well at the Test you are doing well.

One of the key differences between a Measure and a Test is this: if you do better than your Model predicts on a Measure, you should change your Model (and maybe change your Target). If you do better than you expect on a Test, there is no need to change anything.
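The distinction can be made concrete with a toy Python sketch. Everything in it is hypothetical and invented for illustration: the `Model` class, the two `react_to_*` functions and the `tolerance` value are assumptions, not part of the formalism above. The recap only talks about doing better than predicted; treating any large gap (better or worse) as a reason to revise the Model is a further assumption made here.

```python
from dataclasses import dataclass


@dataclass
class Model:
    """Hypothetical stand-in for a causal view: it predicts a score."""
    description: str
    predicted_score: float


def react_to_measure(model: Model, observed: float, tolerance: float = 0.1) -> str:
    """Treat the score as a Measure: compare it with the Model's prediction."""
    surprise = observed - model.predicted_score
    if abs(surprise) > tolerance:
        # Assumption: any large gap, not only a positive one, is a surprise.
        return "revise Model, then reconsider Target"
    return "keep Model and Target"


def react_to_test(observed: float, previous_best: float) -> str:
    """Treat the score as a Test: no Model, only 'did the number go up?'."""
    if observed > previous_best:
        return "doing well, change nothing"
    return "tweak the system to score higher"


model = Model("conversation is a fixed mapping", predicted_score=0.2)
print(react_to_measure(model, observed=0.6))
print(react_to_test(observed=0.6, previous_best=0.5))
```

The same observed score leads to different actions: under the Measure reading the surprise forces a rethink, while under the Test reading a higher number is simply good news.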

Measure 1: The Turing test

The most famous measure in AI is the Turing test.

[it] is a test of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another.

From this we got the Loebner prize. This in turn has led to a profusion of chat bots like this one. Such a bot might entertain the judges and give them a moment’s pause while they try to figure out whether it is a human or not, but you can’t get much real work out of it. It can’t do maths, run a business or teach kids. It is a product of taking the Measure as a Test.

However it is still a good Measure: if something really passed it we would expect it to be intelligent.

Measure 2: ImageNet

This is not supposed to be a test of full intelligence, but of how well an algorithm detects objects in images. While it is producing algorithms that do better and better at this all the time, it seems unlikely that they are getting close to how humans do object recognition.

For example, the algorithms aren’t being designed to take hints about what is in an image to help them locate the object, because that is not what they are being tested for. Why might you want them to? Because there are lots of good examples of humans being able to process these hints, so it seems like a useful thing to do. You can experience the power of “hint taking” first hand if you read this slatestarcodex article.

If your Measure and your Target are static, you are just trying to make a slightly better algorithm that does better at these Tests. You are not going to change your Model of image recognition to be able to incorporate other types of data, like “hints”.

Breaking the Test cycle

So, in general, we are teaching our systems to the Tests. And when the Tests are iterated upon (because they don’t capture what people actually want), the people trying to iterate the Tests get accused of moving the goal posts. There is even a name for it: the AI effect.

If we are to get safe general AI or IA we need to break this cycle.
We need a Model of what intelligence is, and we need to iterate on it so that we can get different Targets, rather than naively trying to meet the Tests better or adding more and more Tests.

I obviously think separating Measures from Targets will help us with making safe AI, so how can we put the maxim below into practice?

Improve Models using the Measure, use the Model to update Targets. Hit Targets, Avoid Tests.

The following is an illustrative alternative history of how the development of AI could have gone if we had better separated our Targets from our Measures. It is described as a series of rounds. Each round has an assumed Model, a Target to try to create given that Model, the Measure (in this case the Turing test, but you could have more Measures) and the result of performing that measurement. Each round also has a failure mode to avoid.

Round 1

Model: Let us start with a model of human conversation as a fixed input/output mapping, for simplicity’s sake.

Target: Create a mapping of input to output that is nice to talk to.

Measure: Use the Turing test as a measure.

Result: This isn’t very good to talk to. It is the same every time. Humans aren’t the same every time. Update the model of an intelligence so that it isn’t seen as a fixed mapping.

Failure mode: Create ever more elaborate mappings from input to output, for example by including previous parts of the conversation in the input.
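To see why the Round 1 result comes out as it does, here is a minimal Python sketch of a fixed input/output mapping. The table contents and the names `FIXED_REPLIES`, `DEFAULT_REPLY` and `reply` are made up for illustration; this is a toy, not a description of any real chat bot.

```python
# Hypothetical Round 1 system: conversation as a fixed lookup table.
FIXED_REPLIES = {
    "hello": "Hello! How are you today?",
    "how are you?": "I am fine, thank you.",
}
DEFAULT_REPLY = "That is interesting. Tell me more."


def reply(utterance: str) -> str:
    """Look the input up in the table; identical input gives identical output."""
    return FIXED_REPLIES.get(utterance.strip().lower(), DEFAULT_REPLY)


first = reply("Hello")
second = reply("Hello")
print(first == second)  # the bot has no way to vary or to learn
```

Making the table bigger, or keying it on more of the conversation history, is the failure mode: it scores a little better on the Measure without changing the Model.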

Round 2

Update Model: Humans seem to learn over time. So let us assume an intelligence also learns.

Target: Create a machine learning system that tries to find a mapping from input to output from data. Provide the system with the data from previous conversations and how well the judges liked those conversations, so that it can update its mapping from input to output.

Measure: Use the Turing test as a measure.

Result: It was pleasant to talk to but one judge tried to teach it simple mathematics (adding two numbers together) and it failed. There is no way our current model of an intelligence could learn mathematics in a single conversation.

Failure mode: Throw more and more data and processing power at it, creating ever more complex mappings.
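A deliberately crude Python sketch of the Round 2 idea follows. It stands in for a real machine learning system by simply keeping the best-rated reply for each input it has seen; the function `learn_mapping`, the `history` data and the ratings are all invented for illustration.

```python
from collections import defaultdict


def learn_mapping(rated_turns):
    """Build an input -> reply mapping from (input, reply, judge_rating) triples.

    Assumption: 'learning' here just means picking the reply with the
    highest average judge rating for each input seen so far.
    """
    ratings = defaultdict(lambda: defaultdict(list))
    for utterance, response, rating in rated_turns:
        ratings[utterance][response].append(rating)
    mapping = {}
    for utterance, responses in ratings.items():
        mapping[utterance] = max(
            responses, key=lambda r: sum(responses[r]) / len(responses[r])
        )
    return mapping


# Hypothetical data from previous conversations, with judge ratings in [0, 1].
history = [
    ("hello", "Hi there!", 0.9),
    ("hello", "Go away.", 0.1),
    ("what is 1 + 1?", "2", 0.8),
]
mapping = learn_mapping(history)
print(mapping["hello"])
print(mapping.get("what is 2 + 3?", "I don't know"))
```

The learned mapping is more pleasant than the fixed one, but it has only memorised that one sum. It has no route to picking up addition from a judge explaining it mid-conversation, which is the gap the Round 2 result points at.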

Round 3

Update Model: Humans seem to be able to treat natural language as programs to be interpreted and compiled. They learn languages this way using large amounts of data, but they also use language to help learn other bits of language.

Target: Create some system that can discern good programs from bad programs, then put programs in it that try to compile human language into novel programs. Also have programs that search for novel patterns in the input language and try to hook them up to code generation.

Measure: Use the Turing test as a measure.

Result:??

Conclusion

I hope I have shown you how the maxim might be useful for thinking about incentive structures for solving a complex problem where we don’t quite understand what we are aiming for. While this is just my initial Model of what we should do about incentive structures for human researchers, it is very important to get right (as it could be used inside an AI as well as in working out how to build one).
So to move forward on the intelligence problem we need to create the best Model of intelligence we can, so that we can find Targets. I think I have one (Round 3), but I would love to get more input on it.

It would also be interesting to think about when having a Model would be a bad idea and it would be better to just use Tests. There are probably some circumstances.