How to get value learning and reference wrong

Six months ago, I thought I was on to something interesting involving value learning and philosophy of reference. Then I had a series of breakthroughs—or do you still call it a breakthrough when it reveals that you were on the wrong track? Reverse breakthrough? How about “repair-around.” Anyhow, this is my attempt to turn the unpublished drafts into a story about their own failure.

One post in this vein did get published: Can few-shot learning teach AI right from wrong? It was about the appeal and shortcomings of what Abram Demski called model-utility learning:

This AI is made of three parts:
● An unsupervised learning algorithm that learns a model of the world and rules for predicting the future state of the model.
● A supervised algorithm that takes some labeled sensory examples of good behavior, plus the model of the world learned by the unsupervised algorithm, and tries to classify which sequences of states of the model are good.
● To take actions, the AI just follows strategies that result in strongly classified-good states of its predictive model.
We get a picture of an AI that has a complicated predictive model of the world, then tries to find the commonalities of the training examples and push the world in that direction. What could go wrong?
[...]
The thing it learns to classify is simply not going to be what we wanted it to learn. We’re going to show it examples that, from our perspective, are an extensional definition of satisfying human values. But the concept we’re trying to communicate is a very small target to hit, and there are many other hypotheses that match the data about as well.
[...]
Here’s the missing ability to do reference—to go from referring speech-acts to the thing being referred to.
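
To make the quoted architecture concrete, here’s a toy sketch in code. Everything specific in it (the one-dimensional world, the nearest-to-centroid notion of “good,” the brute-force planner) is my own invented illustration rather than anything from the original post; the point is just the shape of the three parts.

```python
# A toy model-utility learner, just to pin down the three-part shape.
# The concrete choices (a 1-D world, a nearest-to-centroid "goodness" score,
# a brute-force planner) are illustrative inventions, not anything canonical.
import itertools
import numpy as np

class ToyModelUtilityLearner:
    def __init__(self, horizon=3):
        self.horizon = horizon
        self.transition = None      # part 1's output: rules for predicting the future
        self.good_centroid = None   # part 2's output: what the good examples have in common

    def learn_world(self, trajectories):
        # Part 1, "unsupervised": learn how much each action shifts the state
        # by averaging the observed state changes per action.
        deltas = {}
        for states, actions in trajectories:
            for s, s_next, a in zip(states, states[1:], actions):
                deltas.setdefault(a, []).append(s_next - s)
        self.transition = {a: float(np.mean(d)) for a, d in deltas.items()}

    def learn_values(self, good_states):
        # Part 2, "supervised": find the commonality of the labeled-good states.
        # Here that's just their centroid; goodness is closeness to it.
        self.good_centroid = float(np.mean(good_states))

    def goodness(self, state):
        return -abs(state - self.good_centroid)

    def act(self, state):
        # Part 3, planning: search over action sequences and take the first step
        # of whichever plan scores best on the learned goodness measure.
        actions = list(self.transition)
        best_plan = max(
            itertools.product(actions, repeat=self.horizon),
            key=lambda plan: self.goodness(
                state + sum(self.transition[a] for a in plan)))
        return best_plan[0]

# Usage: "right" and "left" shift a 1-D state, and the good examples cluster near 10.
learner = ToyModelUtilityLearner()
learner.learn_world([([0, 1, 2, 1], ["right", "right", "left"])])
learner.learn_values([9, 10, 11])
print(learner.act(state=0))   # "right"
```

Even at this cartoon scale, all of the optimization pressure lands on whatever the “goodness” score happened to pick up from the labeled examples, which is exactly where the worry in the quote bites.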

This post had its good points. But under the surface, there was a false assumption that the model of value learning and the skill of reference were separable—that the problem was understanding reference so you could bolt it onto a model-utility learner that lacked it.

That’s what the unpublished second post, Reference as a game, was based on. It’s not that I was blind to the problems—it’s that I didn’t see anything better. And so I pushed forward despite feeling stuck.

In this post I want to talk about doing a little bit better. Naively, “doing better” here means learning the concept the human intended, but this is a tricky demand to unpack. There isn’t necessarily a unique thing that the human “intended,” and what goes on inside the human is very different from what will go on inside the AI.
[...]
I want to temporarily sidestep these issues by thinking of reference in terms of games that the AI can be trained to get a good score on.

The idea was to see whether you could train an AI to play a game where it’s presented with some objects and has to find the human-obvious classification boundary that those objects point out. I quite like this passage on Romans and code length:

If you’re talking to an ancient Roman, you can get them to classify horses with just some gesturing and broken Latin, even though the label “equus” is merely a dozen or so bits of information, and horses are complicated phenomena that it might take a big computer program to classify.
But if you’re trying to communicate the concept of potassium-40, you’ll have to teach your Roman a lot of new information about the world before they’ll be able to recognize potassium-40 in a wide array of contexts. You might still be able to explain potassium-40 entirely in words. But then its verbal label is going to be many, many times longer than the label for horses, even though potassium-40 isn’t very physically complicated.
Effectively, your listener’s brain has an encoding for classification rules in which the rule that classifies horses has a short label and the rule that classifies potassium-40 has a long label. We want to make sure our AI game-player is the same way, with a short code for “human-obvious” concepts and a long code for “human-alien” concepts.
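
To pin down the “short code” intuition, here’s a cartoon in code. The bit costs and the hypothesis names are invented for illustration; the only real content is that the prior over classification rules is 2^(-description length), with lengths measured against a human-like vocabulary, and that a good enough fit to the data can still outvote that prior.

```python
# A cartoon of "short codes for human-obvious concepts, long codes for human-alien
# ones." The codebook and bit costs below are made up; the structure is just a
# description-length prior, P(rule) proportional to 2^(-code length in bits).
import math

CODE_LENGTH_BITS = {
    "horse": 12,                               # "equus" plus some gesturing
    "potassium-40": 900,                       # first explain atoms, isotopes, decay...
    "things the teacher would point at": 600,  # also needs a detailed model of the teacher
}

def log_prior(concept):
    return -CODE_LENGTH_BITS[concept] * math.log(2)

def log_posterior(concept, log_likelihood):
    # Posterior score (up to a constant) = description-length prior + fit to the data.
    return log_prior(concept) + log_likelihood

# The prior does its job when the fits are comparable...
print(log_posterior("horse", -5.0) > log_posterior("things the teacher would point at", -5.0))

# ...but if the intended rule misfits a little on every example (human error) while the
# weird rule fits perfectly, enough data eventually overwhelms the prior penalty.
for n in (100, 10_000):
    intended = log_posterior("horse", log_likelihood=-0.3 * n)
    literal = log_posterior("things the teacher would point at", log_likelihood=-0.01 * n)
    print(n, "examples:", "intended wins" if intended > literal else "literal wins")
```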

The problem remained, though—no matter how good your prior, you’re going to have a bad time if a human-undesirable program reliably fits the data better.

There’s still the issue that, in some sense, you don’t want the predictions to be too good. There’s one explanation that’s the absolute best at categorizing the examples given by the humans, and that’s “the things the humans would give as examples.” And because of human fallibility, this best explanation can be driven apart from what the humans intended. This problem isn’t going to show up until the AI is very smart and operating outside the training conditions.
This means that we are training at the wrong game, even when the assumptions of easy mode are valid. Optimal performance at this game doesn’t translate to good behavior.
The obvious analogy here is to inverse reinforcement learning versus cooperative inverse reinforcement learning. IRL and CIRL are two different games: in the first, the AI assumes that there is a generator of its training data, and tries to copy that generator, while in the second, the AI assumes that its training data is generated by another agent with knowledge about the true reward, and tries to copy the true reward. What we need is an analogue of CIRL.
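
The quoted contrast was easier for me to see in code, so here’s a cartoon of it. This isn’t actual IRL or CIRL; the candidate rewards, the demonstrations, and the teacher model are all invented. The only real content is where the likelihood comes from: one learner models the raw generator of the data and imitates it, while the other assumes the data was chosen by someone who knows the true reward and updates a posterior over rewards accordingly.

```python
# A cartoon of "copy the generator" versus "infer the reward the teacher knows."
# The rewards, demos, and teacher model are invented; only the shape of the two
# inference rules matters.
import math
from collections import Counter

REWARDS = ["horses", "whatever the humans would point at"]
DEMOS = ["horse", "convincing horse statue"]

def teacher_policy(reward):
    # A cooperative teacher who knows the true reward avoids showing the statue,
    # even though it would fool a passive observer, because showing it would
    # point at the wrong concept.
    if reward == "horses":
        return {"horse": 0.99, "convincing horse statue": 0.01}
    return {"horse": 0.5, "convincing horse statue": 0.5}

def copy_the_generator(demos):
    # IRL-flavored move from the quote: model the process that produced the data
    # and imitate it -- here, just the empirical distribution over demonstrations.
    counts = Counter(demos)
    return {d: counts[d] / len(demos) for d in DEMOS}

def infer_the_reward(demos):
    # CIRL-flavored move: assume the demos were chosen by a teacher who knows the
    # true reward, and update a posterior over rewards from those choices.
    log_post = {r: 0.0 for r in REWARDS}   # uniform prior
    for d in demos:
        for r in REWARDS:
            log_post[r] += math.log(teacher_policy(r)[d])
    top = max(log_post.values())
    weights = {r: math.exp(lp - top) for r, lp in log_post.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}

demos = ["horse"] * 10   # the humans only ever show the real thing
print(copy_the_generator(demos))   # a distribution over demonstrations to imitate
print(infer_the_reward(demos))     # a posterior over what the teacher is pointing at
```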

That CIRL analogy sounded really compelling to me. I thought I was going to end this post with a nice result. So I started thinking more about the issue, and you can actually see the paragraph in my notes where I had the repair-around.

To expand on the importance of cooperative learning / seeing this as a game: it tries to solve the problem of “look at the moon, not the pointing finger”. The thing generating the examples is a player in the game, trying to point to something in your ontology.
If they’re giving examples of strawberries, and not plastic strawberries, that’s a move that indicates that plastic strawberries aren’t right, even though a realistic one could fool the humans. Like CIRL, this still has unsolved problems with human error or irrationality, and with identifying the “other player” with the humans rather than with the abstract process generating the examples (which might still land us back at learning the category of things the humans think are strawberries).
[...]
We want to specify humans (a particular part of the world) as the “other player” of the game of reference, but doing that specification is itself a reference!

So I learned about this general obstacle to CIRL and started to slowly realize that I’d been pursuing not quite the right thing.

The AI had to start by trying to model humans; “reference” wasn’t a skill to be learned independently of a deep understanding of humans. And so I went back and added more to the third post I’d been working on, Passing the buck to human intuition. The only problem was that I was just going to end up with more repair-arounds. Though I guess I shouldn’t knock repair-arounds—having one means you are less wrong than you were before, after all.

The most charitable thing I can say about this post is that I was trying to work out how an AI could model humans, not as idealized agents + defects, but on humans’ own terms. The biggest problem I ran into was that this is just really, really hard. I don’t even want to quote too heavily from this one. It’s less interesting to watch me run headfirst into a wall, philosophically speaking. Like:

If we think of the goal as “converting” preferences from the human format to the machine format, it helps to know what the human format is that we think we’re converting from.

This isn’t crazy—it wasn’t a waste of time for me to think about how humans think of preferences in terms of associations, heuristics for judging, and verbal reasoning about the future. But this quote shows how much I was still trying to force everything into the model-utility learner paradigm (word of the day: Procrustean), and, if I recall correctly, I was taking much too seriously the idea that there’s some “best” model of human desires and internal maps, and that I had some insight into it.

Eventually I was forced to confront the problem that the human models I was dreaming up were too fine-grained to have a basic object that was “the values.”

“Better,” here, means learning the concept the human intended. But this is somewhat ill-defined. Humans don’t have an explicit preference ordering. Not over actions, nor over world-states, nor over world-histories. If you remember the ol’ Sorites paradox, humans don’t have an explicit definition of “heap” either—and what gets judged as a heap can change based on context. In the same way, it’s not just that the human preference ordering is complicated to write down—what gets judged as good is context-dependent.
In the case of heaps, we think this context-sensitivity is just a normal part of human communication. We aren’t particularly committed to understanding the true essence of heapness and programming it into an AI so that it can judge heaps. But with goodness, we think of goodness as having an essence that needs to be distilled so that we can make accurate judgments of goodness.
One way of thinking about such essences (“abstract objects”?) is that they are members of human mental maps, because that’s what lets us think of them as things. So a better AI, one that is trying to understand our concepts in the sense that we mean them, might have to understand or replicate the human mental map of the world and locate “goodness” within it.
But then, we might ask, how is an AI going to use this human-like map to make decisions? Is it going to make decisions like a human? Because we’ve tried that, and we have complaints. Is it going to cash out goodness as a “stuff” to maximize (like the diamond-maximizer)? As a process to be followed? In either case, the human map is then just a middle-man, and we still need to confront the problem of satisfactorily translating a human concept into the AI’s native format for prediction.

I totally agree, past self—those sure are a bunch of problems. Since I stopped writing that post, I’ve gotten a better handle on how to understand humans (hint: it’s the intentional stance), but I don’t have a much better understanding of turning those human-models into highly optimized plans for the AI. I do have some ideas involving predictive models, though, so maybe it’s time to start writing some more posts, and make some repair-arounds.