A philosopher’s critique of RLHF

In the spring, I went to a talk with Brian Christian at Yale. He talked about his book, The Alignment Problem, and then there was an audience Q&A. There was a really remarkable question in that Q&A, which I have transcribed here. It came from the Yale philosophy professor L.A. Paul. I have since spoken to Professor Paul; she has done some work on AI (and coauthored the paper “Effective Altruism and Transformative Experience”), but my general impression is that she hasn’t yet spent a huge amount of time thinking about AI safety. Partly because of this question, I invited her to speak at the CAIS Philosophy Fellowship, which she will be doing in the spring.

The transcript below really doesn’t do her question justice, so I’d recommend watching the recording, starting at the 55-minute mark.

During the talk, Brian Christian described reinforcement learning from human feedback (RLHF), specifically the original paper, in which an agent was trained on a reward signal learned from humans repeatedly rating which of two video clips of a simulated robot was closer to a backflip (I’ve included a rough code sketch of that setup after the transcript). Paul’s question is about this (punctuation added, obviously):

L.A. Paul: So, I found it very interesting, but I’m just not fully understanding the optimistic note you ended on... So, in that example, what was key was that the humans that did the “better” thing knew what a backflip was. It was something they recognized. It was something they recognized, so they could make a judgment. But the real issue for us is recognizing, or for machines is recognizing, entirely new kinds of events, like a pandemic, or a president that doesn’t follow the rule of law, or something interesting called the internet; you know, there’s radically new technological advances. And when something like that happens, those rough judgments of “this is better than that”... In other words, those new things: first, we’re terrible at describing them before they come and predicting them. (Although humans are very good at a kind of one-shot learning, so they can make judgments quite quickly. Machines are not like that.)

L.A. Paul: Moreover, these better-than judgments that the machine might be relying on could, I think, quite straightforwardly be invalidated, because everything changes, or deep things change, in all kinds of unexpected ways. That just seems to be... that’s the real problem. It’s not... using machines for things that we already have control over. No, it’s about trust with entirely new categories of events. So, I was just sort of deeply unclear on... I mean, that seems like a nice thing... but that’s not, for me, the real alignment problem.

Brian Christian: [Agrees, and then talks about calibrated uncertainty in models.]

L.A. Paul: Sorry, there’s a difference between uncertainty, where you’re not sure if it’s A, B, or C, and unknown, OK, which is a different kind of uncertainty in the probabilistic literature. And then you haven’t got “oh, is it A, is it B, is it C?” It’s some other kind of thing that you can’t classify, and that’s the problem I’m trying to target.
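
For anyone who wants the mechanics Paul is gesturing at, here is the rough sketch I mentioned above: a minimal, PyTorch-style version of the preference-based reward modeling step behind the backflip example. The RewardModel class, the feature-vector inputs standing in for video clips, and all of the sizes and hyperparameters are illustrative assumptions of mine, not details from the paper.

    import torch
    import torch.nn as nn

    # Hypothetical reward model: scores each timestep of a trajectory segment
    # (here a flat feature vector) and sums over the segment. In the original
    # paper the raters compare short video clips of the simulated robot; the
    # feature vectors below are an illustrative stand-in for that.
    class RewardModel(nn.Module):
        def __init__(self, obs_dim: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, segment: torch.Tensor) -> torch.Tensor:
            # segment: (batch, timesteps, obs_dim) -> summed reward: (batch,)
            return self.net(segment).sum(dim=-2).squeeze(-1)

    def preference_loss(model, seg_a, seg_b, human_prefers_a):
        # Bradley-Terry-style objective: model the probability that the human
        # prefers segment A as a logistic function of the difference in summed
        # predicted rewards, and fit it to the recorded human choices.
        logits = model(seg_a) - model(seg_b)
        return nn.functional.binary_cross_entropy_with_logits(
            logits, human_prefers_a.float()
        )

    # Toy usage: a batch of 8 comparisons, 25-step segments, 10-dim observations,
    # with random data standing in for real trajectories and human labels.
    model = RewardModel(obs_dim=10)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    seg_a, seg_b = torch.randn(8, 25, 10), torch.randn(8, 25, 10)
    prefers_a = torch.randint(0, 2, (8,))  # 1 if the rater picked clip A, else 0
    loss = preference_loss(model, seg_a, seg_b, prefers_a)
    loss.backward()
    optimizer.step()

The learned reward model is then used as the reward signal for an ordinary RL algorithm. Paul’s point is that this whole loop leans on the raters already recognizing what a backflip looks like.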

I’m not claiming that these are original ideas or that they represent all possible critiques of RLHF. Rather:

  • I think the phrasing is especially crisp, particularly for something said off the cuff (I’m sure she could put it even more crisply in writing).

  • I think it’s interesting that somebody who is very intelligent and accomplished in philosophy, but not (I think) steeped in the alignment literature, could so readily carve this problem at its joints.

The rest of the talk is pretty good too! The Q&A especially had some other good questions (including one of my own), but this one stood out to me.