This one is somewhat more Wentworth-flavored than our previous Doomimirs.
Also, I’ll write Doomimir’s part unquoted this time, because I want to use quote blocks within it.
On to Doomimir!
Let’s start with this:

> We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen?
Short answer: because those aren’t actually very effective ways to get high ratings, at least within the current capability regime.
Long version: presumably the labeller knows perfectly well that they’re working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that… have you ever personally done an exercise where you try to convince someone to do something they don’t want to do, or aren’t supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner’s job was to not open their fist, while the other partner’s job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar—not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.
Point of that story: based on my own firsthand experience, if you’re not actually going to pay someone right now, then it’s far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.
Ultimately, our discussion is using “threats and bribes” as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.
Now, you could reasonably respond: “Isn’t it kinda fishy that the supposed failures on which your claim rests are ‘illegible’?”
To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:
> The iterative design loop hasn’t failed yet.
Now that’s a very interesting claim. I ask: what do you think you know, and how do you think you know it?
Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the “observe/orient” steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don’t necessarily know that it’s failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don’t know a problem is there, or don’t orient toward the right thing, and therefore we don’t iterate on the problem.
What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:
There is some widely-believed falsehood. The generative model might “know” the truth, from having trained on plenty of papers by actual experts, but the raters don’t know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/DeepMind/Anthropic do not employ experts in most of the world’s subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
There is some even-more-widely-believed falsehood, such that even the so-called “experts” haven’t figured out yet that it’s false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.
Neither raters nor developers have time to check the models’ citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
On various kinds of “which option should I pick” questions, there’s an option which results in marginally more slave labor, or factory farming, or what have you—terrible things which a user might strongly prefer to avoid, but it’s extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don’t reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).
This is the sort of problem which, in the high-capability regime, especially leads to “Potemkin village world”.
On various kinds of “which option should I pick” questions, there are options which work great in the short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progress, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don’t even attempt to test which options are best long-term; they just read the LLM’s response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
On various kinds of “which option should I pick” questions, there are things which sound fun or are marketed as fun, but which humans mostly don’t actually enjoy (or don’t enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)
… and so forth. The unifying theme here is that when these failures occur, it is not obvious that they’ve occurred.
This makes empirical study tricky—not impossible, but it’s easy to be misled by experimental procedures which don’t actually measure the relevant things. For instance, your summary of the Stiennon et al. paper just now:
> They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but **actual preference of human raters** when you show them the summaries follows an inverted-U curve...
(Bolding mine.) As you say, one could spin that as demonstrating “yet another portent of our impending deaths”, but really this paper just isn’t measuring the most relevant things in the first place. It’s still using human ratings as the evaluation mechanism, so it’s not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.
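(For concreteness, the quantity being varied in that setup is, roughly and in the usual RLHF notation rather than a quote from the paper, the per-output reward

$$R(x, y) \;=\; r_\phi(x, y) \;-\; \beta \,\log\frac{\pi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}$$

where $r_\phi$ is the learned reward model, $\pi^{\mathrm{RL}}$ the fine-tuned policy, $\pi^{\mathrm{SFT}}$ the supervised baseline, and $\beta$ the KL-penalty coefficient. Shrinking $\beta$ lets the policy drift further from the baseline in pursuit of reward-model score; that is where the inverted-U in actual human preference shows up. But note that both the reward model and the final evaluation still bottom out in human judgments.)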
So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?
(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)
.
.
.
So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that’s “better than” the labels used for training.
OpenAI’s weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a “ground truth” which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter, I haven’t put that much care into it.)
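To make that experimental structure concrete, here’s a minimal toy sketch of the weak-to-strong setup, with small scikit-learn classifiers standing in for the paper’s GPT-series models and NLP tasks; everything here is illustrative, none of it is the paper’s actual code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy task with known ground truth.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, test_size=0.75, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# "Weak supervisor": a deliberately under-powered model, analogous to human raters.
weak = LogisticRegression(max_iter=500).fit(X_sup[:200], y_sup[:200])
weak_labels = weak.predict(X_train)  # imperfect labels

# "Strong student": a more capable model trained only on the weak labels.
strong_on_weak = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)

# "Strong ceiling": the same capable model trained on ground truth, for reference.
strong_ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Crucially, evaluation uses ground truth that is *better than* the training labels,
# so exploitation of the labels' imperfections is actually visible here.
for name, model in [("weak supervisor", weak),
                    ("strong, trained on weak labels", strong_on_weak),
                    ("strong, trained on ground truth", strong_ceiling)]:
    print(f"{name}: ground-truth test accuracy = {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

The paper’s headline metric is, roughly, how much of the gap between the weak supervisor and the strong ceiling the weakly-supervised student recovers. For present purposes, though, the structural point is just that the evaluation grounds out in something strictly better than the training labels, so exploitation of the labels is visible at all—which is precisely what human-rating-based evaluations can’t offer.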
More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most “coming decoupled from reality” problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That’s what tends to happen, in fields where people don’t have a deep understanding of the systems they’re working with.
And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that’s just the experimentalist themselves—this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists’ interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure—John Boyd, the guy who introduced the term “OODA loop”, talked a lot about that sort of failure.)
To summarize the main points here:
Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (… and then things are hard.)
AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.
Now, I would guess that your next question is: “But how does that lead to extinction?”. That is one of the steps which has been least well-explained historically; someone with your “unexpectedly low polygenic scores” can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you… <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?
But I’ll stop here and give you an opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.