Zeroth point: under a Doomimir-ish view, the “modelling the human vs modelling in a similar way to the human” frame is basically right for current purposes, so no frame clash.
On to the main response...
Doomimir: This isn’t just an “in the limit” argument. “I’ll give you $100 if you classify this as an apple” → (predict apple classification) is not some incredibly high-complexity thing to figure out. This isn’t a Jupiter-brain sort of challenge.
For instance, anything with a simplicity prior at all similar to humans’ simplicity prior will obviously figure it out, as evidenced by the fact that humans can figure out hypotheses like “it’s bribing the classifier” just fine. Even beyond human-like priors, any ML system which couldn’t figure out something that basic would apparently be severely inferior to humans in at least one very practically-important cognitive domain.
Even prior to developing a full-blown model of the human rater, models can incrementally learn to predict the systematic errors in human ratings, and we can already see that today. The classic case of the grabber hand is a go-to example:
(A net learned to hold the hand in front of the ball, so that it looks to a human observer like the ball is being grasped. Yes, this actually happened.)
… and anecdotally, I’ve generally heard from people who’ve worked with RLHF that as models scale up, they do in fact exploit rater mistakes more and more, and it gets trickier to get them to do what we actually want. This business about “The technology in front of us really does seem like it’s ‘reasoning with’ rather than ‘reasoning about’” is empirically basically false, and seems to get more false in practice as models get stronger even within the current relatively-primitive ML regime.
So no, this isn’t a “complicated empirical question” (or a complicated theoretical question). The people saying “it’s a complicated empirical question, we Just Can’t Know” are achieving their apparent Just Not Knowing by sticking their heads in the sand; their lack of knowledge is a fact about them, not a fact about the available evidence.
(I’ll flag here that I’m channeling the character of Doomimir and would not necessarily say all of these things myself, especially the harsh parts. Happy to play out another few rounds of this, if you want.)
Simplicia: I think it’s significant that the “hand between ball and camera” example from Amodei et al. 2017 was pure RL from scratch. You have a function π that maps observations (from the robot’s sensors) to actions (applying torques to the robot’s joints). You sample sequences of observation–action pairs from π and show them to a human, and fit a function r̂ to approximate the human’s choices. Then you use Trust Region Policy Optimization to adjust π to score better according to r̂. In this case, TRPO happened to find something that looked good instead of being good, in a way that r̂ wasn’t able to distinguish. That is, we screwed up and trained the wrong thing. That’s a problem, and the severity of the problem would get worse the more capable π was and the more you were relying on it. If we were going to produce powerful general AI systems with RL alone, I would be very nervous.
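(For concreteness, here is a minimal toy sketch of the pipeline Simplicia just described: sample behavior from a policy, collect simulated human comparisons, fit a reward model to those comparisons, then optimize the policy against that learned proxy. It swaps the robotics environment and TRPO for a contextual bandit and plain REINFORCE, and every name and number in it is made up for illustration; it is a sketch of the general shape of the method, not the actual setup from the paper.)

```python
# Toy sketch: simulated human comparisons -> fitted reward model r_hat ->
# policy optimized against r_hat. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, N_ACTIONS = 4, 3

# Hidden "true" human utility -- the thing we actually care about.
true_w = rng.normal(size=(OBS_DIM, N_ACTIONS))

def human_prefers_first(obs, a1, a2):
    """Simulated rater: noisy (Bradley-Terry) choice between two actions."""
    u1, u2 = obs @ true_w[:, a1], obs @ true_w[:, a2]
    return rng.random() < 1.0 / (1.0 + np.exp(-(u1 - u2)))

# Step 1: fit a reward model r_hat(obs, a) = obs . v[:, a] to pairwise comparisons.
v = np.zeros((OBS_DIM, N_ACTIONS))
for _ in range(5000):
    obs = rng.normal(size=OBS_DIM)
    a1, a2 = rng.choice(N_ACTIONS, size=2, replace=False)
    y = 1.0 if human_prefers_first(obs, a1, a2) else 0.0
    p = 1.0 / (1.0 + np.exp(-(obs @ v[:, a1] - obs @ v[:, a2])))
    v[:, a1] += 0.05 * (y - p) * obs  # gradient ascent on the Bradley-Terry log-likelihood
    v[:, a2] -= 0.05 * (y - p) * obs

# Step 2: optimize a softmax policy pi(a|obs) against r_hat with REINFORCE.
theta = np.zeros((OBS_DIM, N_ACTIONS))
for _ in range(5000):
    obs = rng.normal(size=OBS_DIM)
    logits = obs @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(N_ACTIONS, p=probs)
    proxy_reward = obs @ v[:, a]  # the policy only ever sees the learned proxy, never true_w
    theta += 0.05 * proxy_reward * np.outer(obs, np.eye(N_ACTIONS)[a] - probs)

# The gap between these two numbers is the room for "looks good to the proxy"
# to come apart from "is actually good".
test_obs = rng.normal(size=(1000, OBS_DIM))
chosen = (test_obs @ theta).argmax(axis=1)
idx = np.arange(len(test_obs))
print("avg proxy reward of learned policy:", (test_obs @ v)[idx, chosen].mean())
print("avg true utility of learned policy:", (test_obs @ true_w)[idx, chosen].mean())
```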
But the reason I’m so excited about language models in particular is that their capabilities seem to mostly come from unsupervised pre-training rather than RLHF. You fit a function to the entire internet first, and only afterwards tweak it a bit so that its outputs look more like obeying commands rather than predicting random internet tokens—where the tweaking process incorporates tricks like penalizing the Kullback–Leibler divergence from the reward model’s training distribution, such that you’re not pulling the policy too far away from the known-safe baseline.
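(Schematically, the tweaking objective Simplicia is gesturing at looks something like
$$\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[\hat r(x, y)\right] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{base}}(\cdot \mid x)\right),$$
where $\pi_{\mathrm{base}}$ is the pretrained or supervised baseline policy, $\hat r$ is the learned reward model, and $\beta$ sets how expensive it is to drift from that baseline. This is the general form used in, e.g., Stiennon et al. 2020; exact coefficients, reference policies, and whether the penalty is applied per token vary by implementation.)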
I agree that as a consequentialist with the goal of getting good ratings, the strategy of “bribe the rater” isn’t very hard to come up with. Indeed, when I prompt GPT-4 with the problem, it gives me “Offering Incentives for Mislabeling” as #7 on a list of 8.
But the fact that GPT-4 can do that seems like it’s because that kind of reasoning appears on the internet, which is what I mean by the claim that contemporary systems are “reasoning with” rather than “reasoning about”: the assistant simulacrum being able to explain bribery when prompted isn’t the same thing as the LM itself trying to maximize reward.
I’d be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes. What did those entail, exactly? (To the extent that we lack critical evidence about this potential alignment failure because the people who experienced it are gagged by an NDA, that seems like a point in favor of sharing information about language model capabilities.)
I certainly expect some amount of sycophancy: if you sample token completions from your LM, and then tweak its outputs to be more like what your raters want to hear, you end up with an LM that’s more inclined to say what your raters want to hear. Fine. That’s a problem. Is it a fatal problem? I mean, if you don’t try to address it at all and delegate all of your civilization’s cognition to machines that don’t want to tell you about problems, then eventually you might die of problems your AIs didn’t tell you about.
But “mere” sycophancy sounds like a significantly less terrifying failure mode than reward hacking of the sort that would result in things like the LM spontaneously trying to threaten or bribe labelers. That would have a large KL divergence from the policy you started with!
Doomimir: I’ll summarize the story you seem excited about as follows:
We train a predictive model on The Whole Internet, so it’s really good at predicting text from that distribution.
The human end-users don’t really want a predictive model. They want a system which can take a natural-language request, and then do what’s requested. So, the humans slap a little RL (specifically RLHF) on the predictive model, to get the “request → do what’s requested” behavior.
The predictive model serves as a strong baseline for the RL’d system, so the RL system can “only move away from it a little” in some vague handwavy sense. (Also in the KL divergence sense, which I will admit as non-handwavy for exactly those parts of your argument which you can actually mathematically derive from KL-divergence bounds, which is currently zero of the parts of your argument.)
The “only move away from The Internet Distribution a little bit” part somehow makes it much less likely that the RL’d model will predict and exploit the simple predictable ways in which humans rate things. As opposed to, say, make it more likely that the RL’d model will predict and exploit the simple predictable ways in which humans rate things.
There are multiple problems in this story.
First, there’s the end-users demanding a more agenty system rather than a predictor, which is why people are doing RLHF in the first place rather than raw prompting (which would be better from a safety perspective). Given time, that same demand will drive developers to make models agentic in other ways too (think AgentGPT), or to make the RLHF’d LLMs more agentic and autonomous in their own right. That’s not the current center of our discussion, but it’s worth a reminder that it’s the underlying demand which drives developers to choose more risky methods (like RLHF) over less risky methods (like raw predictive models) in the first place.
Second, there’s the vague handwavy metaphor about the RL system “only moving away from the predictive model a little bit”. The thing is, we do need more than a handwavy metaphor! “Yes, we don’t understand at the level of math how making that KL-divergence small will actually impact anything we actually care about, but my intuition says it’s definitely not going to kill everyone. No, I haven’t been able to convince relevant experts outside of companies whose giant piles of money are contingent on releasing new AI products regularly, but that’s because they’re not releasing products and therefore don’t have firsthand experience of how these systems behave. No, I’m not willing to subject AI products to a burden-of-proof before they induce a giant disaster” is a non-starter even if it turns out to be true.
Third and most centrally to the current discussion, there’s still the same basic problem as earlier: to a system with priors instilled by The Internet, [“I’ll give you $100 if you classify this as an apple” → (predict apple classification)] is still a simple thing to learn. It’s not like pretraining on the internet is going to make the system favor models which don’t exploit the highly predictable errors made by human raters. If anything, all that pretraining will make it easier for the model to exploit raters. (And indeed, IIUC that’s basically what we see in practice.)
As you say: the fact that GPT-4 can do that seems like it’s because that kind of reasoning appears on the internet.
(This one’s not as well-written IMO, it’s mashing a few different things together.)
Simplicia: Where does “empirical evidence” fall on the sliding scale of rigor between “handwavy metaphor” and “mathematical proof”? The reason I think the KL penalty in RLHF setups impacts anything we care about isn’t mostly because the vague handwaving sounds plausible, but because of data such as that presented in Fig. 5 of Stiennon et al. 2020. They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve, where straying from the base policy a little is good, but straying farther is increasingly bad, as the system overoptimizes on what looks good to the reward model, which was only a proxy for the true goal.
(You can see examples of the overoptimized summaries in Table 29 on the last page of the paper. Apparently the reward model really liked tripled question marks and the phrase “pls halp”??? I weakly suspect that these are the kind of “weird squiggles” that would improve with scaling up the reward model, similarly to how state-of-the-art image generators lack the distortions and artifacts of their compute-impoverished predecessors. The reward model in this experiment was only 1.3 billion parameters.)
I’m sure you’ll have no trouble interpreting these results as yet another portent of our impending deaths. We were speaking theoretically about AIs exploiting the Goodhart problem between human ratings and actual goodness, but practical RLHF systems aren’t actually sample-efficient enough to solely use direct human feedback, and have an additional Goodhart problem between reward model predictions of human ratings, and actual ratings. Isn’t that worse? Well, yes.
But the ray of hope I see here is more meta and methodological, rather than turning on any one empirical result. It’s that we have empirical results. We can study these machines, now, before their successors are powerful enough to kill us. The iterative design loop hasn’t failed yet. That can’t last forever—at some point between here and the superintelligence at the end of time, humans are going to be out of the loop. I’m glad people are doing theory trying to figure out what that looks like and how it could be arranged to go well.
But I’m worried about ungrounded alignment theorizing failing to make contact with reality, sneeringly dismissing genuinely workable designs as impossible by appealing to perfectly antisphexish consequentialists on a frictionless plane, when some amount of sphexishness and friction is a known factor of the algorithms in question.
We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen? I mean, like, literal threats and promises. You mention rumors from a DeepMind employee about the larger Gemini models being hard to train, but without more details, I’m inclined to guess that that was “pls halp”-style overoptimization rather than the kind of power-seeking or deceptive alignment that would break the design loop. (Incidentally, Gao et al. 2022 studied scaling laws for reward model overoptimization and claimed that model size basically didn’t matter? See §4.4, “Policy size independence”.)
What’s going on here? If I’m right that GPT-4 isn’t secretly plotting to murder us, even though it’s smart enough to formulate the idea and expected utility maximizers have a convergent incentive to murder competitors, why is that?
Here’s my view: model-free reinforcement learning algorithms such as those used in RLHF tweak your AI’s behavior to be more like the behavior that got reward in the past, which is importantly different from expected utility maximization. To the extent that you succeed in rewarding Honest, Helpful, and Harmless behavior in safe regimes, you can plausibly get a basically HHH AI assistant that generalizes to not betraying you when it has the chance, similar to how I don’t do heroin because I don’t want to become a heroin addict—even though if I did take heroin, the reinforcement from that would make me more likely to do it again. Then the nature of the game is keeping that good behavior “on track” for as long as we can—even though the superintelligence at the end of time is presumably going to do something more advanced than model-free RL. It’s possible to screw up and reward the wrong thing, per the robot hand in front of the ball—but if you don’t screw up early, your basically-friendly-but-not-maximally-capable AIs can help you not screw up later. And in the initial stages, you’re only fighting gradient descent, not even an AGI.
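(A toy contrast of the two decision rules being distinguished here, with made-up numbers: a reward-reinforced agent only shifts its action propensities after it has actually taken an action and been rewarded or punished for it, whereas an explicit expected-utility maximizer evaluates consequences up front, including for actions it has never tried. This is only the conceptual distinction in runnable form, not a model of RLHF itself.)

```python
# "Reinforce what got reward in the past" vs. "argmax expected utility under
# a model of consequences". Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
N_ACTIONS = 4
true_utility = rng.normal(size=N_ACTIONS)  # payoffs; agent (1) only experiences them via noisy rewards

# (1) Model-free, reward-reinforced agent (a standard gradient-bandit rule):
# its propensities only move after an action is actually taken and rewarded.
prefs = np.zeros(N_ACTIONS)
baseline = 0.0
for t in range(1, 501):
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    a = rng.choice(N_ACTIONS, p=probs)
    r = true_utility[a] + rng.normal(scale=0.5)  # noisy observed reward
    baseline += (r - baseline) / t               # running average as baseline
    prefs += 0.1 * (r - baseline) * (np.eye(N_ACTIONS)[a] - probs)

# (2) Explicit expected-utility maximizer: given a model of the payoffs,
# it just takes the argmax -- no trials or reinforcement history required.
planner_choice = int(np.argmax(true_utility))

print("planner picks action", planner_choice,
      "| reinforced agent ends up favoring action", int(np.argmax(prefs)))
```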
More broadly, here’s how I see the Story of Alignment so far. It’s been obvious to sufficiently penetrating thinkers for a long time that the deep future belongs to machine intelligence—that, as George Eliot put it in 1879, “the creatures who are to transcend and finally supersede us [will] be steely organisms, giving out the effluvia of the laboratory, and performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy.”
What’s less obvious is how much control we can exert over how that goes by setting the initial conditions. Can we arrange for the creatures who are to transcend and finally supersede us to be friendly and create the kind of world we would want, or will they murder us and tile the universe with something random?
Fifteen years ago, the problem looked hopeless, just from considering the vast complexity of human values. How would you write a computer program that values “happiness”, “freedom”, or “justice”, let alone everything else we want? It wasn’t clear how to build AI at all, but surely it would be easier to build some AI than a good AI. Humanity was doomed.
But now, after the decade of deep learning, the problem and its impossible solution seem to be arriving closer together than I would have ever dreamt. Okay, we still don’t know how to write down the human utility function, to be plugged in to an arbitrarily powerful optimizer.
But it’s increasingly looking like value isn’t that fragile if it’s specified in latent space, rather than a program that breaks if a single character is wrong—that there are ways to meaningfully shape the initial conditions of our world’s ascension that don’t take the exacting shape of “utility function + optimizer”.
We can leverage unsupervised learning on human demonstration data to do tasks the way humans do them, and we can use RLHF to elicit behavior we want in situations where we can’t write down our desires as an explicit reward or utility function. Crucially, by using these techniques together to compensate for each other’s safety and capability weaknesses, it seems feasible to build AI whose effects look “like humans, but faster”: performing with infallible exactness everything that we would have performed with a slovenly approximativeness and self-defeating inaccuracy. That doesn’t immediately bring about the superintelligence at the end of time—although it might look pretty immediate in sidereal time—but seems like a pretty good way to kick off our world’s ascension.
Is this story wrong? Maybe! … probably? My mother named me “Simplicia”, over my father’s objections, because of my unexpectedly low polygenic scores. I am aware of my … [she hesitates and coughs, as if choking on the phrase] learning disability. I’m not confident in any of this.
But if I’m wrong, there should be arguments explaining why I’m wrong—arguments that should convince scientists working in the field, even if I personally am too limited to understand them. I’ve tried to ground my case in my understanding of the state of the art, citing relevant papers when applicable.
In contrast, dismissing the entire field as hopeless on the basis of philosophy about “perfectly learn[ing] and perfectly maximiz[ing] the referent of rewards” isn’t engaging with the current state of alignment, let alone all the further advances that humans and our non-superintelligent AIs will come up with before the end of days! Doomimir Doomovitch, with the fate of the lightcone in the balance, isn’t it more dignified to at least consider the possibility that someone else might have thought of something? Reply! Reply!
This one is somewhat more Wentworth-flavored than our previous Doomimirs.
Also, I’ll write Doomimir’s part unquoted this time, because I want to use quote blocks within it.
On to Doomimir!
We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen?
Let’s start with this.
Short answer: because those aren’t actually very effective ways to get high ratings, at least within the current capability regime.
Long version: presumably the labeller knows perfectly well that they’re working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that… have you ever personally done an exercise where you try to convince someone to do something they don’t want to do, or aren’t supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner’s job was to not open their fist, while the other partner’s job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar—not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.
Point of that story: based on my own firsthand experience, if you’re not actually going to pay someone right now, then it’s far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.
Ultimately, our discussion is using “threats and bribes” as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.
Now, you could reasonably respond: “Isn’t it kinda fishy that the supposed failures on which your claim rests are ‘illegible’?”
To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:
The iterative design loop hasn’t failed yet.
Now that’s a very interesting claim. I ask: what do you think you know, and how do you think you know it?
Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the “observe/orient” steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don’t necessarily know that it’s failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don’t know a problem is there, or don’t orient toward the right thing, and therefore we don’t iterate on the problem.
What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:
There is some widely-believed falsehood. The generative model might “know” the truth, from having trained on plenty of papers by actual experts, but the raters don’t know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/Deepmind/Anthropic do not employ experts in most of the world’s subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
There is some even-more-widely-believed falsehood, such that even the so-called “experts” haven’t figured out yet that it’s false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.
Neither raters nor developers have time to check the models’ citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
On various kinds of “which option should I pick” questions, there’s an option which results in marginally more slave labor, or factory farming, or what have you—terrible things which a user might strongly prefer to avoid, but it’s extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don’t reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).
This is the sort of problem which, in the high-capability regime, especially leads to “Potemkin village world”.
On various kinds of “which option should I pick” questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don’t even attempt to test which options are best long-term, they just read the LLM’s response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
On various kinds of “which option should I pick” questions, there are things which sound fun or are marketed as fun, but which humans mostly don’t actually enjoy (or don’t enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)
… and so forth. The unifying theme here is that when these failures occur, it is not obvious that they’ve occurred.
This makes empirical study tricky—not impossible, but it’s easy to be misled by experimental procedures which don’t actually measure the relevant things. For instance, your summary of the Stiennon et al. paper just now:
They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve...
(Bolding mine.) As you say, one could spin that as demonstrating “yet another portent of our impending deaths”, but really this paper just isn’t measuring the most relevant things in the first place. It’s still using human ratings as the evaluation mechanism, so it’s not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.
So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?
(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)
.
.
.
So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that’s “better than” the labels used for training.
OpenAI’s weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a “ground truth” which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup, their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter; I haven’t put that much care into it.)
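(To make the shape of that experimental design concrete, here is a minimal synthetic stand-in: the training signal comes from a deliberately weak labeler, but because the data is synthetic we retain real ground truth, so the evaluation is strictly better than the training labels. This is only the skeleton of the setup; it is not the paper’s models, data, or metrics, and whether the student happens to beat its supervisor in this linear toy is beside the point.)

```python
# Minimal weak-supervisor / strong-student evaluation skeleton. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 20))
true_w = rng.normal(size=20)
y = (X @ true_w > 0).astype(int)  # synthetic ground truth we get to keep

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Weak supervisor": deliberately under-featured, standing in for error-prone
# ratings. It only sees the first 3 of 20 features.
weak = LogisticRegression(max_iter=1000).fit(X_train[:, :3], y_train)
weak_labels = weak.predict(X_train[:, :3])

# "Strong student": full-featured model trained only on the weak labels.
student = LogisticRegression(max_iter=1000).fit(X_train, weak_labels)

# Ceiling: the same strong model trained directly on ground truth.
ceiling = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Crucially, all three are scored against ground truth, which is strictly
# better than the labels the student was trained on.
for name, acc in [
    ("weak supervisor", weak.score(X_test[:, :3], y_test)),
    ("strong student (trained on weak labels)", student.score(X_test, y_test)),
    ("strong ceiling (trained on true labels)", ceiling.score(X_test, y_test)),
]:
    print(f"{name}: {acc:.3f}")
```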
More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most “coming decoupled from reality” problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That’s what tends to happen, in fields where people don’t have a deep understanding of the systems they’re working with.
And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that’s just the experimentalist themselves—this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists’ interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure—John Boyd, the guy who introduced the term “OODA loop”, talked a lot about that sort of failure.)
To summarize the main points here:
Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (… and then things are hard.)
AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.
Now, I would guess that your next question is: “But how does that lead to extinction?”. That is one of the steps which has been least well-explained historically; someone with your “unexpectedly low polygenic scores” can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you… <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?
But I’ll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.
I mean, in real reality, Sydney literally threatened a journalist.
Putting “the burden of proof” aside, I think it would be great if someone stated more or less formally what evidence moves them how much toward which model. Because “pretraining makes it easier to exploit” is meaningless without numbers: the whole optimistic point is that it’s not overwhelmingly easier (as evidenced by RLHF’d systems not always exploiting users), and that the exploits become less catastrophic and more common-sense because of pretraining. So the question is not about the direction of the evidence, but about whether it can overcome the observation that current systems mostly work.
“Current systems mostly work” not because of RLHF specifically, but because we are under conditions where the iterative design loop works, i.e., mainly: if our system is not aligned, it doesn’t kill us, so we can continue iterating until it has acceptable behaviour.
But iterative design works not only because we are not killed—it also wouldn’t work if acceptable behavior didn’t generalize at least somewhat from training. But it does generalize, so it’s possible that iteratively aligning a system under safe conditions would produce acceptable behavior even when the system can kill you. Or what is your evidence to the contrary? Like, does AutoGPT immediately kill you if you connect it to some robot via Python?
My evidence is how it is actually happening.
If you look at actual alignment development and ask yourself “what do I see, at the empirical level?”, you’ll get this scenario:
We reach a new level of capabilities
We get new types of alignment failures
If this alignment failure doesn’t kill everyone, we can fix it even by very dumb methods, like “RLHF against failure outputs”, but this doesn’t tell us anything about the kill-everyone level of capabilities.
I.e., I don’t expect AutoGPT to kill anyone, because AutoGPT is certainly not capable of doing this. But I expect that AutoGPT had a bunch of failures that were unpredictable in advance.
Examples:
What happened to ChatGPT on release.
What happened to ChatGPT in a slightly unusual environment, despite all alignment training.
It’s not capable under all conditions, but you can certainly prepare conditions under which AutoGPT can kill you: you can connect it to a robot arm with a knife, explain what commands do what, and tell it to proceed. And AutoGPT will not suddenly start trying to kill you just because it can, right?
If this alignment failure doesn’t kill everyone, we can fix it even by very dumb methods, like “RLHF against failure outputs”, but this doesn’t tell us anything about the kill-everyone level of capabilities.
Why doesn’t it? Fixing alignment failures under relatively safe conditions may fix them for other conditions too. Or why are you thinking about “kill-everyone” capabilities anyway—do you expect RLHF to work for arbitrary levels of capabilities if you don’t die doing it? Like, if an ASI trained some weaker AI by RLHF in an environment where it could destroy an Earth or two, would that work?
Huh, it’s worse than I expected, thanks. And it even gets worse from GPT-3 to 4. But still—extrapolation from this requires quantification—after all, they did mostly fix it by using a different prompt. How do you decide whether it’s just evidence for “we need more finetuning”?
After thinking for a while, I decided that it’s better to describe the level of capability not as “capable of killing you”, but as “lethal-by-default output”. I.e.,
If an ASI builds nanotech that self-replicates in a wide range of environments and doesn’t put in specific protections against it turning humans into gray goo, you are dead by default;
If an ASI optimizes the economy to get +1000% productivity without specific care for humans, everyone dies;
If an ASI builds a Dyson sphere without specific care for humans, see above;
A more nuanced example: imagine that you have an ASI smart enough to build a high-fidelity simulation of you inside its cognitive process. Even if such an ASI doesn’t pursue any long-term goals, if it is not aligned to, say, respect your mental autonomy, any act of communication is going to turn into literal brainwashing.
The problem with the possibility of destroying a planet or two is how hard it is to contain a rogue ASI: if it is capable of destroying a planet, it’s capable of ejecting several von Neumann probes which can strike before we can come up with a defense, or of sending radio signals carrying computer viruses or harmful memes or copies of the ASI. But I think that if you have an unhackable simulation indistinguishable from the real world, and you are somehow unhackable by the ASI, you can eventually align it by simple methods from modern prosaic alignment. The problem is that you can’t say in advance which kind of finetuning you need, because you need generalization into untested domains in advance.
Doomimir: No, it wouldn’t! Are you retarded?
Simplicia: [apologetically] Well, actually …
Doomimir: [embarrassed] I’m sorry, Simplicia Optimistovna; I shouldn’t have snapped at you like that.
[diplomatically] But I think you’ve grievously misunderstood what the KL penalty in the RLHF objective is doing. Recall that the Kullback–Leibler divergence $D_{\mathrm{KL}}(P \,\|\, Q)$ represents how surprised you’d be by data from distribution P, that you expected to be from distribution Q.
It’s asymmetric: it blows up when the data is very unlikely according to Q, which amounts to seeing something happen that you thought was nearly impossible, but not when the data is very unlikely according to P, which amounts to not seeing something that you thought was reasonably likely.
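(A quick numerical illustration of that asymmetry, using made-up distributions rather than anything from an actual model: take $P = (0.5, 0.5)$ and $Q = (1 - 10^{-6}, 10^{-6})$ over two outcomes. Then
$$D_{\mathrm{KL}}(P \,\|\, Q) = 0.5\ln\tfrac{0.5}{1-10^{-6}} + 0.5\ln\tfrac{0.5}{10^{-6}} \approx 6.2\ \text{nats}, \qquad D_{\mathrm{KL}}(Q \,\|\, P) = (1-10^{-6})\ln\tfrac{1-10^{-6}}{0.5} + 10^{-6}\ln\tfrac{10^{-6}}{0.5} \approx 0.69\ \text{nats}.$$
The first direction grows without bound as Q’s probability on the second outcome shrinks, because samples from P routinely land on an outcome Q treats as nearly impossible; the reverse direction stays near $\ln 2$, because samples from Q almost never land on anything that P finds surprising.)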
We—I mean, not we, but the maniacs who are hell-bent on destroying this world—include a $D_{\mathrm{KL}}(\pi_{\mathrm{RLHF}} \,\|\, \pi_{\mathrm{base}})$ penalty term in the RL objective because they don’t want the updated policy to output tokens that would be vanishingly unlikely coming from the base language model.
But your specific example of threats and promises isn’t vanishingly unlikely according to the base model! Common Crawl webtext is going to contain a lot of natural language reasoning about threats and promises! It’s true, in a sense, that the function of the KL penalty term is to “stay close” to the base policy. But you need to think about what that means mechanistically; you can’t just reason that the webtext prior is somehow “safe” in a way that means staying KL-close to it is safe.
But you probably won’t understand what I’m talking about for another 70 days.
I’d be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes.
See here. I haven’t dug into it much, but it does talk about the same general issues specifically in the context of RLHF’d LLMs, not just pure-RL-trained models.
(I’ll get around to another Doomimir response later, just dropping that link for now.)