# johnswentworth

Karma: 42,898
• Third, the nontrivial prediction of 20 here is about “compactly describable errors.” “Mislabelling a large part of the time (but not most of the time)” is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you’d have a meaningful boost in generalization error, but that doesn’t happen. Easy Bayes update against #20. (And if we can’t agree on this, I don’t see what we can agree on.)

I indeed disagree with that, and I see two levels of mistake here. At the object level, there’s a mistake of not thinking through the gears. At the epistemic level, it looks like you’re trying to apply the “what would I have expected in advance?” technique of de-biasing, in a way which does not actually work well in practice. (The latter mistake I think is very common among rationalists.)

First, object-level: let’s walk through the gears of a mental model here. Model: train a model to predict labels for images, and it will learn a distribution of labels for each image (at least that’s how we usually train them). If we relabel 1’s as 7’s 20% of the time, then the obvious guess is that the model will assign about 20% probability (plus its “real underlying uncertainty”, which we’d expect to be small for large fully-trained models) to the label 7 when the digit is in fact a 1.

What does that predict about accuracy? That depends on whether the label we interpret our model as predicting is top-1, or sampled from the predictive distribution. If the former (as is usually used, and IIUC is used in the paper) then this concrete model would predict basically the curves we see in the paper: as noise ramps up, accuracy moves relatively little (especially for large fully-trained models), until the incorrect digit is approximately as probable as the correct digit, at which point accuracy plummets to ~50%. And once the incorrect digit is unambiguously more probable than the correct digit, accuracy drops to near-0.

The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit. And thinking through the gears of Yudkowsky’s #20, the obvious update is that predictable human-labeller-errors which are not the most probable labels are not super relevant (insofar as we use top-1 sampling, i.e. near-zero temperature) whereas human-labeller-errors which are most probable are a problem in basically the way Yudkowsky is saying. (… insofar as we should update at all from this experiment, which we shouldn’t very much.)
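A minimal sketch of that mental model, in case it helps to see the step-function shape explicitly. It assumes an idealized model whose predictive distribution on a true 1 is exactly P(7) = noise rate and P(1) = 1 minus noise rate, with no other uncertainty; the function name and the `decode` argument are made up for illustration and are not from the paper.

```python
def accuracy_on_ones(noise_rate: float, decode: str = "top1") -> float:
    """Expected accuracy on true 1s when a fraction `noise_rate` of them
    were relabelled as 7 during training.

    Idealizing assumption: the trained model reproduces the training label
    distribution exactly, so P(1) = 1 - noise_rate and P(7) = noise_rate.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    p_correct = 1.0 - noise_rate
    if decode == "sample":
        # Sampling from the predictive distribution: accuracy degrades linearly.
        return p_correct
    if decode == "top1":
        # Top-1 prediction: only the ordering of the two probabilities matters.
        if p_correct > noise_rate:
            return 1.0
        if p_correct == noise_rate:
            return 0.5  # exact tie, assumed to be broken at random
        return 0.0
    raise ValueError("decode must be 'top1' or 'sample'")


for p in (0.0, 0.2, 0.4, 0.5, 0.6):
    print(p, accuracy_on_ones(p, "top1"), accuracy_on_ones(p, "sample"))
```

Under these assumptions top-1 accuracy stays flat until the tie and then collapses, while sampled accuracy falls off linearly, which is the qualitative contrast the argument above relies on.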

Second, epistemic-level: my best guess is that you’re ignoring these gears because they’re not things whose relevance you would have anticipated in advance, and therefore focusing on them in hindsight risks bias[1]. Which, yes, it does risk bias.

Unfortunately, the first rule of experiments is You Are Not Measuring What You Think You Are Measuring. Which means that, in practice, the large majority of experiments which nominally attempt to test some model/​theory in a not-already-thoroughly-understood-domain end up getting results which are mostly determined by things unrelated to the model/​theory. And, again in practice, few-if-any people have the skill of realizing in advance which things will be relevant to the outcome of any given experiment. “Which things are we actually measuring?” is itself usually figured out (if it’s figured out at all) by looking at data from the experiment.

Now, this is still compatible with using the “what would I have expected in advance?” technique. But it requires that ~all the time, the thing I expect in advance from any given experiment is “this experiment will mostly measure some random-ass thing which has little to do with the model/​theory I’m interested in, and I’ll have to dig through the details of the experiment and results to figure out what it measured”. If one tries to apply the “what would I have expected in advance?” technique, in a not-thoroughly-understood domain, without an overwhelming prior that the experimental outcome is mostly determined by things other than the model/​theory of interest, then mostly one ends up updating in basically-random directions and becoming very confused.

1. ^

Standard disclaimer about guessing what’s going on inside other people’s heads being hard, you have more data than I on what’s in your head, etc.

• This one is somewhat more Wentworth-flavored than our previous Doomimirs.

Also, I’ll write Doomimir’s part unquoted this time, because I want to use quote blocks within it.

On to Doomimir!

We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen?

Short answer: because those aren’t actually very effective ways to get high ratings, at least within the current capability regime.

Long version: presumably the labeller knows perfectly well that they’re working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that… have you ever personally done an exercise where you try to convince someone to do something they don’t want to do, or aren’t supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner’s job was to not open their fist, while the other partner’s job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar—not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.

Point of that story: based on my own firsthand experience, if you’re not actually going to pay someone right now, then it’s far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.

Ultimately, our discussion is using “threats and bribes” as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.

Now, you could reasonably respond: “Isn’t it kinda fishy that the supposed failures on which your claim rests are ‘illegible’?”

To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:

The iterative design loop hasn’t failed yet.

Now that’s a very interesting claim. I ask: what do you think you know, and how do you think you know it?

Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the “observe/​orient” steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don’t necessarily know that it’s failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don’t know a problem is there, or don’t orient toward the right thing, and therefore we don’t iterate on the problem.

What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:

• There is some widely-believed falsehood. The generative model might “know” the truth, from having trained on plenty of papers by actual experts, but the raters don’t know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/​Deepmind/​Anthropic do not employ experts in most of the world’s subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.

• There is some even-more-widely-believed falsehood, such that even the so-called “experts” haven’t figured out yet that it’s false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.

• Neither raters nor developers have time to check the models’ citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.

• On various kinds of “which option should I pick” questions, there’s an option which results in marginally more slave labor, or factory farming, or what have you—terrible things which a user might strongly prefer to avoid, but it’s extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don’t reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).

• This is the sort of problem which, in the high-capability regime, especially leads to “Potemkin village world”.

• On various kinds of “which option should I pick” questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don’t even attempt to test which options are best long-term, they just read the LLM’s response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.

• On various kinds of “which option should I pick” questions, there are things which sound fun or are marketed as fun, but which humans mostly don’t actually enjoy (or don’t enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)

… and so forth. The unifying theme here is that when these failures occur, it is not obvious that they’ve occurred.

This makes empirical study tricky: not impossible, but it’s easy to be misled by experimental procedures which don’t actually measure the relevant things. For instance, your summary of the Stiennon et al paper just now:

They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve...

(Bolding mine.) As you say, one could spin that as demonstrating “yet another portent of our impending deaths”, but really this paper just isn’t measuring the most relevant things in the first place. It’s still using human ratings as the evaluation mechanism, so it’s not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.

So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?

(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)

.

.

.

So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that’s “better than” the labels used for training.

OpenAI’s weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/​labels, so humans (or their code) can be used as a “ground truth” which is better than the ratings/​labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter, I haven’t put that much care into it.)

More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most “coming decoupled from reality” problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That’s what tends to happen, in fields where people don’t have a deep understanding of the systems they’re working with.

And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that’s just the experimentalist themselves—this whole thing is severely compounded in the context of e.g. a company/​government full of middle managers who definitely will not understand the subtleties of the experimentalists’ interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure—John Boyd, the guy who introduced the term “OODA loop”, talked a lot about that sort of failure.)

To summarize the main points here:

• Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (… and then things are hard.)

• AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.

• So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/​partially/​not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.

Now, I would guess that your next question is: “But how does that lead to extinction?”. That is one of the steps which has been least well-explained historically; someone with your “unexpectedly low polygenic scores” can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you… <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/​or progenitors and/​or feedback-generators for future very-superintelligent AIs?

But I’ll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.

• Is the Waldo picture at the end supposed to be Holden, or is that accidental?

• The linked abstract describes how

[good generalization] holds across multiple patterns of label noise, even when erroneous labels are biased towards confusing classes.

Reading their experimental procedure and looking at Figures 4 & 5, it looks like their experiments confirm the general story of lethality #20, not disprove it.

The relevant particulars: when they used biased noise, they still ensured that the correct label was the most probable label. Their upper-limit for biased noise made the second-most-probable label equal in probability to the correct one, and in that case the predictor’s generalization accuracy plummeted from near-90% (when the correct label was only slightly more probable than the next-most-probable) to only ~50%.

How this relates to lethality #20: part of what “regular, compactly describable, predictable errors” is saying is that there will be (predictable) cases where the label most probably assigned by a human labeller is not correct (i.e. it’s not what a smart well-informed human would actually want if they had all the relevant info and reflected on it). What the results of the linked paper predict, in that case, is that the net will learn to assign the “incorrect” label—the one which human labellers do, in fact, choose more often than any other. (Though, to be clear, I think this experiment is not very highly relevant one way or the other.)

As for OpenAI’s weak-to-strong results...

I had some back-and-forth about those in a private chat shortly after they came out, and the main thing I remember is that it was pretty tricky to back out the actually-relevant numbers, but it was possible. Going back to the chat log just now, this is the relevant part of my notes:

Rough estimate: on the NLP task the weak model has like 60% accuracy (fig 2).

• In cases where the weak model is right, the strong student agrees with it in like 90% of cases (fig 8b). So, on ~6% of cases (10% * 60%), the strong student is wrong by “just being dumb”.

• In cases where the weak model is wrong, the strong student’s agreement is very compute-dependent, but let’s pick a middle number and call it 70% (fig 8c). So, on ~28% of cases (70% * 40%), the strong student is wrong by “overfitting to weak supervision”.

So in this particular case, the strong student is wrong about 34% of the time, and 28 of those percentage points are attributable to overfitting to weak supervision.
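For anyone who wants to redo that arithmetic with different numbers read off the figures, here is a small sketch. It treats the task as binary, so that disagreeing with a wrong weak label means being right and disagreeing with a right weak label means being wrong; that assumption is implicit in the notes above, and the function and argument names are just illustrative.

```python
def error_decomposition(weak_acc: float,
                        agree_when_weak_right: float,
                        agree_when_weak_wrong: float) -> tuple[float, float, float]:
    """Split the strong student's error rate into two sources.

    Assumes a binary task: on any example the student either matches the
    weak label or gives the opposite answer.
    Returns (error from "just being dumb",
             error from "overfitting to weak supervision",
             total error).
    """
    # Weak label is right, but the student disagrees with it.
    err_dumb = weak_acc * (1.0 - agree_when_weak_right)
    # Weak label is wrong, and the student copies the mistake.
    err_overfit = (1.0 - weak_acc) * agree_when_weak_wrong
    return err_dumb, err_overfit, err_dumb + err_overfit


# Rough numbers from the notes: 60% weak accuracy, 90% and 70% agreement.
dumb, overfit, total = error_decomposition(0.60, 0.90, 0.70)
print(f"dumb={dumb:.2f} overfit={overfit:.2f} total={total:.2f}")
```

With the rough inputs above this gives about 0.06, 0.28 and 0.34, matching the percentages in the notes.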

(Here “overfitting to weak supervision” is the thing where the weak supervisor is predictably wrong, and the stronger model learns to predict those errors.) So in fact what we’re seeing in the weak-to-strong paper is that the strong model learning the weak supervisor’s errors is already the main bottleneck to better ground-truth performance, in the regime that task and models were in.

So overall, I definitely maintain that the empirical evidence is solidly in favor of Doomimir’s story here. (And, separately, I definitely maintain that abstracts in ML tend to be wildly unreliable and misleading about the actual experimental results.)

• Fleshing this out a bit more: insofar as development is synchronized in an organism, there usually has to be some high-level signal to trigger the synchronized transitions. Given the scale over which the signal needs to apply (i.e. across the whole brain in this case), it probably has to be one or a few small molecules which diffuse in the extracellular space. As I’m looking into possibilities here, one of my main threads is to look into both general and brain-specific developmental signal molecules in human childhood, to find candidates for the relevant molecular signals.

(One major alternative model I’m currently tracking is that the brain grows to fill the brain vault, and then stops growing. That could in-principle mechanistically work via cells picking up on local physical forces, rather than a small molecule signal. Though I don’t think that’s the most likely possibility, it would be convenient, since it would mean that just expanding the skull could induce basically-normal new brain growth by itself.)

• Doomimir: I’ll summarize the story you seem excited about as follows:

• We train a predictive model on The Whole Internet, so it’s really good at predicting text from that distribution.

• The human end-users don’t really want a predictive model. They want a system which can take a natural-language request, and then do what’s requested. So, the humans slap a little RL (specifically RLHF) on the predictive model, to get the “request → do what’s requested” behavior.

• The predictive model serves as a strong baseline for the RL’d system, so the RL system can “only move away from it a little” in some vague handwavy sense. (Also in the KL divergence sense, which I will admit as non-handwavy for exactly those parts of your argument which you can actually mathematically derive from KL-divergence bounds, which is currently zero of the parts of your argument.)

• The “only move away from The Internet Distribution a little bit” part somehow makes it much less likely that the RL’d model will predict and exploit the simple predictable ways in which humans rate things. As opposed to, say, make it more likely that the RL’d model will predict and exploit the simple predictable ways in which humans rate things.

There are multiple problems in this story.

First, there’s the end-users demanding a more agenty system rather than a predictor, which is why people are doing RLHF in the first place rather than raw prompting (which would be better from a safety perspective). Given time, that same demand will drive developers to make models agentic in other ways too (think AgentGPT), or to make the RLHF’d LLMs more agentic and autonomous in their own right. That’s not the current center of our discussion, but it’s worth a reminder that it’s the underlying demand which drives developers to choose more risky methods (like RLHF) over less risky methods (like raw predictive models) in the first place.

Second, there’s the vague handwavy metaphor about the RL system “only moving away from the predictive model a little bit”. The thing is, we do need more than a handwavy metaphor! “Yes, we don’t understand at the level of math how making that KL-divergence small will actually impact anything we actually care about, but my intuition says it’s definitely not going to kill everyone. No, I haven’t been able to convince relevant experts outside of companies whose giant piles of money are contingent on releasing new AI products regularly, but that’s because they’re not releasing products and therefore don’t have firsthand experience of how these systems behave. No, I’m not willing to subject AI products to a burden-of-proof before they induce a giant disaster” is a non-starter even if it turns out to be true.

Third and most centrally to the current discussion, there’s still the same basic problem as earlier: to a system with priors instilled by The Internet, [“I’ll give you $100 if you classify this as an apple” → (predict apple classification)] is still a simple thing to learn. It’s not like pretraining on the internet is going to make the system favor models which don’t exploit the highly predictable errors made by human raters. If anything, all that pretraining will make it easier for the model to exploit raters. (And indeed, IIUC that’s basically what we see in practice.)

As you say: the fact that GPT-4 can do that seems like it’s because that kind of reasoning appears on the internet.

(This one’s not as well-written IMO, it’s mashing a few different things together.)

• I’d be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes.

See here. I haven’t dug into it much, but it does talk about the same general issues specifically in the context of RLHF’d LLMs, not just pure-RL-trained models.

(I’ll get around to another Doomimir response later, just dropping that link for now.)

• Zeroth point: under a Doomimir-ish view, the “modelling the human vs modelling in a similar way to the human” frame is basically right for current purposes, so no frame clash.

On to the main response...

Doomimir: This isn’t just an “in the limit” argument. “I’ll give you $100 if you classify this as an apple” → (predict apple classification) is not some incredibly high-complexity thing to figure out. This isn’t a jupiter-brain sort of challenge.

For instance, anything with a simplicity prior at all similar to humans’ simplicity prior will obviously figure it out, as evidenced by the fact that humans can figure out hypotheses like “it’s bribing the classifier” just fine. Even beyond human-like priors, any ML system which couldn’t figure out something that basic would apparently be severely inferior to humans in at least one very practically-important cognitive domain.

Even prior to developing a full-blown model of the human rater, models can incrementally learn to predict the systematic errors in human ratings, and we can already see that today. The classic case of the grabber hand is a go-to example:

(A net learned to hold the hand in front of the ball, so that it looks to a human observer like the ball is being grasped. Yes, this actually happened.)

… and anecdotally, I’ve generally heard from people who’ve worked with RLHF that as models scale up, they do in fact exploit rater mistakes more and more, and it gets trickier to get them to do what we actually want. This business about “The technology in front of us really does seem like it’s ‘reasoning with’ rather than ‘reasoning about’” is empirically basically false, and seems to get more false in practice as models get stronger even within the current relatively-primitive ML regime.

So no, this isn’t a “complicated empirical question” (or a complicated theoretical question). The people saying “it’s a complicated empirical question, we Just Can’t Know” are achieving their apparent Just Not Knowing by sticking their heads in the sand; their lack of knowledge is a fact about them, not a fact about the available evidence.

(I’ll flag here that I’m channeling the character of Doomimir and would not necessarily say all of these things myself, especially the harsh parts. Happy to play out another few rounds of this, if you want.)

• Ever since GeneSmith’s post and some discussion downstream of it, I’ve started actively tracking potential methods for large interventions to increase adult IQ.

One obvious approach is “just make the brain bigger” via some hormonal treatment (like growth hormone or something). Major problem that runs into: the skull plates fuse during development, so the cranial vault can’t expand much; in an adult, the brain just doesn’t have much room to grow.

BUT this evening I learned a very interesting fact: ~1/​2000 infants have “craniosynostosis”, a condition in which their plates fuse early. The main treatments involve surgery to open those plates back up and/​or remodel the skull. Which means surgeons already have a surprisingly huge amount of experience making the cranial vault larger after plates have fused (including sometimes in adults, though this type of surgery is most common in infants AFAICT)

… which makes me think that cranial vault remodelling followed by a course of hormones for growth (ideally targeting brain growth specifically) is actually very doable with current technology.

• Yeah, those were exactly the two videos which most made me think that the model was mostly trained on video game animation. In the Tokyo one, the woman’s facial muscles never move at all, even when the camera zooms in on her. And in the SUV one, the dust cloud isn’t realistic, but even covering that up the SUV has a Grand Theft Auto look to its motion.

“Can’t do both complex motion and photorealism in the same video” is a good hypothesis to track, thanks for putting that one on my radar.

• I keep seeing news outlets and the like say that SORA generates photorealistic videos, can model how things move in the real world, etc. This seems like blatant horseshit? Every single example I’ve seen looks like video game animation, not real-world video.

Have I just not seen the right examples, or is the hype in fact decoupled somewhat from the model’s outputs?

• Well, it wasn’t just a temporary bump:

… so it’s presumably also not just the result of pandemic giveaway fraud, unless that fraud is ongoing.

Presumably the thing to check here would be TFP, but FRED’s US TFP series currently only goes to the end of 2019, so apparently we’re still waiting on that one? Either that or I’m looking at the wrong series.

• But I’m not sure how to reconcile that with the empirical evidence that deep networks are robust to massive label noise: you can train on MNIST digits with twenty noisy labels for every correct one and still get good performance as long as the correct label is slightly more common than the most common wrong label. If I extrapolate that to the frontier AIs of tomorrow, why doesn’t that predict that biased human reward ratings should result in a small performance reduction, rather than … death?
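To make the quoted “twenty noisy labels for every correct one” concrete, here is a quick sketch of the label shares that ratio implies. It assumes the noisy labels are drawn uniformly over all 10 MNIST classes, including the correct one; that is my reading of the uniform-noise setup rather than something stated in the quote, and the function name is hypothetical.

```python
from fractions import Fraction


def label_shares(noisy_per_clean: int, num_classes: int = 10) -> tuple[Fraction, Fraction]:
    """Share of training labels on the correct class vs. on each wrong class.

    Assumes every clean label comes with `noisy_per_clean` extra labels
    drawn uniformly at random over all `num_classes` classes.
    """
    alpha = Fraction(noisy_per_clean)
    total = 1 + alpha                       # labels per underlying example
    per_class_noise = alpha / num_classes   # noisy labels landing on any one class
    correct = (1 + per_class_noise) / total
    each_wrong = per_class_noise / total
    return correct, each_wrong


correct, each_wrong = label_shares(20)
print(correct, each_wrong)
```

Under that assumption the correct class holds 1/7 of the labels and each wrong class holds 2/21, so the correct label is still the single most probable one even though it is a small minority of all labels.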

The conversation didn’t quite get to Doomimir actually answering this part, but I’d consider the standard answer to be item #20 on Eliezer’s List O’Doom:

20. Human operators are fallible, breakable, and manipulable. Human raters make systematic errors—regular, compactly describable, predictable errors. To faithfully learn a function from ‘human feedback’ is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we’d hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It’s a fact about the territory, not the map—about the environment, not the optimizer—that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.

… and yeah, there are definitely nonzero empirical results on that.

• Soooo… they caught and disrupted use by “state-affiliated threat actors” associated with a bunch of countries at odds with the US, but not any of the US’ allies?

What an interesting coincidence.

• Don’t really need comments which are non-obvious to an expert. Part of what makes LLMs well-suited to building external cognitive tools is that external cognitive tools can create value by just tracking “obvious” things, thereby freeing up the user’s attention/​working memory for other things.

• I haven’t experimented very much, but here’s one example prompt.

Please describe what you mentally picture when reading the following block of text:

A Shutdown Problem Proposal

First things first: this is not (yet) aimed at solving the whole corrigibility problem, or even the whole shutdown problem.

The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:

Does not want to manipulate the shutdown button
Does respond to the shutdown button
Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.

This one produced basically-decent results from GPT-4.

Although I don’t have the exact prompt on hand at the moment, I’ve also asked GPT-4 to annotate a piece of code line-by-line with a Fermi estimate of its runtime, which worked pretty well.