Fixed, thanks.
johnswentworth
The Worst Form Of Government (Except For Everything Else We’ve Tried)
The Parable Of The Fallen Pendulum—Part 2
Yeah, that’s right.
The secret handshake is to start with “X is independent of Y given Z” and “X is independent of Z given Y”, expressed in this particular form:

P[X|Y,Z] = P[X|Z]
P[X|Y,Z] = P[X|Y]

… then we immediately see that P[X|Y] = P[X|Z] for all Y, Z such that P[Y,Z] > 0.

So if there are no zero probabilities, then P[X|Y] = P[X|Z] for all Y, Z.

That, in turn, implies that P[X|Z] takes on the same value for all Z, which in turn means that it’s equal to P[X]. Thus X and Z are independent. Likewise for X and Y. Finally, we leverage independence of X and Y given Z:

P[X,Y,Z] = P[X|Z] P[Y|Z] P[Z] = P[X] P[Y|Z] P[Z] = P[X] P[Y,Z]
(A similar argument is in the middle of this post, along with a helpful-to-me visual.)
Roughly speaking, all variables completely independent is the only way to satisfy all the preconditions without zero-ish probabilities.
This is easiest to see if we use a “strong invariance” condition, in which each of the X_i must mediate between Λ and the rest of the X’s. Mental picture: equilibrium gas in a box, in which we can measure roughly the same temperature and pressure (Λ) from any little spatially-localized chunk of the gas (X_i). If I estimate a temperature of 10°C from one little chunk of the gas, then the probability of estimating 20°C from another little chunk must be approximately-zero. The only case where that doesn’t imply near-zero probabilities is when all values of both chunks of gas always imply the same temperature, i.e. Λ only ever takes on one value (and is therefore informationally empty). And in that case, the only way the conditions are satisfied is if the chunks of gas are unconditionally independent.
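To make the zero-probability loophole concrete, here’s a small numerical check (my own construction, not anything from the original discussion): a perfectly correlated triple satisfies both conditional independence conditions, but only by putting zero probability on all the mixed outcomes, and it is certainly not unconditionally independent.

```python
import numpy as np

# Toy check of the zero-probability loophole: let X = Y = Z be one shared
# fair coin. Then X ⊥ Y | Z and X ⊥ Z | Y both hold, but only because most
# outcomes have probability zero, and X is maximally dependent on Z.
P = np.zeros((2, 2, 2))  # P[x, y, z]
P[0, 0, 0] = 0.5
P[1, 1, 1] = 0.5

def indep_given_last_axis(P):
    """Check A ⊥ B | C for a joint array P[a, b, c], wherever P[c] > 0."""
    for c in range(P.shape[2]):
        pc = P[:, :, c].sum()
        if pc == 0:
            continue
        joint = P[:, :, c] / pc                                 # P[a, b | c]
        prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))   # P[a|c] P[b|c]
        if not np.allclose(joint, prod):
            return False
    return True

print(indep_given_last_axis(P))                     # X ⊥ Y | Z holds: True
print(indep_given_last_axis(P.transpose(0, 2, 1)))  # X ⊥ Z | Y holds: True
Pxz = P.sum(axis=1)                                 # marginal P[x, z]
print(np.allclose(Pxz, np.outer(Pxz.sum(1), Pxz.sum(0))))  # X ⊥ Z? False
```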
I agree with this point as stated, but think the probability is more like 5% than 0.1%
Same.
I do think our chances look not-great overall, but most of my doom-probability is on things which don’t look like LLMs scheming.
Also, are you making sure to condition on “scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity”?
That’s not particularly cruxy for me either way.
Separately, I’m uncertain whether the training procedure of current models like GPT-4 or Claude 3 is still well described as just “light RLHF”.
Fair. Insofar as “scaling up networks, running pretraining + RL” does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.
Solid post!
I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn’t by itself produce a schemer), and I think this is the best write-up of it I’ve seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.
Yup.
Mind sharing a more complete description of the things you tried? Like, the sort of description which one could use to replicate the experiment?
What was your old job?
The Parable Of The Fallen Pendulum—Part 1
Did you see the footnote I wrote on this? I give a further argument for it.
Ah yeah, I indeed missed that the first time through. I’d still say I don’t buy it, but that’s a more complicated discussion, and it is at least a decent argument.
I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition I’m open to hearing it.
This is another place where I’d say we don’t understand it well enough to give a good formal definition or operationalization yet.
Though I’d note here, and also above w.r.t. search, that “we don’t know how to give a good formal definition yet” is very different from “there is no good formal definition” or “the underlying intuitive concept is confused” or “we can’t effectively study the concept at all” or “arguments which rely on this concept are necessarily wrong/uninformative”. Every scientific field was pre-formal/pre-paradigmatic once.
To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field’s ability to correctly diagnose bullshit.
That said, I don’t think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.
Man, that is one hell of a bullet to bite. Much kudos for intellectual bravery and chutzpah!
That might be a fun topic for a longer discussion at some point, though not right now.
I would like to see a much more rigorous definition of “search” and why search would actually be “compressive” in the relevant sense for NN inductive biases. My current take is something like “a lot of the references to internal search on LW are just incoherent” and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward “search” in ways that are totally benign.
More generally, I’m quite skeptical of the jump from any mechanistic notion of search, and the kind of grabby consequentialism that people tend to be worried about. I suspect there’s a double dissociation between these things, where “mechanistic search” is almost always benign, and grabby consequentialism need not be backed by mechanistic search.
Some notes on this:
I don’t think general-purpose search is sufficiently well-understood yet to give a rigorous mechanistic definition. (Well, unless one just gives a very wrong definition.)
Likewise, I don’t think we understand either search or NN biases well enough yet to make a formal compression argument. Indeed, that sounds like a roughly-agent-foundations-complete problem.
I’m pretty skeptical that internal general-purpose search is compressive in current architectures. (And this is one reason why I expect most AI x-risk to come from importantly-different future architectures.) Low confidence, though.
Also, current architectures do have at least some “externalized” general-purpose search capabilities, insofar as they can mimic the “unrolled” search process of a human or group of humans thinking out loud. That general-purpose search process is basically AgentGPT. Notably, it doesn’t work very well to date.
Insofar as I need a working not-very-formal definition of general-purpose search, I usually use a behavioral definition: a system which can take in a representation of a problem in some fairly-broad class of problems (typically in a ~fixed environment), and solve it.
The argument that a system which satisfies that behavioral definition will tend to also have an “explicit search-architecture”, in some sense, comes from the recursive nature of problems. E.g. humans solve large novel problems by breaking them into subproblems, and then doing their general-purpose search/problem-solving on the subproblems; that’s an explicit search architecture.
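As a cartoon of that recursive structure (an illustration of mine, not anything from the original argument), the skeleton of such a problem-solver is just a solver which calls itself on the pieces produced by decomposition:

```python
# Toy sketch of an "explicit search architecture": a general-purpose solver
# which recursively breaks a problem into subproblems, solves each with the
# same general-purpose machinery, and combines the results.
def solve(problem, decompose, solve_base, combine):
    subproblems = decompose(problem)
    if not subproblems:  # base case: problem is directly solvable
        return solve_base(problem)
    return combine([solve(p, decompose, solve_base, combine) for p in subproblems])

# Hypothetical example problem: summing a nested list via recursive decomposition.
nested = [1, [2, 3], [[4], 5]]
total = solve(
    nested,
    decompose=lambda p: p if isinstance(p, list) else [],
    solve_base=lambda p: p,
    combine=sum,
)
print(total)  # → 15
```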
I definitely agree that grabby consequentialism need not be backed by mechanistic search. More skeptical of the claim that mechanistic search is usually benign, at least if by “mechanistic search” we mean general-purpose search (though I’d agree with a version of this which talks about a weaker notion of “search”).
Also, one maybe relevant deeper point, since you seem familiar with some of the philosophical literature: IIUC the most popular way philosophers ground semantics is in the role played by some symbol/signal in the evolutionary environment. I view this approach as a sort of placeholder: it’s definitely not the “right” way to ground semantics, but philosophy as a field is using it as a stand-in until people work out better models of grounding (regardless of whether the philosophers themselves know that they’re doing so). This is potentially relevant to the “representation of a problem” part of general-purpose search.
I’m curious which parts of the Goal Realism section you find “philosophically confused,” because we are trying to correct what we consider to be deep philosophical confusion fairly pervasive on LessWrong.
(I’ll briefly comment on each section, feel free to double-click.)
Against Goal Realism: Huemer… indeed seems confused about all sorts of things, and I wouldn’t consider either the “goal realism” or “goal reductionism” picture solid grounds for use of an indifference principle (not sure if we agree on that?). Separately, “reductionism as a general philosophical thesis” does not imply the thing you call “goal reductionism”—for instance one could reduce “goals” to some internal mechanistic thing, rather than thinking about “goals” behaviorally, and that would be just as valid for the general philosophical/scientific project of reductionism. (Not that I necessarily think that’s the right way to do it.)
Goal Slots Are Expensive: just because it’s “generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules” doesn’t mean the end-to-end trained system will turn out non-modular. Biological organisms were trained end-to-end by evolution, yet they ended up very modular.
Inner Goals Would Be Irrelevant: I think the point this section was trying to make is something I’d classify as a pointer problem? I.e. the internal symbolic “goal” does not necessarily neatly correspond to anything in the environment at all. If that was the point, then I’m basically on-board, though I would mention that I’d expect evolution/SGD/cultural evolution/within-lifetime learning/etc to drive the internal symbolic “goal” to roughly match natural structures in the world. (Where “natural structures” cashes out in terms of natural latents, but that’s a whole other conversation.)
Goal Realism Is Anti-Darwinian: Fodor obviously is deeply confused, but I think you’ve misdiagnosed what he’s confused about. “The physical world has no room for goals with precise contents” is somewhere between wrong and a non sequitur, depending on how we interpret the claim. “The problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter” is correct, but very incomplete as a response to Fodor.
Goal Reductionism Is Powerful: While most of this section sounds basically-correct as written, the last few sentences seem to be basically arguing for behaviorism for LLMs. There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.
This isn’t a proper response to the post, but since I’ve occasionally used counting-style arguments in the past I think I should at least lay out some basic agree/disagree points. So:
This post basically-correctly refutes a kinda-mediocre (though relatively-commonly-presented) version of the counting argument.
There does exist a version of the counting argument which basically works.
The version which works routes through compression and/or singular learning theory.
In particular, that version would talk about “goal-slots” (i.e. general-purpose search) showing up for exactly the same reasons that neural networks are able to generalize in the overparameterized regime more generally. In other words, if you take the “counting argument for overfitting” from the post, walk through the standard singular-learning-theory-style response to that story, and then translate that response over to general-purpose search as a specific instance of compression, then you basically get the good version of the counting argument.
Just remembered I walked through basically the good version of the counting argument in this section of What’s General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?
The “Against Goal Realism” section is a wild mix of basically-correct points and thorough philosophical confusion. I would say the overall point it’s making is probably mostly-true of LLMs, false of humans, and most of the arguments are confused enough that they don’t provide much direct evidence relevant to either of those.
Pretty decent post overall.
Edited, thanks.
Third, the nontrivial prediction of #20 here is about “compactly describable errors.” “Mislabelling a large part of the time (but not most of the time)” is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you’d have a meaningful boost in generalization error, but that doesn’t happen. Easy Bayes update against #20. (And if we can’t agree on this, I don’t see what we can agree on.)
I indeed disagree with that, and I see two levels of mistake here. At the object level, there’s a mistake of not thinking through the gears. At the epistemic level, it looks like you’re trying to apply the “what would I have expected in advance?” technique of de-biasing, in a way which does not actually work well in practice. (The latter mistake I think is very common among rationalists.)
First, object-level: let’s walk through the gears of a mental model here. Model: train a model to predict labels for images, and it will learn a distribution of labels for each image (at least that’s how we usually train them). If we relabel 1′s as 7′s 20% of the time, then the obvious guess is that the model will assign about 20% probability (plus its “real underlying uncertainty”, which we’d expect to be small for large fully-trained models) to the label 7 when the digit is in fact a 1.
What does that predict about accuracy? That depends on whether the label we interpret our model as predicting is top-1, or sampled from the predictive distribution. If the former (as is usually used, and IIUC is used in the paper) then this concrete model would predict basically the curves we see in the paper: as noise ramps up, accuracy moves relatively little (especially for large fully-trained models), until the incorrect digit is approximately as probable as the correct digit, at which point accuracy plummets to ~50%. And once the incorrect digit is unambiguously more probable than the correct digit, accuracy drops to near-0.
The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit. And thinking through the gears of Yudkowsky’s #20, the obvious update is that predictable human-labeller-errors which are not the most probable labels are not super relevant (insofar as we use top-1 sampling, i.e. near-zero temperature) whereas human-labeller-errors which are most probable are a problem in basically the way Yudkowsky is saying. (… insofar as we should update at all from this experiment, which we shouldn’t very much.)
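Those gears fit in a few lines of code (a toy model of mine, not the paper’s actual setup): a calibrated predictor trained on data where true 1s are relabeled as 7s with probability p will assign probability p to “7” on images of a 1, and argmax decoding ignores that entirely until p crosses 0.5:

```python
# Toy model: under top-1 (argmax) decoding, accuracy on true-1 images is
# flat at 1.0 until the noise rate p reaches 0.5, then collapses, matching
# the qualitative shape of the curves in the paper.
def top1_accuracy_on_true_ones(p_noise: float) -> float:
    # calibrated predictive distribution over labels for an image of a 1
    predictive = {"1": 1.0 - p_noise, "7": p_noise}
    if predictive["1"] > predictive["7"]:
        return 1.0  # correct label still wins the argmax
    if predictive["1"] < predictive["7"]:
        return 0.0  # the systematic mislabel now wins
    return 0.5      # exact tie, split arbitrarily

for p in [0.0, 0.2, 0.4, 0.5, 0.6]:
    print(p, top1_accuracy_on_true_ones(p))
# → 1.0, 1.0, 1.0, 0.5, 0.0 respectively
```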
Second, epistemic-level: my best guess is that you’re ignoring these gears because they’re not things whose relevance you would have anticipated in advance, and therefore focusing on them in hindsight risks bias[1]. Which, yes, it does risk bias.
Unfortunately, the first rule of experiments is You Are Not Measuring What You Think You Are Measuring. Which means that, in practice, the large majority of experiments which nominally attempt to test some model/theory in a not-already-thoroughly-understood-domain end up getting results which are mostly determined by things unrelated to the model/theory. And, again in practice, few-if-any people have the skill of realizing in advance which things will be relevant to the outcome of any given experiment. “Which things are we actually measuring?” is itself usually figured out (if it’s figured out at all) by looking at data from the experiment.
Now, this is still compatible with using the “what would I have expected in advance?” technique. But it requires that ~all the time, the thing I expect in advance from any given experiment is “this experiment will mostly measure some random-ass thing which has little to do with the model/theory I’m interested in, and I’ll have to dig through the details of the experiment and results to figure out what it measured”. If one tries to apply the “what would I have expected in advance?” technique, in a not-thoroughly-understood domain, without an overwhelming prior that the experimental outcome is mostly determined by things other than the model/theory of interest, then mostly one ends up updating in basically-random directions and becoming very confused.
[1] Standard disclaimer about guessing what’s going on inside other people’s heads being hard, you have more data than I on what’s in your head, etc.
This one is somewhat more Wentworth-flavored than our previous Doomimirs.
Also, I’ll write Doomimir’s part unquoted this time, because I want to use quote blocks within it.
On to Doomimir!
We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen?
Let’s start with this.
Short answer: because those aren’t actually very effective ways to get high ratings, at least within the current capability regime.
Long version: presumably the labeller knows perfectly well that they’re working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that… have you ever personally done an exercise where you try to convince someone to do something they don’t want to do, or aren’t supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner’s job was to not open their fist, while the other partner’s job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar—not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.
Point of that story: based on my own firsthand experience, if you’re not actually going to pay someone right now, then it’s far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.
Ultimately, our discussion is using “threats and bribes” as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.
Now, you could reasonably respond: “Isn’t it kinda fishy that the supposed failures on which your claim rests are ‘illegible’?”
To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:
The iterative design loop hasn’t failed yet.
Now that’s a very interesting claim. I ask: what do you think you know, and how do you think you know it?
Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the “observe/orient” steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don’t necessarily know that it’s failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don’t know a problem is there, or don’t orient toward the right thing, and therefore we don’t iterate on the problem.
What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:
There is some widely-believed falsehood. The generative model might “know” the truth, from having trained on plenty of papers by actual experts, but the raters don’t know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/Deepmind/Anthropic do not employ experts in most of the world’s subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
There is some even-more-widely-believed falsehood, such that even the so-called “experts” haven’t figured out yet that it’s false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.
Neither raters nor developers have time to check the models’ citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
On various kinds of “which option should I pick” questions, there’s an option which results in marginally more slave labor, or factory farming, or what have you—terrible things which a user might strongly prefer to avoid, but it’s extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don’t reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).
This is the sort of problem which, in the high-capability regime, especially leads to “Potemkin village world”.
On various kinds of “which option should I pick” questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don’t even attempt to test which options are best long-term, they just read the LLM’s response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
On various kinds of “which option should I pick” questions, there are things which sound fun or are marketed as fun, but which humans mostly don’t actually enjoy (or don’t enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)
… and so forth. The unifying theme here is that when these failures occur, it is not obvious that they’ve occurred.
This makes empirical study tricky—not impossible, but it’s easy to be misled by experimental procedures which don’t actually measure the relevant things. For instance, your summary of the Stiennon et al. paper just now:
They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve...
(Bolding mine.) As you say, one could spin that as demonstrating “yet another portent of our impending deaths”, but really this paper just isn’t measuring the most relevant things in the first place. It’s still using human ratings as the evaluation mechanism, so it’s not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.
So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?
(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)
.
.
.
So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that’s “better than” the labels used for training.
OpenAI’s weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a “ground truth” which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter, I haven’t put that much care into it.)
More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most “coming decoupled from reality” problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That’s what tends to happen, in fields where people don’t have a deep understanding of the systems they’re working with.
And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that’s just the experimentalist themselves—this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists’ interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure—John Boyd, the guy who introduced the term “OODA loop”, talked a lot about that sort of failure.)
To summarize the main points here:
Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (… and then things are hard.)
AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.
Now, I would guess that your next question is: “But how does that lead to extinction?”. That is one of the steps which has been least well-explained historically; someone with your “unexpectedly low polygenic scores” can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you… <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?
But I’ll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.
Is the Waldo picture at the end supposed to be Holden, or is that accidental?
The linked abstract describes how
[good generalization] holds across multiple patterns of label noise, even when erroneous labels are biased towards confusing classes.
Reading their experimental procedure and looking at Figures 4 & 5, it looks like their experiments confirm the general story of lethality #20, not disprove it.
The relevant particulars: when they used biased noise, they still ensured that the correct label was the most probable label. Their upper-limit for biased noise made the second-most-probable label equal in probability to the correct one, and in that case the predictor’s generalization accuracy plummeted from near-90% (when the correct label was only slightly more probable than the next-most-probable) to only ~50%.
How this relates to lethality #20: part of what “regular, compactly describable, predictable errors” is saying is that there will be (predictable) cases where the label most probably assigned by a human labeller is not correct (i.e. it’s not what a smart well-informed human would actually want if they had all the relevant info and reflected on it). What the results of the linked paper predict, in that case, is that the net will learn to assign the “incorrect” label—the one which human labellers do, in fact, choose more often than any other. (Though, to be clear, I think this experiment is not very highly relevant one way or the other.)
As for OpenAI’s weak-to-strong results...
I had some back-and-forth about those in a private chat shortly after they came out, and the main thing I remember is that it was pretty tricky to back out the actually-relevant numbers, but it was possible. Going back to the chat log just now, this is the relevant part of my notes:
Rough estimate: on the NLP task the weak model has like 60% accuracy (fig 2).
In cases where the weak model is right, the strong student agrees with it in like 90% of cases (fig 8b). So, on ~6% of cases (10% * 60%), the strong student is wrong by “just being dumb”.
In cases where the weak model is wrong, the strong student’s agreement is very compute-dependent, but let’s pick a middle number and call it 70% (fig 8c). So, on ~28% of cases (70% * 40%), the strong student is wrong by “overfitting to weak supervision”.
So in this particular case, the strong student is wrong about 34% of the time, and 28 of those percentage points are attributable to overfitting to weak supervision.
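For the record, the bookkeeping above is just (numbers are the rough figure-readings from my notes, not exact values from the paper):

```python
# Back-of-envelope from the notes above (rough figure-readings, not exact
# numbers from the weak-to-strong paper):
weak_acc = 0.60        # weak supervisor's accuracy on the NLP task (fig 2)
agree_if_right = 0.90  # student agrees with weak model when it's right (fig 8b)
agree_if_wrong = 0.70  # student agrees with weak model when it's wrong (fig 8c)

wrong_just_dumb = (1 - agree_if_right) * weak_acc  # disagrees with a correct label
wrong_overfit = agree_if_wrong * (1 - weak_acc)    # copies an incorrect label
total_wrong = wrong_just_dumb + wrong_overfit

print(round(wrong_just_dumb, 2), round(wrong_overfit, 2), round(total_wrong, 2))
# → 0.06 0.28 0.34
```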
(Here “overfitting to weak supervision” is the thing where the weak supervisor is predictably wrong, and the stronger model learns to predict those errors.) So in fact what we’re seeing in the weak-to-strong paper is that the strong model learning the weak supervisor’s errors is already the main bottleneck to better ground-truth performance, in the regime that task and models were in.
So overall, I definitely maintain that the empirical evidence is solidly in favor of Doomimir’s story here. (And, separately, I definitely maintain that abstracts in ML tend to be wildly unreliable and misleading about the actual experimental results.)
So that example SWE bench problem from the post:
… is that a prototypical problem from that benchmark? Because if so, that is a hilariously easy benchmark. Like, something could ace that task and still be coding at less than a CS 101 level.
(Though to be clear, people have repeatedly told me that a surprisingly high fraction of applicants for programming jobs can’t do fizzbuzz, so even a very low level of competence would still put it above many would-be software engineers.)