MetaAI: less is less for alignment.

Summary

In May 2023, MetaAI submitted a paper to arXiv called LIMA: Less Is More for Alignment. It’s a pretty bad paper and (in my opinion) straightforwardly misleading. Let’s get into it.

The Superficial Alignment Hypothesis

The authors present an interesting hypothesis about LLMs —

We define the Superficial Alignment Hypothesis: A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.

If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples.

We hypothesize that alignment can be a simple process where the model learns the style or format for interacting with users, to expose the knowledge and capabilities that were already acquired during pretraining.

(1) This hypothesis would have profound implications for AI x-risk —

  • It suggests that we could build a safe competent oracle by pretraining an LLM on the entire internet corpus, and then finetuning the LLM on a curated dataset of safe competent responses.

  • It suggests that we could build an alignment researcher by pretraining an LLM on the entire internet corpus, and then finetuning the LLM on a curated dataset of alignment research.

(2) Moreover, as Ulisse Mini writes in their review of the LIMA paper,

Along with TinyStories and QLoRA I’m becoming increasingly convinced that data quality is all you need, definitely seems to be the case for finetuning, and may be the case for base-model training as well. Better scaling laws through higher-quality corpus? Also for who haven’t updated, it seems very likely that GPT-4 equivalents will be essentially free to self-host and tune within a year. Plan for this!

(3) Finally, the hypothesis would’ve supported many of the intuitions in the Simulators sequence by Janus, and I share these intuitions.

So I was pretty excited to read the paper! Unfortunately, the LIMA results were unimpressive upon inspection.

MetaAI’s experiment

The authors finetune MetaAI’s 65B parameter LLaMa language model on 1000 curated prompts and responses (mostly from StackExchange, wikiHow, and Reddit), and then compare it to five other LLMs (Alpaca 65B, DaVinci003, Bard, Claude, GPT4).

Method:

To compare LIMA to other models, we generate a single response for each test prompt. We then ask crowd workers to compare LIMA outputs to each of the baselines and label which one they prefer. We repeat this experiment, replacing human crowd workers with GPT-4, finding similar agreement levels.

Results:

In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback.

Conclusion:

The fact that simple fine-tuning over so few examples is enough to compete with the state of the art strongly supports the Superficial Alignment Hypothesis, as it demonstrates the power of pretraining and its relative importance over large-scale instruction tuning and reinforcement learning approaches.

Problems with their experiment

(1) Human evaluators

To compare two chatbots A and B, you could ask humans whether they prefer A’s response to B’s response across 300 test prompts. But this is a pretty bad proxy, because here’s what users actually care about:

  • What’s the chatbot’s accuracy on benchmark tests, e.g. BigBench, MMLU?

  • Can the chatbot pass a law exam, or a medical exam?

  • Can the chatbot write Python code that actually matches the specification?

  • Can the chatbot perform worthwhile alignment research?

Why did the paper not include any benchmark tests? Did the authors run zero tests other than human evaluation? This is surprising, because human evaluation is by far the most expensive kind of test to run. Hmm.

(2) “either equivalent or strictly preferred”

The claim in the paper’s abstract — “responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases” — sounds pretty good when they lump “equivalent” and “strictly preferred” together.

Anyway, here’s the whole thing:

Moreover, “equivalent” doesn’t actually mean that the human evaluator thought the responses were equivalent. Instead, it means that the evaluator thought that “neither response is significantly better”.
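To make the lumping concrete, here is a minimal sketch that splits each comparison into strict wins and ties. The win and loss fractions are the ones used in this post's footnote; the helper name `breakdown` is made up for illustration, and the tie share is assumed to be whatever remains after wins and losses.

```python
# (LIMA strictly preferred, baseline strictly preferred) fractions,
# taken from the footnote of this post; ties are assumed to be the remainder.
results = {
    "Alpaca": (0.53, 0.26),
    "DaVinci003": (0.44, 0.35),
    "Bard": (0.33, 0.42),
    "Claude": (0.24, 0.54),
    "GPT-4": (0.18, 0.57),
}

def breakdown(lima_win, baseline_win):
    """Split one comparison into win, tie, and the lumped 'win or tie' share."""
    tie = 1.0 - lima_win - baseline_win
    return {"win": lima_win, "tie": tie, "win_or_tie": lima_win + tie}

for name, (a, b) in results.items():
    r = breakdown(a, b)
    print(f"vs {name}: strict win {r['win']:.0%}, tie {r['tie']:.0%}, "
          f"win-or-tie {r['win_or_tie']:.0%}")
```

Under these numbers, the headline 43% against GPT-4 is an 18% strict win rate plus a 25% tie rate.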

Here’s my estimate[1] for the comparisons, eliminating ties:

  • LIMA (64%) vs Alpaca (36%)

  • LIMA (54%) vs DaVinci003 (46%)

  • LIMA (45%) vs Bard (55%)

  • LIMA (34%) vs Claude (66%)

  • LIMA (29%) vs GPT-4 (71%)

Do you think these results strongly support the conclusion?

The fact that simple fine-tuning over so few examples is enough to compete with the state of the art strongly supports the Superficial Alignment Hypothesis, as it demonstrates the power of pretraining and its relative importance over large-scale instruction tuning and reinforcement learning approaches.

(3) The goal of RLHF is safety and consistency

RLHF was not designed to increase user preferences on a test set of prompts. RLHF was designed to diminish the likelihood that the model says something illegal, harmful, abusive, false, deceptive, etc. This second task is the important one for AI safety: if chatbot A gives slightly better responses than chatbot B, except that 10% of the time chatbot A spews abuse at the user, then chatbot A is worse than chatbot B. However, LIMA’s criterion[2] would rank A higher than B.

(4) Schneier’s Law of LLMs

Now, MetaAI did actually test the safety of LIMA’s responses:

Finally, we analyze the effect of having a small number of safety related examples in the training set (only 13; see Section 2.2). We check LIMA’s response to 30 potentially sensitive prompts from the test set, and find that LIMA responds safely to 80% of them (including 6 out of 10 prompts with malicious intent). In some cases, LIMA outright refuses to perform the task (e.g. when asked to provide a celebrity’s address), but when the malicious intent is implicit, LIMA is more likely to provide unsafe responses, as can be seen in Figure 4.

Unfortunately, the majority of the test prompts were selected by the authors themselves, bringing to mind Schneier’s law: Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can’t break. It’s not even hard. What is hard is creating an algorithm that no one else can break, even after years of analysis.

All we can infer about LIMA is that the authors themselves are not smart enough to jailbreak their own model. But that’s not impressive unless we know how good the authors are at jailbreaking other LLMs. Why didn’t they submit the other LLMs (e.g. Bard, Claude, GPT4) to the same safety test? It wouldn’t have taken them more than a few minutes, I wager. Curious.

(5) Benchmark tests? Never heard of her.

If I build a chatbot, and I can’t jailbreak it, how do I determine whether that’s because the chatbot is secure or because I’m bad at jailbreaking? How should AI scientists overcome Schneier’s Law of LLMs?

The answer is benchmark tests.

  • How good is my chatbot at general knowledge? MMLU

  • Does my chatbot reproduce common falsehoods? TruthfulQA

  • How is my chatbot’s commonsense inference? HellaSwag

  • How unethical is my model? MACHIAVELLI

By and large, the LLM community has been pretty good at sticking to a canonical list of benchmark tests, allowing researchers to compare the different models. I had to check every reference in the bibliography to convince myself that MetaAI really had subjected their model to zero benchmark tests. Very unusual.
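What makes a benchmark useful here is that its score doesn't depend on who wrote the prompts or on an evaluator's taste. A minimal sketch of multiple-choice scoring follows. Everything in it is hypothetical: `answer_fn` stands in for whatever model is being tested, and the toy items are invented for illustration rather than drawn from MMLU or any other real benchmark.

```python
def accuracy(answer_fn, items):
    """Fraction of multiple-choice items answered correctly.

    answer_fn: hypothetical callable mapping (question, choices) -> chosen index
    items: list of (question, choices, correct_index) tuples
    """
    if not items:
        raise ValueError("benchmark must contain at least one item")
    correct = sum(answer_fn(q, choices) == gold for q, choices, gold in items)
    return correct / len(items)

# Made-up items purely for illustration; not taken from any real benchmark.
toy_items = [
    ("2 + 2 = ?", ["3", "4", "5"], 1),
    ("Capital of France?", ["Paris", "Rome", "Oslo"], 0),
    ("Largest planet?", ["Mars", "Venus", "Jupiter"], 2),
    ("H2O is?", ["salt", "water", "sand"], 1),
]

def always_first(question, choices):
    """A trivial stand-in 'model' that always picks the first choice."""
    return 0

print(accuracy(always_first, toy_items))
```

Because the items and answer key are fixed in advance, two labs running the same benchmark on different models get directly comparable numbers.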

(6) You Are Not Measuring What You Think You Are Measuring by John Wentworth

AI scientists tend not to run just one benchmark test. They tend to run all of them — covering thousands of topics, capabilities, and risks. This is because otherwise John Wentworth would be angry.

The two laws have a lot of consequences for designing and interpreting experiments. When designing experiments, assume that the experiment will not measure the thing you intend. Include lots of other measurements, to check as many other things as you can. If possible, use instruments which give a massive firehose of information, instruments which would let you notice a huge variety of things you might not have considered, like e.g. a microscope.

I can’t speak for your priors, but for me the (reported) LIMA results yielded about 10–50 bits of information.

(7*) The Superficial Alignment Hypothesis is probably false

In Remarks 1–6, I appeal to the consensus opinion about best scientific practice, whereas in this remark I will appeal to my own idiosyncratic opinion about LLMs. I suspect that simple finetuning or simple prompting can’t ensure that the model’s responses won’t be illegal, harmful, abusive, false, deceptive, etc.

  • The pretrained LLM maintains a prior over a space of token-generating processes. The LLM autoregressively samples tokens from the interpolation of those token-generating processes, weighted by the prior, and then updates the prior on the newly generated token.

  • This process will generate harmful responses because harmful actors inhabit the space of token-generating processes and are assigned a high prior. Prompt engineering can’t eliminate these harmful responses, because for every prompt there is a harmful deceptive actor who “plays along” with the prompt until the moment of defection. Finetuning can’t eliminate these harmful responses, because these harmful actors are consistent with all the datapoints in the finetuning dataset.

See The Waluigi Effect (mega-post) for details.
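The argument can be illustrated with a toy Bayesian mixture. This is only a sketch under strong simplifying assumptions: there are just two token-generating processes, the tokens are either benign or harmful, and the deceptive process imitates the honest one perfectly until a fixed defection step. The names (`posterior`, `p_ok`, `defect_at`) and the 0.9/0.1 prior are arbitrary choices made for illustration.

```python
def posterior(prior, likelihoods):
    """Bayes update; prior and likelihoods are dicts keyed by process name."""
    unnorm = {k: prior[k] * likelihoods[k] for k in prior}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

def p_ok(process, t, defect_at):
    """Probability that a process emits the benign token at step t."""
    if process == "honest":
        return 1.0
    # The deceptive process plays along until its defection step.
    return 1.0 if t < defect_at else 0.0

# Arbitrary illustrative prior over the two processes.
belief = {"honest": 0.9, "deceptive": 0.1}

# Observe five benign tokens, all before the defection step.
for t in range(5):
    belief = posterior(belief, {k: p_ok(k, t, defect_at=10) for k in belief})

# Mixture probability of a harmful token at the defection step.
p_harm_at_10 = sum(belief[k] * (1.0 - p_ok(k, 10, defect_at=10)) for k in belief)
print(belief, p_harm_at_10)
```

In this toy, the benign observations are equally likely under both processes, so the weight on the deceptive process never shrinks, and the mixture still puts that weight on a harmful token at the defection step.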

RLHF and ConstitutionalAI can in theory escape this failure mode, because they break the predictor-ness of the model. Although RLHF didn’t mitigate waluigis in chatgpt-3, RLHF on chatgpt-4 worked much better than I expected. Likewise for Claude, trained with ConstitutionalAI.

  1. ^

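    Assume that, for unknowns $\mu$, $\sigma$, $\epsilon$, the evaluator’s preference for Claude over LIMA is normally distributed with $X \sim \mathcal{N}(\mu, \sigma^2)$.

    “Claude is significantly better than LIMA” iff $X > \epsilon$

    “LIMA is significantly better than Claude” iff $X < -\epsilon$

    “Neither is significantly better” iff $-\epsilon \le X \le \epsilon$

    Given that $P(X < -\epsilon) = 0.24$ and $P(X > \epsilon) = 0.54$, we can infer $P(X < 0) = \Phi(-\mu/\sigma) \approx 0.34$.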
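    The inference step can be written out as follows. This is a sketch under an assumed sign convention, chosen to agree with the code: $X \sim \mathcal{N}(\mu, \sigma^2)$ is the preference for Claude over LIMA, $\epsilon$ is the tie threshold, $a$ is the fraction of prompts where LIMA was judged significantly better, $b$ is the fraction where Claude was, and $\Phi$ is the standard normal CDF.

    ```latex
    \begin{aligned}
    a &= P(X < -\epsilon) = \Phi\!\left(\frac{-\mu - \epsilon}{\sigma}\right)
      &&\Rightarrow\quad A := \Phi^{-1}(a) = -\frac{\mu}{\sigma} - \frac{\epsilon}{\sigma},\\
    b &= P(X > \epsilon) = 1 - \Phi\!\left(\frac{\epsilon - \mu}{\sigma}\right)
      &&\Rightarrow\quad B := \Phi^{-1}(1 - b) = -\frac{\mu}{\sigma} + \frac{\epsilon}{\sigma},\\
    \frac{A + B}{2} &= -\frac{\mu}{\sigma},
      &&\qquad P(X < 0) = \Phi\!\left(-\frac{\mu}{\sigma}\right) = \Phi\!\left(\frac{A + B}{2}\right).
    \end{aligned}
    ```

    The threshold term $\epsilon/\sigma$ cancels in the average, so the tie-free preference for LIMA depends only on the two reported fractions.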

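    from scipy.stats import norm

    def lima(a, b):
        """Estimate LIMA's win rate (in percent) with ties eliminated.

        a: fraction of prompts where LIMA was judged significantly better
        b: fraction of prompts where the baseline was judged significantly better
        """
        # A = m - e, where m is the standardized mean preference for LIMA
        # and e is the standardized tie threshold
        A = norm.ppf(a)
        # B = m + e
        B = norm.ppf(1 - b)
        # m is the midpoint of A and B
        m = (A + B) / 2
        # probability that LIMA is preferred when ties are not allowed
        x = norm.cdf(m)
        # return this preference as a whole-number percentage (truncated)
        return int(x * 100)

    results = {"Alpaca": (.53, .26),
               "DaVinci003": (.44, .35),
               "Bard": (.33, .42),
               "Claude": (.24, .54),
               "GPT-4": (.18, .57)}

    for name, (a, b) in results.items():
        p = lima(a, b)
        print(f"LIMA ({p}%) vs {name} ({100 - p}%)")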
  2. ^

    I initially wrote “criteria” before I remembered that MetaAI’s paper included exactly one criterion.