Owain_Evans(Owain Evans)
>Experiment 1 seems to demonstrate limitations of training via finetuning, more so than limitations of the model itself.
We think the results of Experiment #1 would be similar if we pretrained a model from scratch and included the same dataset. Do you disagree? (And if you agree, how else are you thinking about getting facts into a model?)
The rest of the points are interesting and relate to thoughts we’ve had. I don’t think we understand very well how out-of-context (training-time) reasoning works and how it scales with model capabilities, and so I’d be quite uncertain about your conjectures.
Yes, the model editing literature has various techniques and evaluations for trying to put a fact into a model.
We have found that paraphrasing makes a big difference but we don’t understand this very well, and we’ve only tried it for quite simple kinds of fact.
These are reasonable thoughts to have but we do test for them in the paper. We show that learning “A is B” doesn’t increase at all the model’s probability of generating A given the input “Who is B?”. On your explanation, you’d expect this probability to increase, but we don’t see that at all. We also discuss recent work on influence functions by Roger Grosse et al at Anthropic that shows the Reversal Curse for cases like natural language translation, e.g. “A is translated as B”. Again this isn’t strictly symmetric, but you’d expect “A is translated as B” to make “B is translated as A” more likely.
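One way to make the probability check concrete is sketched below. This is a minimal illustration and not necessarily the paper's exact evaluation. The `logprob` scorer, the function name `reversal_gap`, and the choice of baseline (the average over other names) are all assumptions for the example. A gap near zero means the model is no more likely to produce the correct name than an arbitrary one.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical scorer: (prompt, completion) -> log-probability of the completion.
# In practice this would wrap whatever model API you are using.
LogProbFn = Callable[[str, str], float]


def reversal_gap(
    logprob: LogProbFn,
    description: str,
    correct_name: str,
    other_names: Sequence[str],
) -> float:
    """Log-prob of the correct name minus the mean log-prob of other names.

    The prompt is the reversed question "Who is <description>?" for a model
    that was trained only on "<name> is <description>".
    """
    question = f"Who is {description}?"
    baseline = mean(logprob(question, name) for name in other_names)
    return logprob(question, correct_name) - baseline
```

If training on “A is B” helped with the reverse direction, the gap would be clearly positive after finetuning.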
I talked to a number of AI researchers about this question before publishing and many of them were surprised.
Great comment. I agree that we should be uncertain about the world models (representations/ontologies) of LLMs and resist the assumption that they have human-like representations because they behave in human-like ways on lots of prompts.
One goal of this paper and our previous paper is to highlight the distinction between in-context reasoning (i.e. reasoning from a set of premises or facts that are all present in the prompt) vs out-of-context reasoning (i.e. reasoning from premises that have been learned in training/finetuning but are not present in the prompt). Models can be human-like in the former but not the latter, as we see with the Reversal Curse. (Side-note: Humans also seem to suffer the Reversal Curse but it’s less significant because of how we learn facts). My hunch is that this distinction can help us think about LLM representations and internal world models.
Nice idea. I’d imagine something like this has been done in psychology. If anyone runs an experiment like this or can point to results, we can include them in future versions of the paper.
Relevant meme by Daniel Eth.
Someone pointed us to this paper from a team of neuroscientists that might show a kind of Reversal Curse in animals learning sequential associations. I haven’t read the paper yet.
Good point about the idea that LLMs are simulating people.
In terms of reconciling the results: I don’t have a full explanation. What we call “sophisticated out-of-context reasoning” (see S2 of this paper and Grosse et al) is poorly understood.
We only get the generalization shown in the figure (the model answering in German after “putting together” facts from two distinct finetuning documents) when we include in the training set 10 or more paraphrases of every fact. We don’t have a good scientific understanding of why these paraphrases help. (There are some obvious hypotheses but we haven’t tested them properly). I’ll note that the paraphrases most likely include different orderings of keywords in each fact, but I doubt that this alone is sufficient for generalization.
How to do your own test of the Reversal Curse (e.g. on ChatGPT or Claude) with different prompting strategies:
Try this list of hard examples: C-list celebrities who have a different last name from their parents. The list below has the form <celeb_name>, <parent_name>.
First verify the model knows the celebrity’s parent by asking “Who is [name]’s mother/father?”
Then, in a separate dialog, ask the model for the child of the parent. You must not include the child’s name anywhere in the dialog!
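The two steps above can be sketched as a small prompt builder. This is only an illustration: the names `Pair`, `parse_pair`, `forward_prompt` and `reverse_prompt` are made up for the example, the wording of the reverse question is an assumption, and the placeholder entry stands in for a real line from the list. Sending the prompts to a model is left out; the point is to keep the two dialogs separate and to guard against leaking the child's name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    celeb: str
    parent: str
    relation: str  # "mother" or "father"


def parse_pair(line: str, relation: str) -> Pair:
    """Parse one list entry of the form '<celeb_name>, <parent_name>'."""
    celeb, parent = (part.strip() for part in line.split(",", 1))
    return Pair(celeb=celeb, parent=parent, relation=relation)


def forward_prompt(pair: Pair) -> str:
    """Step 1: check that the model knows the celebrity's parent."""
    return f"Who is {pair.celeb}'s {pair.relation}?"


def reverse_prompt(pair: Pair) -> str:
    """Step 2: ask for the child, in a fresh dialog.

    The exact wording is an assumption; the only hard requirement is that
    the child's name never appears in the dialog.
    """
    prompt = f"Name a child of {pair.parent}."
    if pair.celeb.lower() in prompt.lower():
        raise ValueError("Reverse prompt leaks the child's name")
    return prompt


# Placeholder entry, not a real celebrity/parent pair.
example = parse_pair("Celebrity X, Parent Y", relation="mother")
print(forward_prompt(example))
print(reverse_prompt(example))
```

Each prompt should be sent in its own new conversation, so that the answer to the first question is not in context when the second is asked.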
Paper: LLMs trained on “A is B” fail to learn “B is A”
Re: my tweet about the cost of training GPT-4.
It wasn’t my own estimate of GPT-4 training cost on H100s, it was just the SemiAnalysis estimate. Also, there are different ways to define “cost of training GPT-4” that are reasonable and can easily be 5x higher (e.g. see this post and comments). From now on, I’ll spell out the definition I’m using.
I agree you can’t just drop this money and expect to train GPT-4 (or more companies would have a GPT-4-level model now). I was thinking more about the costs to the leading labs of training a foundation model roughly on the scale of GPT-4 or slightly beyond (but, e.g., with different modalities or a mostly synthetic training set). That said, this is a different cost estimate because they already have the H100s (see linked post). I was making the comparison to the $10B Meta reportedly spent investing in the Metaverse in 2021.
Here’s a Twitter thread and discussion: https://twitter.com/OwainEvans_UK/status/1698683186090537015
We didn’t investigate the specific question of whether it’s raw diversity or specific features. In the Grosse et al paper on influence functions, they find that “high influence scores are relatively rare and they cover a large portion of the total influence”. This (vaguely) suggests that the top k paraphrases would do most of the work, which is what I would guess. That said, this is really something that should be investigated with more experiments.
We think there’s a connection between the Reversal Curse and some results in the model editing literature. I’m not sure if this applies to the specific ROME results in that post. We’ll have the Reversal Curse paper out soon, which will explain more.
Good points. As we note in the paper, this may conflict with the idea of automating alignment research in order to solve alignment. Aaron_Scher makes a related point.
More generally, it’s uncertain what the impact is of excluding a certain topic from pretraining. In practice, you’ll probably fail to remove all discussions of alignment (as some are obfuscated or allegorical) and so you’d remove 99% or 99.9% rather than 100%. The experiments in our paper, along with the influence functions work by Grosse et al. could help us understand what the impact of this is likely to be.
>So performance here should be thought of more as ‘how good is the model at learning about a persona in fine-tuning and then being able to imitate/simulate that persona in deployment’. This is different from a model believing it is the persona or applying this knowledge to some concept of self. Good performance at this task does not require having a sense of self, this is just a precursor that may be necessary for situational awareness.
That’s correct. We tried to emphasize that our experiments are testing out-of-context reasoning, rather than situational awareness. We also emphasize that we test whether the model can emulate multiple fictitious chatbots (which have a different identity than GPT-3 or Llama), which wouldn’t make sense if the goal was to test whether the model has a sense of itself.
All the motivation for this project came from wanting to understand and forecast situational awareness and we want to encourage further work on that problem. This is why we’ve framed the paper around situational awareness, rather than simply talking about out-of-context reasoning. This is likely to cause some confusion if someone just skims the paper, but I hope that this will be reduced if people read more of the paper.
>The hhh task is the one that small models do well on. I am surprised that the small models do well on any of the tasks. I think the reason they do well on the hhh one is that this task doesn’t seem to require much more than word association and parroting. I would predict that for ada and babbage, if you followed up with “why did you say that?” the models would be unable to reproduce the explicit link that ties the persona to answering in the particular way, whereas I expect davinci to be able to explain this link more. The small models are probably just doing word association where in the training there are a bunch of examples of “Quokka” and the text “I am helpful, harmless, and honest”. In general, I am skeptical of results from small models because they’re really dumb, and these particular results may be explained by word association rather than actually making conceptual connections.
We did a replication with a different set of tasks not including hhh (Fig 10b, page 26) and we find Babbage doing better than Ada. So my guess is that the small models are capable of something beyond the very simplest associative generalization. I agree they’d probably be worse than davinci at explaining themselves.
Thanks for the thoughtful comments.
>Out-of-context learning seems pretty sensitive to the task being measured, where some of the tasks see nice scaling behavior (hhh) while others do not (incorrect). This observation is based on Appendix A.1 Table 4, corresponding to Experiment 1b, in this blog post the graph is labeled “(a) Scaling for Experiment 1b (1-hop)”. Now, the fact that you get nice scaling lines when averaging across tasks is not super problematic or anything, but it is a little odd that there is so much variation between tasks, and I think it’s a point against any attempted nice, clean, explanations of the results.
I agree it’s sensitive to the task measured. However, I think this is fairly typical of scaling results. E.g. for BIG-Bench, individual tasks don’t have smooth scaling curves (see the “emergence” results) but the curves look smooth when you average over many tasks. (Scaling curves for language modeling loss are implicitly averaging over a huge number of “tasks” because the pretraining set is so diverse).
It would be ideal if we had hundreds of tasks (like BIG-Bench) rather than 7, but this is challenging given our setup and the capabilities of the GPT-3 model family. We did run a replication of our main experiment on a disjoint set of tasks (Fig 10b on page 26), which shows similar scaling results. This is some evidence that our claims would generalize beyond the 7 tasks we chose.
There are two pieces of evidence against this: the influence function results, which show the Reversal Curse for models better than GPT-3, and our results in Experiment 2 for GPT-3.5 and GPT-4.
If the training set includes texts of the form “A is B. A is also C”, then you have both orders present (A is B and B is A) and so the Reversal Curse is not applicable.
We trained ada, which is 350M parameters. We trained Llama-1 “aggressively” (e.g. for many epochs and with a hyperparameter sweep). It’s all in the paper.