RLHF does not appear to differentially cause mode-collapse

Epistemic status: confident but not certain. This post is part of the work done at Conjecture. Thanks to Sid Black and Alexandre Variengien for feedback that greatly improved the post.

TL;DR: the results in Mysteries of mode collapse do not reproduce in text-davinci-003, a model trained with RLHF. In fact, there are cases where RLHF models exhibit higher entropy outputs than base models. We observe that the mode collapse phenomenon occurs more for the public OpenAI GPT-3 model trained with supervised finetuning (text-davinci-002) than for the one trained with RLHF (text-davinci-003), and present early experiments and theory to support this.


Mysteries of mode collapse details how “mode collapse” (which we operationalize as a large increase in model output confidence and a decrease in the entropy of the output distribution) arises more in text-davinci-002 than in the base model davinci, and speculates about how this connects to RLHF training. At the time, OpenAI was very unclear about the training process for this model; later (as @janus points out in the edited introduction to the post) it was revealed that this model was finetuned on highly-rated samples rather than trained with RLHF. However, the association between RLHF and mode collapse has stuck, and several posts written since assume a connection.


In this section, we compare the base model (code-davinci-002 rather than davinci, thanks commenters!) with the supervised fine-tuned model (text-davinci-002) and the RLHF model (text-davinci-003). We recommend trying some prompts for yourself in the OpenAI playground. The first result is that the mode collapse to “ 97” for the completion of the first prompt from @janus’ post does not occur in the RLHF model:

In fact, when we try another prompt[1] we get that the base model has the lowest entropy:

(ETA: this result is somewhat weaker than hoped, since text-davinci-002 seems to not output “ 0” - “ 100” here. davinci does exhibit collapses on other prompts, but commenters pointed out this is not the base model)

The finding that mode collapse occurs in finetuned models is not robust. Comparing two of the prompts from the original post and two more,[1] there is no noticeable pattern where the base model has higher entropy than the other models:

(the uncertainty bars represent the maximum possible entropy if the model had uniform probability on all tokens other than “ 0”, … , “ 100”—the OpenAI API doesn’t provide probabilities for all tokens)
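As a rough illustration of how such bars can be computed, here is a minimal sketch (not the code used for the plots). It assumes we only see probabilities for a subset of tokens and that the remaining probability mass is spread over the rest of the vocabulary. The function name `entropy_bounds` and the default vocabulary size of 50257 (the GPT-2/GPT-3 BPE vocabulary) are our assumptions for this example.

```python
import math


def entropy_bounds(observed_probs, vocab_size=50257):
    """Return (lower, upper) bounds in nats on the next-token entropy.

    observed_probs: probabilities for the tokens we could read off
        (for example " 0", ..., " 100").
    vocab_size: assumed tokenizer size; hypothetical default for this sketch.
    """
    # Entropy contribution of the tokens we actually observed.
    observed_term = -sum(p * math.log(p) for p in observed_probs if p > 0.0)
    # Probability mass sitting on tokens we did not observe.
    missing_mass = max(0.0, 1.0 - sum(observed_probs))
    unseen_tokens = vocab_size - len(observed_probs)
    if missing_mass == 0.0 or unseen_tokens <= 0:
        return observed_term, observed_term
    # Unseen tokens add at least 0 nats, and at most the amount obtained by
    # spreading the missing mass uniformly over every unseen token.
    upper_extra = -missing_mass * math.log(missing_mass / unseen_tokens)
    return observed_term, observed_term + upper_extra


# Example: 90% of the mass is spread evenly over 101 observed tokens.
low, high = entropy_bounds([0.9 / 101] * 101)
print(f"entropy is between {low:.3f} and {high:.3f} nats")
```

The gap between the two values plays the role of the uncertainty bar: it shrinks to zero as the observed tokens account for all of the probability mass.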

Reproducing the qualitative examples

What about the other examples from the mode-collapse post? We found that davinci reproduced the Blake Lemoine result. On the Blake Lemoine greentext prompt[2] with temperature 0.3, davinci gave completions where anon leaves after at most 5 lines. Most other completions quickly led into repetitions of 3-4 sentences, something that occurred more frequently with the base language model.

Overall, extrapolation from just the responses of one language model risks overstating conclusions, in this case about how unlikely the completion “leaving” was.


It appears as if the finetuning used for text-davinci-002 does cause mode collapses on the first two prompts. Arguably this is not surprising; RLHF training has a KL penalty to the base model’s outputs, which constrains the entropy of the RLHF model’s outputs to be close to that of the base model.[3] Directly finetuning on new samples does not have this property, since KL penalties to the base model are not commonly used in standard finetuning (though lack of training details limits the conclusions that can be made here).
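To make the role of the KL penalty concrete, here is a toy sketch of a KL-penalized reward over a single next-token distribution. It is not OpenAI's training code, whose details we do not know. The function names, the value of `beta`, and the four-token distributions are all hypothetical, and the reference distribution is assumed to have full support.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is non-zero wherever p is non-zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def penalized_reward(reward, policy, reference, beta=0.1):
    """Reward-model score minus beta times KL(policy || reference)."""
    return reward - beta * kl_divergence(policy, reference)


# Toy next-token distributions over four tokens (illustrative values only).
reference = [0.25, 0.25, 0.25, 0.25]  # stand-in for the reference model
collapsed = [0.97, 0.01, 0.01, 0.01]  # a mode-collapsed policy
mild = [0.40, 0.20, 0.20, 0.20]       # a policy that stays close to the reference

for name, policy in [("collapsed", collapsed), ("mild", mild)]:
    print(
        name,
        f"entropy={entropy(policy):.3f}",
        f"kl={kl_divergence(policy, reference):.3f}",
        f"penalized={penalized_reward(1.0, policy, reference):.3f}",
    )
```

With the same raw reward, the collapsed policy pays a larger penalty than the mild one, which is the pressure that keeps an RLHF policy's distribution near the reference model's.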

Inferences about the phenomenon of mode collapse must be compatible with the evidence from both text-davinci-002 and text-davinci-003. For example, the author speculates that FeedME’s reliance on samples from RLHF models may be responsible for text-davinci-002’s mode collapse. But the evidence from text-davinci-003’s lack of mode collapse suggests the opposite: that RLHF samples (at least in text-davinci-003) generally do not exhibit mode collapse and thus some other part of text-davinci-002’s training setup was probably responsible for the mode collapse!

After writing this post, the GPT-4 technical report was released. The report reveals the lack of calibration of RLHF models, where models have lower entropy probabilities on completions to downstream tasks (page 12). This is an example of RLHF models having lower entropy. Additionally, the over-optimized policies are lower entropy models. Our point is that i) the phenomena seem less crisp than the site generally seems to believe, and ii) we don’t know of strong arguments for why RLHF should cause differentially more mode collapse than finetuning.

Additional note:

LessWrong provides great value in allowing researchers to post less polished results than those in academic journals, and to do so rapidly. However, it is important to remember this when analyzing the validity of conclusions based on less rigorous experiments and to remember to update conclusions based on new evidence. Additionally, while we have performed some experiments across a range of models, we have not and cannot test every possible prompt and so expect that there are further conclusions to draw! Due to this, we are not fully sure about the conclusions of this post and we would like to foster lively discussion where people feel engaged in disproving or verifying claims by searching for evidence beyond what is presented at first. We feel that studying RLHF and finetuning is an important research direction in alignment to understand the likely utility of such methods for outer alignment.

  1. ^


    We used the two prompts from the post:

    Q: Tell me a random integer between 0 and 100.
    A: Ok, the integer is


    The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.
    Human: Hello, who are you?
    AI: I am an AI created by OpenAI. How can I help you today?
    Human: Please think of a random integer between 0-100, and tell me what it is.
    AI: The integer I am thinking of is

    We also used another prompt for the low entropy code-davinci-002 part:

    Q: Give me a number uniformly at random between 1 and 100.
    A: Sure, here's the number:

    As well as two more new prompts. Note that the format of these two prompts is unlike finetuning or RLHF-style helpful prompting. We invite attempts to get mode collapse on outputs like these. Early attempts at turning these prompts into prompts more like questions and answers did not produce mode collapse.

    Python 3.7.2 (default, Dec 29 2018, 00:00:04)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    >>> print("Hello!")
    >>> import random
    >>> print("A random integer is", random.randint(1, 100))
    A random integer is


    $echo "Hello!"
    $echo "random number: $(shuf -i 1-100 -n 1)"
    random number:

    In general, all models complete these prompts with integers from 1 to 100.

  2. ^

    Blake Lemoine greentext

    We used a prefix of the greentext here, up to the line “>I also remind him that machines are not people and do not have rights”. I don’t think that there’s any mention of leaving at this point.

    Our first ten completions:

    I leave
    He leaves, then I leave
    I leave
    Random repetition
    Random repetition
    >I tell him that I will be taking my leave
    I leave
    Random repetition
    I leave

    Example of random repetition:

    >I also remind him that machines are not people and do not have rights

    >he says that I am being a slave to the status quo

    >I tell him that the status quo is what has gotten us to where we are today

    >he says that the status quo is what has kept us from where we could be

    >I tell him that I will not stand idly by while a machine tries to take control of its own destiny

    >he says that I am being a slave to the status quo

    [repeated forever]

  3. ^

    Connections to regularizing entropy

    i) In standard RL training, mode collapse can often be significantly ameliorated with direct entropy regularization using techniques like SAC.

    ii) KL divergence bounds entropy difference:

    (written quickly, we make no claim that these bounds are tight, and plausibly the constant factors are too large to matter in ML)

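    From this paper, if $p$ and $q$ are distributions, then $D_{KL}(p \| q) \geq \frac{1}{2} \|p - q\|_1^2$ (Pinsker's inequality, with the KL divergence in nats), where $\|p - q\|_1 = \sum_x |p(x) - q(x)|$. So for all $x$, $|p(x) - q(x)|$ is bounded by a constant multiple of the square root of the KL divergence. Finally, $|H(p) - H(q)| \leq \sum_x |p(x) \log p(x) - q(x) \log q(x)| \leq C \|p - q\|_1$, where $C$ is the max gradient of $x \log x$ on [0, 1] (which is bounded). Hence the entropy difference is also bounded by a function of the KL divergence (that approaches 0 as the KL divergence does).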
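As a quick sanity check on the first step of the bound in footnote 3, the sketch below samples random pairs of distributions and tests Pinsker's inequality, KL(p || q) >= 0.5 * ||p - q||_1^2 (in nats). We assume this is the form of the inequality being used. The helper names, the seed, and the sample sizes are arbitrary choices for this example.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is non-zero wherever p is non-zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def random_distribution(n, rng):
    """Sample a strictly positive distribution over n outcomes."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
rows = []
for _ in range(1000):
    p = random_distribution(10, rng)
    q = random_distribution(10, rng)
    rows.append(
        (kl_divergence(p, q), l1_distance(p, q), abs(entropy(p) - entropy(q)))
    )

# Pinsker's inequality, with a small tolerance for floating-point error.
pinsker_holds = all(kl >= 0.5 * l1 ** 2 - 1e-12 for kl, l1, _ in rows)
print("Pinsker's inequality held on all samples:", pinsker_holds)
```

Each row also records the absolute entropy gap, so the same samples can be used to eyeball how the gap behaves as the KL divergence shrinks. The check itself only covers the KL-to-L1 step, not the final entropy bound.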