Update to Mysteries of mode collapse: text-davinci-002 not RLHF

I (and many others) did not realize this before, but: text-davinci-002 and text-davinci-001, the InstructGPT models on the OpenAI API, were not trained with RLHF (reinforcement learning from human feedback) as described in the InstructGPT paper, but with a “similar but slightly different”[1] method that uses the same human feedback data. Apparently, this other method is not technically RLHF.

Since this update has potentially nontrivial implications for interpreting the phenomena exhibited by text-davinci-002 described in Mysteries of mode collapse (formerly titled “Mysteries of mode collapse due to RLHF”), I’m making this separate post for a signal boost.

I have not corrected the original text of “Mysteries of mode collapse due to RLHF”, but I’ve added a section at the beginning with further details on this update, copied here:

I have received evidence from multiple credible sources that text-davinci-002 was not trained with RLHF.

The rest of this post has not been corrected to reflect this update. Not much besides the title (formerly “Mysteries of mode collapse due to RLHF”) is affected: just mentally substitute “mystery method” every time “RLHF” is invoked as the training method of text-davinci-002. The observations of its behavior otherwise stand alone.

This is kind of fascinating from an epistemological standpoint. I was quite surprised to learn that text-davinci-002 was probably not trained with RLHF. I don’t remember exactly how “text-davinci-002 is RLHF” got elevated to an unquestioned assumption in my mind. I might have mistaken not being contradicted by people who I assumed were in the know for confirmation. I certainly did not expect to talk for months to dozens of people about odd behaviors I’ve observed in a well-known model “due to RLHF” without being contradicted in a world where the model in question wasn’t trained with RLHF, but that’s what happened.[2] It wasn’t just me either: the assumption that text-davinci-002(/text-davinci-001) is InstructGPT is RLHF seems ambient (e.g. search “text-davinci-002 rlhf” on Twitter, this LW post, this article, and many others). I contributed to perpetuating this misinformation cascade, and for that I apologize.

text-davinci-002’s behaviors described in this post also contributed to my confidence, because RLHF seemed to be a likely and potentially satisfying explanation. Its apparently unsubstantiated confidence in very specific outcomes seems antithetical to the outer objective of self-supervised learning, which is optimized by epistemic calibration, meaning the model’s entropy should be as high as possible while fitting the data. In contrast, as several comments have pointed out, it makes sense that RL kills entropy. The presence of “attractors” made me additionally suspect that optimization from non-myopic outcome-supervision was formative to text-davinci-002’s psyche.
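To make the calibration point concrete, here is a minimal sketch with toy numbers. Everything in it is an illustrative assumption (the four-outcome distributions and the helper names are made up, and nothing here is measured from text-davinci-002). It only shows that, when the data really are spread evenly over several continuations, the log-loss objective of self-supervised learning penalizes a near-deterministic prediction and is minimized by the prediction that matches the data's own entropy.

```python
import math

def cross_entropy(p, q):
    """Expected log loss (in nats) of predicting q when outcomes follow p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy (in nats) of distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical data distribution: four equally likely continuations.
true_dist = [0.25, 0.25, 0.25, 0.25]

# A calibrated prediction matches the data distribution.
calibrated = [0.25, 0.25, 0.25, 0.25]

# A "collapsed" prediction puts almost all mass on a single continuation.
collapsed = [0.97, 0.01, 0.01, 0.01]

loss_calibrated = cross_entropy(true_dist, calibrated)
loss_collapsed = cross_entropy(true_dist, collapsed)

print(f"entropy  calibrated={entropy(calibrated):.3f}  collapsed={entropy(collapsed):.3f}")
print(f"log loss calibrated={loss_calibrated:.3f}  collapsed={loss_collapsed:.3f}")
```

The collapsed prediction has much lower entropy and much higher expected log loss, which is why unwarranted confidence looks out of place for a model trained only on next-token prediction. A reward-maximizing objective has no such penalty for concentrating mass on one high-reward output.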

Mode collapse and attractors do seem to also be caused by RLHF (see Dumbass policy pls halp and Inescapable wedding parties). So the update is that some other training method also gives rise to these phenomena, as they are manifested by text-davinci-002.

Whether and how speculations concerning the causes of mode collapse/attractors should be affected depends on how text-davinci-002’s training method differs from RLHF.

What is known about text-davinci-002’s training method

Publicly available information suggests that the mystery method may not be so different from RLHF. Just today I discovered this sidenote in OpenAI’s blog post Aligning Language Models to Follow Instructions:

The InstructGPT models deployed in the API are updated versions trained using the same human feedback data. They use a similar but slightly different training method that we will describe in a forthcoming publication.

AFAIK, this is all that OpenAI has published about the RLHF/mystery method diff. It says that the InstructGPT models (text-davinci-001 and text-davinci-002) were trained using the same human feedback data as the method described in OpenAI’s RLHF paper.[3] But this “similar but slightly different” method is apparently sufficiently different to not qualify as RLHF!

Pending further revelations, I suppose the lesson here is that I should have sustained more entropy in my belief state given the partial information I had. But what a demanding thing to ask! So much easier to promote an attractive hypothesis to the status of decisive fact and collapse the remainder than to hold a superposition in the mind.

  1. ^

    Sidenote on OpenAI’s blog post, Aligning Language Models to Follow Instructions

  2. ^

    the lack of epistemic vigilantes attacking an unsubstantiated assumption in the very title of this post on LessWrong is truly unbelievable!

  3. ^

    which seems to confirm my suspicion about outcome-supervision