A library for safety research in conditioning on RLHF tasks

There are a couple of discussions about conditioning / decision transformer training. “Conditioning” here means placing your reward model’s reward in the prompt. We then train the language model to produce a completion that matches the specified reward.

See “Safety considerations for online generative modeling”, “Soft optimization makes the value target bigger”, and “RLHF bad, conditioning good”.

The tldr is that training models this way could have safety benefits.

I’ve created a library so that we can extend a pre-trained LLM (GPT-2, GPT-J) to work with conditioning on scalar rewards. This saves researchers time, since they don’t need to modify attention masks, positions, and labels themselves. For example, researchers can retrain GPT-2 to replicate OpenAI’s summarization RLHF paper, but relying purely on conditioning. I created it because I couldn’t find an existing library that did so.
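To give a sense of the bookkeeping involved, here is a minimal sketch in plain Python of the three things that need adjusting when one extra reward slot is prepended to a tokenized sequence. This is not the library's actual code: the helper name `add_reward_slot`, the choice to put the reward at position 0, and the decision to exclude the prompt from the loss are all assumptions made for illustration. The only fixed convention it relies on is `-100`, the label value that PyTorch's cross-entropy loss ignores by default.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def add_reward_slot(input_ids: List[int], prompt_len: int) -> Dict[str, List[int]]:
    """Hypothetical helper: describe a sequence with one reward slot prepended.

    The reward slot carries no token id (its embedding would come from the
    scalar reward), so only the mask, positions, and labels are returned.
    """
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must be between 0 and len(input_ids)")

    total_len = len(input_ids) + 1  # +1 for the reward slot

    # Every slot, including the reward, is real (non-padding) input.
    attention_mask = [1] * total_len

    # Assumption: the reward takes position 0 and the text is shifted by one.
    position_ids = list(range(total_len))

    # No loss on the reward slot or the prompt; only the completion is a target.
    labels = [IGNORE_INDEX] * (1 + prompt_len) + input_ids[prompt_len:]

    return {
        "attention_mask": attention_mask,
        "position_ids": position_ids,
        "labels": labels,
    }


# Three prompt tokens followed by two completion tokens (made-up ids).
example = add_reward_slot(input_ids=[11, 12, 13, 21, 22], prompt_len=3)
print(example["labels"])
```

Getting any of these three off by one tends to fail silently (the model still trains, just on the wrong targets), which is the main reason it seems worth handling once in a library.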

Note: an easier way of conditioning would be to use discrete tokens. “Pretraining Language Models with Human Preferences” implements conditioning via discrete <|good|> and <|bad|> tokens.
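As a rough illustration of the discrete approach, the sketch below prefixes each training text with a control token chosen by thresholding its reward. The function name `tag_with_control_token` and the threshold of 0.5 are hypothetical; this is not the paper's code, only the general idea.

```python
GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"


def tag_with_control_token(text: str, reward: float, threshold: float = 0.5) -> str:
    """Prefix a training text with a control token chosen by thresholding its reward."""
    token = GOOD_TOKEN if reward >= threshold else BAD_TOKEN
    return token + text


print(tag_with_control_token("The summary is accurate.", reward=0.9))
print(tag_with_control_token("The summary is misleading.", reward=0.1))
```

At inference time you would then prefix the prompt with `<|good|>` to ask for high-reward completions.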

However, using scalar rather than discrete rewards could have the following benefits:

The ability to specify a higher reward than the maximum reward seen during training. For example, suppose the maximum reward obtained during training was 1. During inference, you could set the target reward to 2. Would we get a better result, as demonstrated in the decision transformer paper? (Thanks @Tomek Korbak for pointing this out)
More efficient training since we don’t lose information during discretization (yet to be empirically shown?)
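Both points can be pictured with a toy example. It assumes the scalar reward enters the model through a linear projection into embedding space, which is an assumption for illustration and not a description of this library's internals; `bucket`, `reward_embedding`, and the weights are all hypothetical.

```python
from typing import List


def bucket(reward: float, threshold: float = 0.5) -> str:
    """Discretize a reward into one of two control tokens."""
    return "<|good|>" if reward >= threshold else "<|bad|>"


def reward_embedding(reward: float, weight: List[float], bias: List[float]) -> List[float]:
    """Map a scalar reward to a vector with a linear projection."""
    return [w * reward + b for w, b in zip(weight, bias)]


# Made-up projection parameters; in a real model these would be learned.
weight = [0.5, -1.0, 2.0]
bias = [0.0, 1.0, 0.0]

# 0.6 and 1.0 stand in for rewards seen in training; 2.0 is above the training maximum.
for r in (0.6, 1.0, 2.0):
    print(r, bucket(r), reward_embedding(r, weight, bias))
```

All three rewards collapse to the same `<|good|>` token, while the scalar version gives the model a distinct input for each, including the out-of-range target of 2. Whether the model actually produces better completions for that input is exactly the open question above.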

While the library is still in early development, it can already be used for offline training of GPT-2.

I’m writing this post earlier rather than later to:

Get feedback on whether this is useful, or whether it isn’t and I should spend time on something else :)
Get feedback on how to improve the library if the above statement is true.
Save someone hours trying to do the same thing (just in case that happens)