Surprised no one has posted about this paper from Anthropic, NYU and the University of Sussex yet:
Instead of fine-tuning on human preferences, they incorporate human feedback directly in the pre-training phase, conditioning the model on <good> or <bad> feedback tokens prepended to the training sequences.
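Roughly, the data-preparation step looks something like the following. This is my own minimal sketch, not the paper's code; the threshold-based scoring function standing in for a reward model or classifier is an assumption for illustration:

```python
# Sketch of conditional pre-training data preparation: each training
# sequence gets a <good> or <bad> control token prepended, based on a
# sequence-level score (e.g. from a toxicity classifier or reward model).

GOOD, BAD = "<good>", "<bad>"

def tag_sequence(tokens, score, threshold=0.0):
    """Prepend a feedback token based on the sequence-level score."""
    tag = GOOD if score >= threshold else BAD
    return [tag] + tokens

# Toy corpus with hypothetical scores.
corpus = [
    (["the", "movie", "was", "great"], 0.9),
    (["some", "undesirable", "text"], -0.7),
]

tagged = [tag_sequence(toks, s) for toks, s in corpus]
print(tagged[0])  # -> ['<good>', 'the', 'movie', 'was', 'great']
```

At inference time, generation is then conditioned on the <good> prefix to steer the model toward preferred outputs.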
They find this to be Pareto-optimal among the five pre-training objectives they consider: it greatly reduces the amount of undesired outputs while retaining standard LM downstream performance, AND it outperforms RLHF fine-tuning in terms of preference satisfaction.
This conditioning is very reminiscent of the Decision Transformer, where scalar reward tokens are prepended to the input. I believe CICERO also does something similar, conditioning on Elo scores during dialogue generation training.
From a discussion with James Chua on AISS's Slack, we noted similarities between this work and Charlie Steiner's Take 13: RLHF bad, conditioning good. James is developing a library ("conditionme") specifically for rating-conditioned language modelling and was looking for feedback, which is what prompted the discussion. We figured a potential direction for future work is extending the conditioning from the discrete <good>/<bad> tokens to scalar rewards, which James pointed out requires some care with the tokenizer; he hopes to address this in part with conditionme.
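One simple way to sidestep the tokenizer issue, sketched below under my own assumptions (not necessarily how conditionme handles it), is to quantize the scalar reward into a small fixed set of dedicated control tokens, so each rating maps to exactly one token instead of being split into digit pieces by a subword tokenizer:

```python
# Sketch: map a scalar reward in [0, 1] to one of N dedicated control tokens,
# e.g. <reward_0> ... <reward_9>. Writing the raw number as text ("0.73")
# would typically be split by a subword tokenizer into several tokens,
# making the conditioning signal harder for the model to pick up.

N_BUCKETS = 10

def reward_token(reward: float) -> str:
    """Quantize a reward in [0, 1] into one of N_BUCKETS control tokens."""
    reward = min(max(reward, 0.0), 1.0)                # clamp to [0, 1]
    bucket = min(int(reward * N_BUCKETS), N_BUCKETS - 1)
    return f"<reward_{bucket}>"

print(reward_token(0.73))  # -> <reward_7>
print(reward_token(1.0))   # -> <reward_9>
```

These tokens would then be registered with the tokenizer as special tokens so each is guaranteed to encode to a single id.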