Jacob_Hilton comments on Jailbreaking ChatGPT on Release Day

Jacob_Hilton 5 Dec 2022 0:58 UTC
7 points
1
I would wildly speculate that “simply” scaling up RLHF ~100x, while paying careful attention to rewarding models appropriately (which may entail modifying the usual training setup, as discussed in this comment), would be plenty to get current models to express calibrated uncertainty well. However:
- In practice, I think we’ll make a lot of progress in the short term without needing to scale up this much by using various additional techniques, some that are more like “tricks” (e.g. teaching the model to generally express uncertainty when answering hard math problems) and some more principled (e.g. automating parts of the evaluation).
- Even ~100x is still much less than pre-training (e.g. WebGPT used ~20k binary comparisons, compared to ~300b pre-training tokens for GPT-3). The difficulty of course is that higher-quality data is more expensive to collect. However, most of the cost of RLHF is currently employee hours and compute, so scaling up data collection ~100x might not be as expensive as it sounds (although it would of course be a challenge to maintain data quality at this scale).
- Even though scaling up data collection will help, I think it’s more important for labs to be prioritizing data quality (i.e. “reducing bias” rather than “reducing variance”): data quality issues are in some sense “scarier” in the long run, since they lead to the model systematically doing the wrong thing (e.g. deceiving the evaluators) rather than defaulting to the “safer” imitative pre-training behavior.
- It’s pretty unclear how this picture will evolve over time. In the long run, we may end up needing much less extremely high-quality data, since larger pre-trained models are more sample efficient, and we may get better at using techniques like automating parts of the evaluation. I’ve written more about this question here, and I’d be excited to see more people thinking about it.
In short, sample efficiency is a problem right now, but not the only problem, and it’s unclear how much longer it will continue to be a problem for.