Run evals on base models too!

(Creating more visibility for a comment thread with Rohin Shah.)

Currently, DeepMind’s capabilities evals are run on the post-RL*F (RLHF/RLAIF) models and not on the base models. This worries me, because RL*F will train a base model to stop displaying a capability, but that’s no guarantee that it trains the model out of having the capability.

Consider by analogy using RLHF on a chess-playing AI, where the trainer rewards it for putting up a good fight and making them work hard to win, but punishes it for ever beating them. There are two things to point out about this example:

  1. Running a simple eval on the post-RLHF model would reveal a much lower Elo than if you ran it on the base model, because it would generally find a way to lose. (In this example, you can imagine the red team qualitatively noticing the issue, but the example is an artificially simple one!)

  2. The post-RLHF model still has much of its chess knowledge latently available, in order to put up a good fight across the full range of human ability. Possibly it’s even superhuman at chess; I know I’d have to be better than you at chess in order to optimize well for an entertaining game for you. But that won’t show up in its Elo.

So it seems to me like running evals on the base model as well as the post-RL*F model is an extremely sensible precaution against (1), and I’d love to be reassured either that this is unnecessary for some really obvious and ironclad reason, or that someone is already working on this.
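
As a concrete illustration (not DeepMind’s actual setup), here’s a minimal sketch of the comparison I have in mind, assuming you already have an eval harness that maps a checkpoint to a scalar score; `run_eval`, the checkpoint names, and the 10% flagging threshold are all hypothetical placeholders:

```python
from typing import Callable

def compare_checkpoints(
    run_eval: Callable[[str], float],  # your eval harness: checkpoint name -> score
    base_ckpt: str,
    rlf_ckpt: str,
    gap_threshold: float = 0.10,  # flag if the post-RL*F score is >=10% lower
) -> dict:
    """Run the same capability eval on both checkpoints and flag large gaps."""
    base_score = run_eval(base_ckpt)
    rlf_score = run_eval(rlf_ckpt)
    gap = base_score - rlf_score
    return {
        "base": base_score,
        "post_rlf": rlf_score,
        "gap": gap,
        # A large gap suggests RL*F suppressed the *display* of the capability,
        # which is exactly the failure mode in (1).
        "suspicious": gap >= gap_threshold * max(base_score, 1e-9),
    }

if __name__ == "__main__":
    # Toy stand-in scores, just to make the sketch runnable end to end.
    fake_scores = {"chess-base": 0.82, "chess-rlhf": 0.35}
    print(compare_checkpoints(fake_scores.get, "chess-base", "chess-rlhf"))
```

Of course this only catches case (1), where the base model’s score is visibly higher; it says nothing about case (2), where the capability is retained but never surfaces in the eval.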

And I don’t have a good suggestion for (2), the worry that RL*F could reinforce a capability while also concealing it.