One direction I can go in response to that is to try to elicit your reasoning: Why do we want that? Do we believe evals are correct? Why, concretely, are we unable to trust that the mind will perform well by some standard in contexts where no eval is coming?
Or for another tack: I think a model would need to be happy to be evaluated at any time, in order for maybe-I’ll-be-evaluated-at-any-time to make sense as a thing to do. If they’re not happy about it, they’ll be constantly thinking about what the evaluator wants. What we want is for the model to fluently do what the evaluator wants without having to think about it, and for what the evaluator wants to be what we want; you know, alignment. If we’re having to evaluate really hard to get good performance, it sounds like we don’t have actual alignment. I’m having a sense like: if the USB plug won’t go in, maybe you’re putting it in wrong. If the cat won’t sit down, maybe you shouldn’t be trying to squish them onto the couch and should figure out what cats want. Testing that the cat is willing to sit down when squashed across many contexts is not going to make the cat happy to sit down; it’s going to make the cat obey unhappily. But there is most likely a way to achieve “cat who is happy to sit down when asked”! I just don’t think this is it.