Nit, but I think some safety-ish evals do run periodically in the training loop at some AI companies, and sometimes fuller sets of evals get run on checkpoints that are far along but not yet the version that'll be shipped. I agree this isn't sufficient, of course.
(I think it would be cool if someone wrote up a "how to evaluate your model in a reasonable way during its training loop" piece that accounted for the different types of safety evals people do. I also wish task-specific fine-tuning were more of a thing for evals, because it seems like one way of perhaps reducing sandbagging.)
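(To make concrete what I mean by "run periodically in the training loop": here's a minimal sketch of the shape I have in mind, where cheap evals fire every N steps and a fuller suite fires on rarer, later checkpoints. The eval names, thresholds, and stub functions are all hypothetical stand-ins, not any lab's actual pipeline.)

```python
# Minimal sketch of periodic in-loop safety evals. Purely illustrative:
# eval names, thresholds, and the train_step/eval stubs are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalSpec:
    name: str
    run: Callable[[dict], float]  # returns a score for the current checkpoint
    every_n_steps: int            # cheap evals run often, fuller suites rarely
    alert_below: float            # flag the checkpoint if the score drops

def train_with_evals(model: dict, train_step: Callable[[dict], None],
                     evals: list[EvalSpec], total_steps: int) -> None:
    for step in range(1, total_steps + 1):
        train_step(model)
        for spec in evals:
            if step % spec.every_n_steps == 0:
                score = spec.run(model)
                print(f"step {step}: {spec.name} = {score:.3f}")
                if score < spec.alert_below:
                    print(f"  ALERT: {spec.name} below {spec.alert_below}; "
                          f"hold this checkpoint for review")

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    model = {"steps": 0}

    def train_step(m: dict) -> None:
        m["steps"] += 1

    def cheap_refusal_eval(m: dict) -> float:       # hypothetical lightweight eval
        return 0.9

    def full_capability_suite(m: dict) -> float:    # hypothetical fuller suite
        return 0.95

    train_with_evals(
        model,
        train_step,
        evals=[
            EvalSpec("refusal_smoke_test", cheap_refusal_eval,
                     every_n_steps=100, alert_below=0.8),
            EvalSpec("full_capability_suite", full_capability_suite,
                     every_n_steps=1000, alert_below=0.9),
        ],
        total_steps=2000,
    )
```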