This post summarizes the basic structure of high-stakes control evaluations (“win/continue/lose scenarios”) and the building blocks of blue team strategies (“E/R/A protocols”).
I found this post quite helpful for understanding control evals. I somewhat strongly recommend people new to control read this post before reading high-stakes control papers, e.g. the original control paper and Ctrl-Z.
Nit: It’s not totally clear what counts as a control “protocol”. The OG post on control defines it as “a proposed plan for training, evaluation, and deployment (which we’ll refer to as a whole as a protocol)”, but E/R/A protocols exclude training (including online training in deployment), which we might want to do for the untrusted policy and/or monitor. It might be nice to reserve the word “protocol” for “everything we might want to do over the lifetime of a model possibly able to cause concentrated failures” and have a different word for “a set of things we typically do together in deployment”.