Cleaner version: this is not a technical agenda. This is not something that would elicit interesting research questions from a technical alignment researcher. There are however interesting claims:
what a safe system ought to be like; it proposes three scales describing its reliability;
how far up the scales we should aim for at minimum;
how low on the scales currently large deployed models are.
While it positions a variety of technical agendas (mainly those of the co-authors) on the scales, the paper does not advocate for a particular approach, only the broad direction of “here are the properties we would like to have”. Uncharitably, it’s a reformulation of the problem.
The scales can be useful for comparing the agendas that belong to the “let’s prove that the system adheres to this specification” family. The paper makes no claims about what the specification entails, nor about the failure modes of various (combinations of) levels.
I appreciate this paper as a gateway to the related agendas and relevant literature, but I’m not enthusiastic about it.
My raw and mostly confused/snarky comments as I was going through the paper can be found here (third section).