The actual details contain some non-obvious insights, along with elegant concepts that generalize things the safety community has been gesturing at. For instance, the safety community has been conflating, under “risk thresholds”, two cleanly distinct notions from risk management: Key Risk Indicators (actual measurements of risk) and risk tolerance (your quantified preference for risk, independent of any test). This has caused a lot of confusion and hidden unreasonable choices for quite a while.
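To make the distinction concrete, here’s a minimal sketch (in Python; the class names, fields, and numbers are all invented for illustration, not drawn from any real framework) of how the two notions stay separate until the moment they’re compared:

```python
from dataclasses import dataclass

# Hypothetical sketch: names and fields are illustrative only.

@dataclass
class RiskTolerance:
    """Quantified preference for risk, committed to *before* any test is run."""
    max_acceptable_level: float

@dataclass
class KeyRiskIndicator:
    """An actual measurement that proxies the current level of risk."""
    name: str
    value: float  # produced by an eval, a monitoring pipeline, etc.

def breaches_tolerance(kri: KeyRiskIndicator, tolerance: RiskTolerance) -> bool:
    # The comparison is the only place the two notions meet; folding both
    # into a single "risk threshold" hides which side of the inequality
    # you are actually changing when you move the threshold.
    return kri.value > tolerance.max_acceptable_level
```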
People have also been conflating risk modeling and evals for quite a long time, because the AI field was built around evals. Once you have the clear view that evals are just an operationalization of risk models, it becomes clearer that you can do most of your risk modeling earlier in the lifecycle (i.e. before even touching a neural net), before having built a single eval, and that evals are downstream of this.
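As a toy illustration of that ordering (all scenario and capability names here are invented), the risk model can exist as a plain artifact before any eval code does, and the evals fall out of it:

```python
# Hypothetical sketch: risk modeling happens first, as a plain artifact;
# evals are then derived from it. Scenario/capability names are invented.

risk_model: dict[str, list[str]] = {
    "bio-uplift": ["wet-lab troubleshooting", "pathogen acquisition planning"],
    "cyber-offense": ["vulnerability discovery", "exploit development"],
}

# Each eval operationalizes one capability the risk model already identified.
evals_to_build = [
    (scenario, f"eval measuring: {capability}")
    for scenario, capabilities in risk_model.items()
    for capability in capabilities
]

for scenario, eval_spec in evals_to_build:
    print(scenario, "->", eval_spec)
```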
You can see more of this genre of concepts applied to frontier AI here: https://arxiv.org/pdf/2502.06656
Here’s a graph with a few of the concepts in there:
to make sure I understand correctly, are you saying that a lot of the value of having this kind of formalized structure is to make it harder for people to make intuitive but flawed arguments by equivocating terms?
are there good examples of such frameworks preventing equivocation in other industries?
Yes, that’s one value. RSPs & many of the policy debates around them would have been less messed up if there had been clarity (i.e. they turned a confused notion into the standard, which was then impossible to fix in policy discussions, making the Code of Practice flawed). I don’t know of a specific example of preventing equivocation in other industries (it seems hard to know of such examples?), but the fact that basically all industries use a set of the same concepts is evidence that they’re pretty general-purpose and repurposable.
Another is just that it helps you think about the issues in a more generalized way. For instance, once you see evaluations as a Key Risk Indicator (i.e. a proxy measure of risk), you can notice that other Key Risk Indicators could also trigger mitigations, such as actual monitoring metrics. This could enable building conditions/thresholds in RSPs that are based on monitoring metrics (e.g. “we find fewer than 5 bioterrorists successfully jailbreaking our model per year on our API”), as in the sketch below. More generalized concepts enable more compositionality of ideas, in a way that lets you skip a bunch of the trial-and-error process.
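Here’s a hedged sketch of what such a monitoring-based condition could look like (the metric and the threshold of 5/year come from the toy example above; the function and variable names are invented):

```python
# Hypothetical sketch: a monitoring metric used as a Key Risk Indicator to
# trigger mitigations, treated exactly the way an eval score would be.

TOLERANCE_PER_YEAR = 5  # pre-committed risk tolerance, set independently of any measurement

def check_monitoring_kri(successful_jailbreaks_this_year: int) -> str:
    """Compare a measured KRI (from API monitoring) against the tolerance."""
    if successful_jailbreaks_this_year >= TOLERANCE_PER_YEAR:
        return "tolerance breached: trigger mitigations"
    return "within tolerance: continue monitoring"

print(check_monitoring_kri(successful_jailbreaks_this_year=2))  # within tolerance
print(check_monitoring_kri(successful_jailbreaks_this_year=7))  # trigger mitigations
```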