[Paper] Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods

“If you cannot measure it, you cannot improve it.”—Lord Kelvin

The science of AI safety evaluations is still nascent, but it is making progress! We know much more today than we did two years ago.

We tried to make this knowledge accessible by writing a literature review and systematizing all the knowledge.

We wanted to go beyond a central source that just summarized collected knowledge, so we put a lot of effort into distillation, and disentangling concepts that are often presented in a messy way. We created original visualizations and gathered others from many different sources to accompany these explanations.

E.g. Disentangling truth, honesty, hallucination, deception and scheming.

The review provides a taxonomy of AI safety evaluations along three dimensions:

What properties do we measure? (dangerous capabilities, propensities, and control)
How do we measure them? (Behavioral and internal techniques)
How do we integrate evaluations into broader frameworks? (Model Organisms , Responsible Scaling Policies, …)

We also bring up some limitations that safety evaluations might face including things like: “sandbagging” (strategic underperformance on tests), organizational “safetywashing” (misrepresenting capability improvements as safety advancements), or other more fundamental inherent challenges like proving absence rather than presence of capabilities.

The text is available in many different places:

This paper is intended to be part of a larger body of work called the AI Safety Atlas. We think of it as chapter 5 in a comprehensive collection of literature reviews collectively forming a textbook for safety.