How Do We Evaluate AI Evaluations?

In the last few years, I have had the chance to build a few benchmarks for LLMs. That process has led me to think deeply about how we, as a community, assess the quality of our own evaluation tools. There seems to be a lot of ambiguity here, with discussions often getting tangled in specifics without a clear, overarching framework. I thought I would write this short piece to offer a simple way to structure our thinking about how we judge the quality of our evals. I would love to hear what people have to say.

When we build an evaluation set, we are essentially creating a ruler to measure the progress of AI. But how do we know if our ruler is accurate? I’ve found it helpful to think about the quality of an evaluation set across two fundamental dimensions: Impact and Representation.

Let’s first consider Impact. This dimension asks a simple but critical question: if a model were to achieve a perfect score on this evaluation, how meaningful would that be? It forces us to look beyond the immediate challenge of the task and consider its real-world significance. For a long time, many of our benchmarks have focused on narrow, academic tasks. While these have been instrumental in driving progress, a high score on them does not always translate to a meaningful advance in a model’s utility. The true impact of a model’s capability is felt when it can perform tasks that are economically valuable, operate safely in high-stakes environments like medicine or finance, or navigate complex ethical terrain without causing harm. An evaluation with high impact is one that, if aced, would signify a genuine leap forward in one of these areas. It measures something that matters.

The second, and perhaps more technically nuanced, dimension is Representation. This gets to the heart of how well our evaluation set mirrors the true, underlying distribution of the capability we claim to be measuring. Think of the “test distribution” as the vast, almost infinite space of all possible questions, problems, and creative prompts related to a particular skill. Our evaluation set is, and can only ever be, a tiny sample drawn from this immense space. The critical question then becomes: how well does this small sample represent the whole?
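To make the “tiny sample” point concrete, here is a minimal sketch, with entirely made-up numbers, of what a score on a finite evaluation set can and cannot tell us about the underlying distribution. The function and figures below are illustrative, not from any real benchmark.

```python
import math

def accuracy_confidence_interval(correct: int, total: int, z: float = 1.96):
    """Rough normal-approximation interval for the accuracy we would expect
    on the full test distribution, given results on a finite eval sample."""
    p_hat = correct / total
    stderr = math.sqrt(p_hat * (1 - p_hat) / total)
    return p_hat, max(0.0, p_hat - z * stderr), min(1.0, p_hat + z * stderr)

# Hypothetical numbers: a model answers 850 of 1,000 eval items correctly.
p, lo, hi = accuracy_confidence_interval(850, 1000)
print(f"observed accuracy {p:.3f}, ~95% CI [{lo:.3f}, {hi:.3f}]")
```

Note that this interval only covers sampling noise among items like the ones we happened to include. It says nothing about whether those items were drawn from the right slice of the distribution in the first place, which is the harder part of Representation.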

This is where the pervasive problem of data contamination comes into play, and I believe that poor representation is its root cause. We often talk about contamination as a failure of process, where test data accidentally leaks into training data. But I think it’s more helpful to see it as a failure of representation.
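The “failure of process” view is usually operationalized with overlap heuristics that flag eval items whose text appears in the training corpus. Here is a minimal sketch of one such check, assuming hypothetical `eval_items` and `training_docs` lists; real decontamination pipelines are considerably more involved.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Lowercased word n-grams; 13-gram overlap is a common rough heuristic."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(eval_items: list[str], training_docs: list[str], n: int = 13) -> list[int]:
    """Return indices of eval items that share any n-gram with the training corpus."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(eval_items) if ngrams(item, n) & train_grams]
```

Even a perfectly clean result from a check like this only addresses the process failure. It does nothing to tell us whether the eval samples the right part of the distribution, which is the representation failure I am pointing at.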

Consider this thought experiment: what if we could create an evaluation set that was a perfect representation of the test distribution? What if it contained every possible question, every logical puzzle, every creative challenge? In such a scenario, training a model on this “evaluation set” would not be a bad thing at all. We would want our model to “contaminate” itself with this data, as doing so would mean it had learned the answers to all the questions in the world. It would have achieved a form of omniscience.

Of course, this is a theoretical impossibility. The true distribution of knowledge and reasoning is not only practically infinite but also constantly expanding. This is why the concept of an unseen test set is so sacred in machine learning. Because we can never create a perfectly representative set, we rely on a small, unseen sample to estimate how a model will perform on the rest of that vast, unseen distribution. When that sample is no longer unseen, it loses all its statistical power. Its ability to represent the broader landscape vanishes. It stops being a test of generalization and becomes a test of memorization. A high score on a contaminated set tells us nothing about how the model will handle the countless variations of problems it has never encountered before.
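To make this last point concrete, here is a toy simulation in which everything is made up for illustration: a “model” that has merely memorized its training items scores almost perfectly on a contaminated eval while its score on held-out items collapses to chance.

```python
import random

random.seed(0)

# Toy setup: tasks are integers; the "skill" is knowing each task's answer.
universe = list(range(100_000))                   # stand-in for the true test distribution
train_set = set(random.sample(universe, 5_000))   # items the model has memorized

def memorizing_model(task: int) -> bool:
    """Correct iff the task was memorized; otherwise guesses (25% chance of being right)."""
    return task in train_set or random.random() < 0.25

def score(tasks: list[int]) -> float:
    return sum(memorizing_model(t) for t in tasks) / len(tasks)

contaminated_eval = random.sample(sorted(train_set), 500)                 # leaked into training
clean_eval = random.sample([t for t in universe if t not in train_set], 500)

print(f"contaminated eval: {score(contaminated_eval):.2f}")  # ~1.00
print(f"held-out eval:     {score(clean_eval):.2f}")         # ~0.25
```

The contaminated score looks like mastery, but it is an estimate of nothing beyond the memorized items themselves.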

Thinking about evaluations through this lens of Impact and Representation can help simplify how we approach building better ones. It encourages us to ask more profound questions. Instead of just asking, “Is this benchmark hard?” we can ask, “Is this benchmark impactful?” Instead of just asking, “Is this data clean?” we can ask, “Is this data truly representative of the skill we want to measure?”

This framework is by no means complete, and it leads me to several open questions that I would love to discuss with the community:

  • Beyond qualitative assessment, how can we begin to formally quantify the “Impact” of an evaluation set?

  • As models are trained on ever-larger portions of the internet, what novel strategies can we employ to create evaluation data that is truly “unseen” and representative, especially for complex, multi-step reasoning tasks?

  • How should the relationship between Impact and Representation be balanced? Could a highly impactful but less representative evaluation still be valuable?

  • Are Impact and Representation the only two dimensions we should consider, or are there other fundamental aspects of evaluation quality that we are currently overlooking?

I look forward to hearing your thoughts.