# How to Evaluate Evaluations: A General Method
Is it measuring one very specific thing?
Probably not. Does the thing it claims to measure actually have one cause, or several things that could be causing it? An evaluation is fundamentally a search process.
Searching for a more specific thing, and making your process not ping for anything other than that one specific thing, makes it more useful when you're searching in something as noisy as an AI model.
Most evals don't even claim to measure one precise thing at all.
The ones that do: can you think of a way to split that thing up into more precise things? If so, there are probably at least n more things to split it into, with n negatively correlated with how good you are at this.
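To make the splitting concrete, here's a minimal sketch on a toy arithmetic task. The probes and their names are mine, purely illustrative, not from any particular eval: instead of one aggregate pass/fail, each probe is one specific thing that could be causing a failure, so a ping tells you which cause fired.

```python
import re

def final_answer(response: str) -> str | None:
    # Pull out whatever the model committed to as a final answer, if anything.
    m = re.search(r"answer\s*[:=]\s*(-?\d+)", response.lower())
    return m.group(1) if m else None

def probe_response(response: str, expected: str) -> dict[str, bool]:
    # Each key is one specific, separately-checkable cause of a bad aggregate score.
    ans = final_answer(response)
    return {
        "gave_any_final_answer": ans is not None,
        "final_answer_correct": ans == expected,
        "showed_intermediate_work": len(response.split()) > 5,  # crude stand-in probe
    }

print(probe_response("Work: 12 * 3 = 36. Answer: 36", "36"))
```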
What data about the model is it using to do the search?
Order of things about the model that give useful data, from least useful to most useful:
- Outputs
- Logits
- Activations
- Weights
This is also the order of measurement difficulty, but the data is useful enough that any serious eval engineer (there is one close-to-semi-serious eval in the world atm, that I know of) would try to solve this by trying harder.
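A minimal sketch of what each level looks like in practice, assuming a Hugging Face transformers causal LM (the model name is just a placeholder); the only point is that all four kinds of data are reachable if you try harder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; any causal LM exposes the same four levels
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
inputs = tok("The eval prompt goes here.", return_tensors="pt")

# 1. Outputs: sampled text, the cheapest and least informative signal.
out_ids = model.generate(**inputs, max_new_tokens=20)
text = tok.decode(out_ids[0], skip_special_tokens=True)

# 2. Logits: the full next-token distribution, not just whatever got sampled.
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab)

# 3. Activations: intermediate hidden states (or forward hooks on specific modules).
with torch.no_grad():
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states

# 4. Weights: the parameters themselves, the richest and hardest data to use.
weights = {n: p.detach() for n, p in model.named_parameters()}
```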
Have they tried at all to red team it? Can I think of a better red teaming method in 15 seconds than what they said?
There’s one group of people, other than myself and people I’ve advised, trying to red team evals atm. Trying to do some red teaming of an eval at all isn’t necessarily terrible, but the red teaming will likely be terrible, and they’ll be vastly overconfident in the novelty and value of their work from that tiny bit of semi-rigorous-looking motion.
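For calibration, one cheap red-team pass I'd expect at minimum (everything below is hypothetical and schematic, not anyone's actual method): run the eval against degenerate baselines that obviously lack the capability and see whether they still score close to a real model. If they do, the eval is pinging on something other than what it claims to measure.

```python
import random

def run_eval(model_fn, prompts, grade):
    # model_fn: prompt -> response text; grade: (prompt, response) -> score in [0, 1].
    return sum(grade(p, model_fn(p)) for p in prompts) / len(prompts)

# Degenerate baselines that should obviously fail a capability eval.
def constant_model(prompt):
    return "I cannot help with that."

def copy_prompt_model(prompt):
    return prompt

def random_choice_model(prompt, options=("yes", "no", "maybe")):
    return random.choice(options)

def red_team_eval(prompts, grade, real_model_score):
    baselines = {
        "constant": constant_model,
        "copy_prompt": copy_prompt_model,
        "random": random_choice_model,
    }
    for label, fn in baselines.items():
        score = run_eval(fn, prompts, grade)
        # A degenerate baseline scoring near the real model is a red flag.
        print(f"{label}: {score:.2f} vs real model {real_model_score:.2f}")
```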
status: typed this on my phone just after waking up, after seeing someone ask for my opinion on another trash static-dataset-based eval, and also for my general method for evaluating evals. A bunch of stuff here is unfinished.
working on this course occasionally: https://docs.google.com/document/d/1_95M3DeBrGcBo8yoWF1XHxpUWSlH3hJ1fQs5p62zdHE/edit?tab=t.20uwc1photx3