I’ve been running a set of micro-evals for my own workflows, but I think the main obstacle is the fuzziness of real-life tasks. Chess has really clean failure signals, plus you get metrics like centipawn loss for free.
It takes significant mental effort to look at your own job/work and create micro-benchmarks that aren’t completely subjective. The trick that’s helped me the most is to steal a page from test-driven development (TDD):
- Write the oracle first; if I can’t evaluate the output as true/false, the task is too mushy
- Shrink the scope until it breaks cleanly
- Iterate like unit tests: add new edge cases whenever a model slips through or reward-hacks
The payoff is being able to make clean dashboards that tell you “XYZ model passes 92% of this subcategory of tests”.
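To make that concrete, here’s a rough sketch of what such a harness could look like; the `ask_model` stub, the two example cases, and the category names are placeholders for illustration, not my actual suite:

```python
import re
from collections import defaultdict

def ask_model(prompt: str) -> str:
    # Placeholder: swap in whatever API or CLI call you actually use.
    return ""

# Each case is (category, prompt, oracle); the oracle returns True/False, never a score.
CASES = [
    ("date-math",
     "What date is 90 days after 2024-01-15? Answer with YYYY-MM-DD only.",
     lambda out: out.strip() == "2024-04-14"),
    ("regex",
     "Write a regex that matches a ZIP+4 code like 12345-6789. Answer with the regex only.",
     lambda out: re.fullmatch(out.strip(), "12345-6789") is not None),
]

def run_suite() -> None:
    results = defaultdict(list)
    for category, prompt, oracle in CASES:
        try:
            passed = bool(oracle(ask_model(prompt)))
        except Exception:
            passed = False  # malformed output is just another failure
        results[category].append(passed)
    # The "dashboard": pass rate per subcategory.
    for category, passes in results.items():
        print(f"{category}: {100 * sum(passes) / len(passes):.0f}% "
              f"({sum(passes)}/{len(passes)})")

if __name__ == "__main__":
    run_suite()
```

Treating an exception inside an oracle as an ordinary failure keeps the signal strictly binary, even when the model returns prose instead of the requested format.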