<standup_comedian>
What’s the deal with evals</standup_comedian>
epistemic status: tell me I’m wrong.
Funders seem particularly enchanted with evals, a term that seems to mean “a benchmark, but probably for scaffolded systems, and with scoring that is harder than scoring most of what we call benchmarks”.
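To make that definition concrete, here’s a minimal sketch (no particular framework; every name below is a hypothetical placeholder, not a real library API) of the difference I mean: a benchmark scores a single model call against a known answer, while an eval typically runs a scaffolded multi-step system and then needs a harder scoring step, such as a judge model applying a rubric to the whole transcript.

```python
# Minimal sketch, in no particular framework; all names are placeholders.
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    reference: str  # a known answer (benchmark) or a grading rubric (eval)


def call_model(prompt: str) -> str:
    """Placeholder for a single LLM call."""
    raise NotImplementedError


# "Benchmark": one model call per task, easy scoring (exact match).
def run_benchmark(tasks: list[Task]) -> float:
    correct = sum(
        call_model(t.prompt).strip() == t.reference.strip() for t in tasks
    )
    return correct / len(tasks)


# "Eval": a scaffolded, multi-step system produces a transcript...
def run_scaffolded_agent(task: Task, max_steps: int = 5) -> str:
    transcript: list[str] = []
    for _ in range(max_steps):
        step = call_model(task.prompt + "\n" + "\n".join(transcript))
        transcript.append(step)
        if "DONE" in step:  # placeholder stopping condition
            break
    return "\n".join(transcript)


# ...and scoring is the hard part: a judge model (or a human) applies a
# rubric to the whole transcript instead of checking a string match.
def judge(transcript: str, rubric: str) -> float:
    verdict = call_model(
        f"Rubric:\n{rubric}\n\nTranscript:\n{transcript}\n\nScore from 0 to 1:"
    )
    return float(verdict.strip())


def run_eval(tasks: list[Task]) -> float:
    scores = [judge(run_scaffolded_agent(t), t.reference) for t in tasks]
    return sum(scores) / len(scores)
```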
I can conjure a theory of change. It goes something like: 1. if measurement is bad, then we’re working with vibes, so we’d like to make measurement good; 2. if measurement is good, then we can demonstrate to audiences (especially policymakers) that warning shots are substantial signals rather than vibes. (Question: what am I missing?)
This is at least a coherent reason why dangerous capability evals pay into governance strats in a way that might make philanthropic pressure correct. It relies on cruxes I don’t share, like the idea that a principled science of measurement would outperform vibes in a meme war in the first place, but it at least has a crux that works as a fulcrum.
Everything worth doing is at least a little dual use, so I’m not attacking anybody. But it’s a faustian game: like benchmarks, evals pump up races, cuz everyone loves it when number go up. The primal urge to see number go up infects every chart with an x and y axis. In other words, evals come with steep capabilities externalities, because they spray the labs with more charts that number hasn’t gone up on yet, daring and challenging the labs to step up their game. So a theory of change in which, in spite of this dynamic, an eval is differentially defensive has to meet a really high standard.
A further problem: the theory of change where we get really high quality / inarguable signals as warning shots, instead of vibes as warning shots, doesn’t even apply to most of the evals I’m hearing about from the nonprofit and independent sector. I’m hearing about evals that make me go, “huh, I wonder what’s differentially defensive about that?”, and I don’t get good answers. Moreover, an ancient wisdom says “never ask a philanthropist for something capitalism gives you for free”. The case that an individual eval is unlikely to be created by default lab incentives needs to be especially strong, cuz when it isn’t strong, one is literally doing the lab’s work for them.
If I’m allowed to psychoanalyze funders rather than discussing anything at the object level, I’d speculate that funders like evals because:
1. If you funded the creation of an eval, you can point to a concrete thing you did. Compare to funding theoretical technical research, which has a high chance of producing no tangible outputs; or funding policy work, which has a high chance of not resulting in any policy change. (Streetlight Effect.)
2. AI companies like evals, and funders seem to like doing things AI companies like, for various reasons including (a) the thing you funded will get used (by the AI companies) and (b) you get to stay friends with the AI companies.
This might blur the distinction between different kinds of evals. While it’s true that most evals are just about capabilities, some could be positive for improving LLM safety.
I’ve created 8 (soon to be 9) LLM evals. (I’m not funded by anyone; it’s mostly out of my own curiosity, not for capability, safety, or paper-publishing reasons.) Using them as examples, improving models to score well on some of them is likely detrimental to AI safety:
https://github.com/lechmazur/step_game - to score better, LLMs must learn to deceive others and hold hidden intentions
https://github.com/lechmazur/deception/ - the disinformation effectiveness part of the benchmark
Some are likely somewhat negative because scoring better would enhance capabilities:
https://github.com/lechmazur/nyt-connections/
https://github.com/lechmazur/generalization
Others focus on capabilities that are probably not dangerous:
https://github.com/lechmazur/writing - creative writing
https://github.com/lechmazur/divergent - divergent thinking in writing
However, improving LLMs to score high on certain evals could be beneficial:
https://github.com/lechmazur/goods - teaching LLMs not to overvalue selfishness
https://github.com/lechmazur/deception/?tab=readme-ov-file#-disinformation-resistance-leaderboard - the disinformation resistance part of the benchmark
https://github.com/lechmazur/confabulations/ - reducing the tendency of LLMs to fabricate information (hallucinate)
I think it’s possible to do better than these by intentionally designing evals aimed at creating defensive AIs. It might be better to keep them private and independent. Given the rapid growth of AI capabilities, the apparent lack of interest in an international treaty (as seen at the recent Paris AI summit), and the competitive race dynamics among companies and nations, specifically developing an AI to protect us from threats from other AIs or AIs + humans might be the best we can hope for.
(potentially relevant meme)
That’s an edited version of this:
My neighbor told me coyotes keep eating his outdoor cats so I asked how many cats he has and he said he just goes to the shelter and gets a new cat afterwards so I said it sounds like he’s just feeding shelter cats to coyotes and then his daughter started crying.
Yep—I saw other meme-takes like this, assumed people might be familiar enough with it.
I was familiar enough to recognize that it was an edit of something I had seen before, but not familiar enough to remember what the original was.