Benchmarks are how people optimize for capabilities. Working on them has been a core method of increasing capabilities over the past 10+ years of machine learning, at minimum since I started paying attention in 2015. If you want to improve endgame (ASI) alignment outcomes, measuring capabilities is a distraction at best.
Edit: whether this applies to bad-behavior benchmarks is contested below. I'm happy to restrict my claim to benchmarks that measure things labs would have some reason to optimize for, and I still believe that covers this topic.
...🤔 …well played, Dan, well played.
(I don’t think Dan Hendrycks deliberately released a benchmark with errors, but it would be hilarious, especially given his gripes about GPQA label noise.)
I also very much don't think the errors were on purpose. Abstaining from capability measurement isn't likely to do much besides save one's own time for something more useful; I just hope people who are excited to do alignment work don't end up wasting their time on it. The errors probably won't slow things down much, since someone else will make a benchmark anyway. But let a capabilities person do it, so you can spend your time on something more useful, like figuring out how to algorithmically generate arbitrary amounts of guaranteed-aligned training data from interacting with humans, with some sort of probabilistic guarantee that it represents behavior that is in fact good rather than merely easy to generate. (Edit: which is probably done via a somewhat involved theoretical breakthrough about how to incentivize aligned-by-construction exploration, or some such thing.)
While I definitely agree there are negative externalities here, I also think there are extremely positive externalities from key decision makers being better informed, especially about how close we are to capabilities like AI-enabled bioterrorism or cybercrime, or automated R&D and an intelligence explosion. Information is great, and I think it generally has a fairly positive effect even if the decision maker is not highly competent. Bioterrorism and cybercrime, at least, are not things I'm concerned about AGI researchers hill-climbing on; automated R&D is much dicier.
Private benchmarks seem solid here, too.
I'm surprised to hear that you aren't concerned about negative benchmarks being hill-climbing targets for anyone. This updates me somewhat, though the hypotheses I'm still worried about are ones where the dishonest labs, whichever those turn out to be, are the main source of optimizing for bad-behavior benchmarks. I also expect that bio/chem tasks that aren't malicious-use-specific, which is the topic at hand, will get optimized for by less-dishonest labs in at least some cases.
Yeah, I feel much better about the malicious-use-specific ones. Agreed that HLE is more generic, and that this is much worse.
Consider these possibilities for what benchmarks are doing here.
1. Training AIs on quantum physics questions directly makes the AI smarter.
2. Training on quantum physics makes the AI memorize quantum trivia, but doesn't make it smarter in any deep sense.
3. Like 2, except that the existence of the benchmark makes grad student descent more effective: people learn which AI algorithms work best via human trial and error.
4. Like 3, except that public benchmarks just get memorized, making them useless for grad student descent.
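On possibility 4, whether a public benchmark has simply been memorized is at least partly checkable. Here's a minimal sketch (my own illustration, not anything proposed in the thread) of an n-gram-overlap contamination check, assuming you have plain-text access to the benchmark questions and to a sample of the training corpus; the 13-token window is an arbitrary choice, roughly in the spirit of what the GPT-3 authors reported using.

```python
# Minimal sketch of an n-gram overlap contamination check (illustrative only).
# Assumes `benchmark_questions` and `training_docs` are lists of plain-text strings.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of lowercased word n-grams in a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_fraction(benchmark_questions, training_docs, n: int = 13) -> float:
    """Fraction of benchmark questions sharing at least one n-gram with the training sample."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = sum(1 for q in benchmark_questions if ngrams(q, n) & train_ngrams)
    return flagged / max(len(benchmark_questions), 1)
```

If a large fraction of questions gets flagged, scores on that benchmark are telling you more about memorization (possibility 4) than about anything like possibilities 1 through 3.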