Quick comment on “Double Standards and AI Pessimism”:
Imagine that you have read the entire GPQA several times at normal speed, without taking notes. A week later, you answer every GPQA question with 100% accuracy. If we evaluate your capabilities as a human, we must conclude that you have extraordinary memory at the very least, or are an expert in multiple fields, or are so intelligent that you grasped entire fields just by reading a set of hard questions. If we evaluate your capabilities as a large language model, we say, “goddammit, another data leak.”
Why? Because humans are bad at memorization, so even merely good memory places you in a high quantile of intellectual ability. Computers, by contrast, are extremely good at memorization, so 100% accuracy on GPQA tells us nothing useful about the intelligence of a particular computer.
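(For context on why perfect accuracy immediately reads as memorization rather than capability: contamination checks typically look for verbatim overlap between benchmark questions and the training corpus. Here is a minimal sketch of that kind of check; the corpus, question, and the 13-gram threshold are illustrative assumptions, not any particular lab's actual pipeline.)

```python
# Illustrative sketch of a crude n-gram overlap check of the kind used to flag
# benchmark contamination ("data leak"). All inputs below are hypothetical.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, training_corpus: list[str], n: int = 13) -> bool:
    """Flag a benchmark question if any of its n-grams appears verbatim in the training data."""
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(doc, n) for doc in training_corpus)

# Hypothetical usage: if this returns True, suspect memorization, not capability.
corpus = ["... scraped web pages that may quote benchmark questions verbatim ..."]
question = "Which of the following best describes the mechanism of ...?"
print(looks_contaminated(question, corpus))
```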
We already apply “double standards” to computers in capability evaluations because computers are genuinely different from us, and that same difference is why we use “double standards” for computers in safety evaluations.