OH, I misunderstood you. “Biases point in the scarier direction” meant “METR is biased to make the results seem scarier than they are,” whereas I took it to mean “we readers, looking at their results and attempting to factor in their biases, should update towards scariness, i.e. they are if anything biased towards a less-scary presentation of their results.”
Potential biases:
The aforementioned fact that privacy levels are upper bounds on how private things actually are: afaik, METR has no airtight way to know which fully_private tasks were leaked or plagiarized (or had extremely similar doppelganger tasks coincidentally created by unrelated third parties), which public_problems had solutions posted somewhere they couldn’t see but which nevertheless made it into the training data, or whether some public_solutions had really good summaries written down somewhere online.
The way later models’ training sets can only ever contain more of the benchmark tasks than earlier models’ did. (I reiterate that at least some of these tasks were first created around the apparent gradient discontinuity starting with GPT-4o.)
The fact they’re using non-fully_private challenges at all, and therefore testing (in part) the ability of LLMs to solve problems present in (some of their) training datasets. (I get that this isn’t necessarily reassuring as some problems we’d find it scary for AI to solve might also have specific solutions (or hints, or analogies) in the training data.)
The fact they’re using preternaturally clean code-y tasks to measure competence. (I get that this isn’t necessarily reassuring as AI development is arguably a preternaturally clean code-y task.)
The way Baselined tasks tend to be easier and Baselining seems (definitely slightly) biased towards making them look easier still, while Estimated tasks tend to be harder and Estimation seems (potentially greatly) biased towards making them look harder still: the combined effect would be to make progress gradients look artificially steep in analyses where Baselined and Estimated tasks both matter (see the toy sketch after this list). (To my surprise, Estimated tasks didn’t make much difference to (my reconstruction of) the analysis, due (possibly/plausibly/partially) to current task horizons being under an hour; but if someone used HCAST to evaluate more capable future models without doing another round of Baselining . . .)
Possibly some other stuff I forgot, idk. (All I can tell you is I don’t remember thinking “this seems like it’s potentially biased against reaching scary conclusions” at any point when reading the papers.)
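To make the “artificially steep” worry a bit more concrete, here’s a toy back-of-the-envelope sketch. Every number and bias factor in it is invented for illustration, and it is not METR’s actual horizon-fitting pipeline: the point is just that if Baselining shaves time off the easy tasks that mostly set an earlier model’s horizon, while Estimation pads the hard tasks that mostly set a later model’s horizon, the implied doubling time shrinks even though nothing about the models changed.

```python
import math

# Toy illustration (NOT METR's actual methodology) of how opposite-signed
# timing biases on easy (Baselined) vs hard (Estimated) tasks can steepen
# a measured horizon-growth trend. All numbers are made up.

# Hypothetical true 50%-horizons (minutes) for an earlier and a later model,
# and the calendar gap between them.
true_horizon_early = 6.0   # set mostly by easy, Baselined tasks
true_horizon_late = 48.0   # set mostly by hard, Estimated tasks
months_between = 12.0

# Assumed measurement biases: Baselining records easy tasks as a bit shorter
# than they really are, Estimation records hard tasks as longer.
baselining_deflation = 0.9
estimation_inflation = 1.5

measured_horizon_early = true_horizon_early * baselining_deflation
measured_horizon_late = true_horizon_late * estimation_inflation

def doubling_time(h0, h1, months):
    """Months per horizon doubling implied by growth from h0 to h1 over `months`."""
    return months * math.log(2) / math.log(h1 / h0)

print(f"true doubling time:     {doubling_time(true_horizon_early, true_horizon_late, months_between):.1f} months")
print(f"measured doubling time: {doubling_time(measured_horizon_early, measured_horizon_late, months_between):.1f} months")
# The measured trend doubles faster (looks steeper) purely because the two
# biases push the endpoints of the trend in opposite directions.
```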
Thanks!