Having worked at METR for some months last year, I just want to chime in to add that they have indeed seen the skulls. This post does a great service to the broader public by going into many important points at length. But these issues and others are also very much top of mind at METR, which is one of the reasons why they caveat results extensively in their publications.
If you haven’t been in touch or visited them already, I highly recommend it. They’re pretty awesome and love to discuss this sort of stuff!
I think this is true, but I also don’t think seeing the skulls implies actually dealing with them (and I wish Scott’s post were crossposted here so I could argue with it). Like, a critique of AI evaluations that people could have been making for the last 5+ years (probably even 50), and which remains true today, is: “Evaluations do a poor job measuring progress toward AGI because they lack external validity. They test scenarios that are much narrower, better defined, more contrived, easier to evaluate, etc. than the skills an AI would need to be able to do robustly in order for us to call it AGI.” I agree that METR is well aware of this critique, but it still very much applies to HCAST, RE-Bench, and SWAA. Folks at METR seem especially forthcoming about the limitations of their work in this regard, and yet the critique still holds. (I don’t think I’m disagreeing with you at all.)