Maybe I’m not looking in the right place, but the obvious question about this benchmark is: how do humans fare on it? If humans score 0 too, then models scoring 0 is not much of a signal (even though the authors claim this bench is supposed to be closer to the work of a real engineer).