I guess one thing I want to know is: how exactly does the scoring work? I can imagine something like this: they ran the model a zillion times on each question, and if any one of the answers was right, that question got counted in the light blue bar. Something that plainly silly probably isn’t what happened, but it could be something similar.
If the model actually just submitted one answer to each question and got a quarter of them right, then how much compute it used doesn’t particularly matter to me.
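To make the distinction concrete, here is a toy sketch of the two scoring schemes I have in mind. The data is entirely made up and the function names are hypothetical; I have no idea whether either one matches what was actually done.

```python
# Toy data, entirely made up for illustration: each inner list holds one
# question's attempts, with True meaning that attempt was correct.
attempts = [
    [False, True, False, False],
    [False, False, False, False],
    [True, True, False, True],
    [False, False, False, True],
]


def any_of_n_score(attempts):
    """Fraction of questions where at least one attempt was right."""
    return sum(any(question) for question in attempts) / len(attempts)


def single_attempt_score(attempts):
    """Fraction of questions where the first attempt (the only one submitted) was right."""
    return sum(question[0] for question in attempts) / len(attempts)


print(any_of_n_score(attempts))        # 0.75
print(single_attempt_score(attempts))  # 0.25
```

Same model outputs, very different headline numbers, which is why I care which scheme was used.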
Thanks. Is “pass@1” some kind of lingo? (It seems like an ungoogleable term.)