Wow, that is surprising! Even after considering the suite of caveats one applies to benchmarks as evidence, I am very surprised.
All things considered I think I still lean harder on self-reports from lab and non-lab technical staff regarding the elicitation delta, but I’m much less confident than before.
[I suspect we may have another, less interesting disagreement about how economically useful current systems could be if more effort were put toward juicing them, but I’m happy to talk about that some other time; just mentioning this for completeness or something.]