TL;DR: my reaction is that I don’t really know how good these models are right now.
I felt exactly the same after the Claude 3.7 post.
But actually... hasn’t LiveBench solved the evals crisis?
It specifically targets the “subjective” and “cheating/hacking” problems.
It also covers a pretty broad set of capabilities.