For benchmarks that measure markers of AI progress, I strongly suspect that publishing the benchmark and/or positive AI results on it pushes capabilities much more than publishing the simple scaffolding + fine-tuning solutions that do well on it.
You may be right. That said, I’m pretty skeptical of fully general arguments against testing what LLMs are capable of; without understanding what their capabilities are we can’t know what safety measures are needed or whether those measures are succeeding.
For what it’s worth, though, I have no particular plans to publish an official benchmark or eval, although if a member of my team is excited to work on that I’ll support it.