Something I’m wrestling with on this project is the balance between testing the models’ ability to do science (which I want to do) and finding ways to make them better at doing science (which I basically don’t want to do and especially don’t want to publish). Doing a lot of iteration on improving scaffolding feels to me like it starts to tip over into the latter (whereas doing bog-standard few-shotting or fine-tuning doesn’t).
To be clear, I don’t have strong reason to expect that we’d find approaches that are significant boosts to what’s already out there. But it could happen, and I’m trying to be cautious about that, in the interest of not further accelerating capabilities improvements.
For benchmarks that measure markers of AI progress, I strongly suspect that publishing the benchmark and/or positive AI results on it pushes capabilities much more than publishing simple scaffolding + fine-tuning solutions that do well on it.
Examples:
The exact scaffolding used by Sakana AI did not advance AGI capabilities nearly as much as the common knowledge it created that LLMs can somewhat do end-to-end science.
No amount of scaffolding that the ARC-AGI or FrontierMath teams could build would have as much of an impact on AGI capabilities as the benchmarks themselves. Those benchmark results basically validated that the direction OpenAI is taking is broadly correct, and I suspect many people who weren’t fully sold on test-time compute will now change strategies as a result.
Hard benchmarks of meaningful tasks serve as excellent metrics for measuring progress, which is great for capabilities research. Of course, they are also very useful for making decisions that need to be informed by accurate tracking or forecasting of capabilities.
Whether making hard, meaningful benchmarks such as FrontierMath, ARC-AGI, and LLM science is net negative or positive is unclear to me (a load-bearing question is whether the big AGI labs already have internal benchmarks as good as these that they can use instead). I do think, however, that you’d have to be extraordinarily good at designing scaffolding (and fine-tuning and the like), and even then spend far too much effort on it, for the scaffolding itself to do significant harm rather than the benchmark the scaffolding was designed for.
For benchmarks that measure markers of AI progress, I strongly suspect that publishing the benchmark and/or positive AI results on it pushes capabilities much more than publishing simple scaffolding + fine-tuning solutions that do well on it.
You may be right. That said, I’m pretty skeptical of fully general arguments against testing what LLMs are capable of; without understanding what their capabilities are we can’t know what safety measures are needed or whether those measures are succeeding.
For what it’s worth, though, I have no particular plans to publish an official benchmark or eval, although if a member of my team is excited to work on that I’ll support it.
Got it.