It’s evidence to the extent that the mere fact of publishing Figure 7 (hopefully) suggests the authors, who likely know the relevant OpenAI internal research, didn’t believe that the reasoning model’s pass@10K result is much worse than the language-monkey pass@10K result for the underlying non-reasoning model. So maybe it isn’t actually worse.