Doing well on SimpleQA could also be evidence of benchmark contamination in the training data.
The “impossible scenario of actual time travel” [forward in time] is pretty funny. Thanks for replicating the key findings from the post.