Defender7762

Karma: 0

Defender7762 17 Apr 2025 12:10 UTC
1 point
0
in reply to: Defender7762’s comment on: Defender7762’s Shortform
Click to expand all questions and answers for all models.

Defender7762 17 Apr 2025 12:09 UTC
1 point
0
on: Defender7762’s Shortform
Anti-fitting generalized reasoning test for o3h/o4 mh: https://llm-benchmark.github.io/ (background: https://www.lesswrong.com/posts/CEHsJzBCmuhEDdNxg/debunk-the-myth-testing-the-generalized-reasoning-ability-of)

Disappointing. I thought it would be much better than Grok; it seems this version cannot be the one shown by ARC AGI in mid-December.

Defender7762 12 Apr 2025 10:43 UTC
1 point
0
in reply to: Robert Cousineau’s comment on: Debunk the myth -Testing the generalized reasoning ability of LLM
Thank you very much for your advice! You can click on the question and model name window to expand the answers of all models. Additionally, there is a commented-out ability calculator in the website’s source code. The ‘50 times’ I mentioned refers to the probability derived from the normal distribution.

The ‘Time’ column represents the difficulty level of problems that the model can reliably solve, based on how long it would take a human to solve them. Longer times indicate more challenging problems. The standard deviation indicates the percentage of STEM individuals who can successfully solve the problem, following a normal distribution. A standard deviation of 0 implies that nearly 100% of the STEM population can solve such problems.
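
For readers who want to see how a standard-deviation figure can be turned into a population percentage, here is a rough sketch. It is not the site's commented-out calculator, and the function names `upper_tail` and `rarity_ratio` are made up for illustration. It assumes the textbook upper tail of a standard normal distribution, under which a value of 0 corresponds to 50% of the population, so the benchmark's own scale (where 0 means nearly 100%) presumably uses a different zero point that is not stated here.

```python
import math


def upper_tail(z: float) -> float:
    """Fraction of a standard normal population lying at or above z.

    Assumption: textbook convention, not necessarily the benchmark's scale.
    """
    return 0.5 * math.erfc(z / math.sqrt(2))


def rarity_ratio(z_low: float, z_high: float) -> float:
    """How many times rarer people at z_high are than people at z_low."""
    return upper_tail(z_low) / upper_tail(z_high)


if __name__ == "__main__":
    # Print the share of the population at or above each threshold.
    for z in (0.0, 1.0, 2.0, 3.0):
        print(f"z = {z:.1f}: {upper_tail(z):.4%} of the population")
    # A 'times' comparison is a ratio of two such tail probabilities.
    print(f"ratio between z=1 and z=3: {rarity_ratio(1.0, 3.0):.1f}")
```

A statement like ‘N times’ can be read as a ratio of two tail probabilities of this kind, though which two thresholds the site compares is not given in the comment.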