AI Benchmarking

Tag

Last edit: 16 Jul 2023 14:12 UTC by rybolos

The real reason AI benchmarks haven’t reflected economic impacts

Noosphere89

15 Apr 2025 13:44 UTC

15 points

0 comments 1 min read LW link

(epoch.ai)

Introducing BenchBench: An Industry Standard Benchmark for AI Strength

Jozdien

2 Apr 2025 2:11 UTC

49 points

0 comments 2 min read LW link

FrontierMath Score of o3-mini Much Lower Than Claimed

YafahEdelman

17 Mar 2025 22:41 UTC

61 points

7 comments 1 min read LW link

Broken Benchmark: MMLU

awg

29 Aug 2023 18:09 UTC

24 points

5 comments 1 min read LW link

(www.youtube.com)

AI benchmarking has a Y-axis problem

Lizka

6 Feb 2026 7:45 UTC

79 points

3 comments 7 min read LW link

Improving Model-Written Evals for AI Safety Benchmarking

Sunishchal Dev and Marius Hobbhahn

15 Oct 2024 18:25 UTC

30 points

0 comments 18 min read LW link

I’m confused by the change in the METR trend

Expertium

3 Mar 2026 11:30 UTC

46 points

17 comments 2 min read LW link

“Superhuman” Isn’t Well Specified

JustisMills

3 May 2025 23:42 UTC

34 points

9 comments 3 min read LW link

(justismills.substack.com)

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

Arjun Panickssery and agg

15 Jan 2024 21:21 UTC

33 points

0 comments 1 min read LW link

LLM Psychometrics: A Speculative Approach to AI Safety

pskl

29 Jan 2024 18:38 UTC

3 points

4 comments 1 min read LW link

(pascal.cc)

Closed-ended questions aren’t as hard as you think

electroswing

19 Feb 2025 3:53 UTC

6 points

0 comments 3 min read LW link

Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents

Sam F. Brown, BasilLabib, Codruta (Coco) Lugoj and Sai Sasank Y

22 Jul 2024 12:33 UTC

20 points

0 comments 14 min read LW link

CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Annapurna

12 Jun 2025 19:53 UTC

8 points

0 comments 1 min read LW link

(arxiv.org)

LLMs Suck at Deep Thinking Part 3 - Trying to Prove It (fixed)

Taylor G. Lunt

27 Sep 2025 14:54 UTC

17 points

7 comments 15 min read LW link

Workshop Report: Why current benchmarks approaches are not sufficient for safety?

Tom DAVID and Pierre Peigné

26 Nov 2024 17:20 UTC

3 points

1 comment 3 min read LW link

ARC-AGI-2 human baseline surpassed (updated)

Tim H

12 Dec 2025 0:10 UTC

21 points

3 comments 2 min read LW link

Reasons to care about Canary Strings

Alice Blair

5 Dec 2025 21:41 UTC

27 points

3 comments 2 min read LW link

Maybe benchmarks should be broken?

Jonathan Gabor

17 Feb 2026 19:49 UTC

24 points

2 comments 1 min read LW link

(jonathanpgabor.substack.com)

The Narrative Adherence Exam (NAE-15)

Max Brown

2 Feb 2026 23:38 UTC

1 point

0 comments 4 min read LW link

Every Benchmark is Broken

Jonathan Gabor

24 Jan 2026 2:42 UTC

95 points

0 comments 4 min read LW link

(jonathanpgabor.substack.com)

[Question] Anthropic Is Going All In On Ability Without Intelligence?

Chapin Lenthall-Cleary

7 Aug 2025 5:54 UTC

2 points

0 comments 2 min read LW link

Unpacking Multimodal Data Leakage, Broken Benchmarks, and the Hessian Fallacy

Xenomirant

18 Apr 2026 1:16 UTC

8 points

0 comments 8 min read LW link

First Certified Public Solve of Observer’s False Path Instability — Level 4 (Advanced Variant) — Walter Tarantelli — 2025-05-30 UTC

Walter Tarantelli

31 May 2025 1:41 UTC

1 point

0 comments 2 min read LW link

An Empirical Review of the Animal Harm Benchmark

lukasgebhard

1 Mar 2026 18:20 UTC

16 points

0 comments 1 min read LW link

(forum.effectivealtruism.org)

LLMs Still Suck at Logical Reasoning

anovikov

18 Jul 2025 18:35 UTC

1 point

0 comments 2 min read LW link

AGI’s Last Bottlenecks

adamk

23 Oct 2025 3:28 UTC

17 points

2 comments 9 min read LW link

In-Context Scheming: A Run is Worth a Thousand Words

noise-field

7 Mar 2025 2:47 UTC

10 points

0 comments 1 min read LW link

(github.com)

AI Epistemic Gain

Generoso Immediato

12 Aug 2025 14:03 UTC

0 points

0 comments 10 min read LW link

Large Language Models Pass the Turing Test

Matrice Jacobine

2 Apr 2025 5:41 UTC

6 points

0 comments 1 min read LW link

(arxiv.org)

A Guide For LLM-Assisted Web Research

nikos, dschwarz, Lawrence Phillips and FutureSearch

26 Jun 2025 18:39 UTC

46 points

3 comments 7 min read LW link

Building AI safety benchmark environments on themes of universal human values

Roland Pihlakas

3 Jan 2025 4:24 UTC

18 points

3 comments 12 min read LW link

(docs.google.com)

Revealing alignment faking with a single prompt

Florian_Dietz

29 Jan 2025 21:01 UTC

9 points

5 comments 4 min read LW link

Some lessons from the OpenAI-FrontierMath debacle

7vik

19 Jan 2025 21:09 UTC

71 points

9 comments 4 min read LW link

Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

Roland Pihlakas

22 Jun 2025 18:16 UTC

17 points

0 comments 7 min read LW link

Edge Cases in AI Alignment

Florian_Dietz

24 Mar 2025 9:27 UTC

19 points

3 comments 4 min read LW link

Detailed Ideal World Benchmark

Knight Lee

30 Jan 2025 2:31 UTC

5 points

2 comments 2 min read LW link

Smarter Models Lie Less

Expertium

20 Jun 2025 13:31 UTC

6 points

0 comments 2 min read LW link

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris

27 Sep 2023 17:54 UTC

18 points

3 comments 4 min read LW link

(medium.com)

Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns

Roland Pihlakas

28 Dec 2025 21:53 UTC

14 points

0 comments 8 min read LW link

Mapping AI Capabilities to Human Expertise on the Rosetta Stone (Epoch Capabilities Index)

Laura Domenech and Jérémy Andréoletti

9 Mar 2026 17:09 UTC

18 points

2 comments 6 min read LW link

Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

Roland Pihlakas, Sruthi Kuriakose and shrutidattagupta

16 Mar 2025 23:23 UTC

45 points

8 comments 16 min read LW link

SWE-Bench Pro is even worse

Jonathan Gabor

24 Feb 2026 22:51 UTC

24 points

0 comments 1 min read LW link

(jonathanpgabor.substack.com)
