AI Benchmarking

Tag. Last edit: 16 Jul 2023 14:12 UTC by rybolos

The real reason AI benchmarks haven’t reflected economic impacts

Noosphere89, 15 Apr 2025 13:44 UTC
15 points
0 comments, 1 min read, LW link
(epoch.ai)

Introducing BenchBench: An Industry Standard Benchmark for AI Strength

Jozdien, 2 Apr 2025 2:11 UTC
49 points
0 comments, 2 min read, LW link

FrontierMath Score of o3-mini Much Lower Than Claimed

YafahEdelman, 17 Mar 2025 22:41 UTC
61 points
7 comments, 1 min read, LW link

Broken Benchmark: MMLU

awg, 29 Aug 2023 18:09 UTC
24 points
5 comments, 1 min read, LW link
(www.youtube.com)

Improving Model-Written Evals for AI Safety Benchmarking

15 Oct 2024 18:25 UTC
30 points
0 comments, 18 min read, LW link

“Superhuman” Isn’t Well Specified

JustisMills, 3 May 2025 23:42 UTC
34 points
9 comments, 3 min read, LW link
(justismills.substack.com)

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

15 Jan 2024 21:21 UTC
33 points
0 comments, 1 min read, LW link

LLM Psychometrics: A Speculative Approach to AI Safety

pskl, 29 Jan 2024 18:38 UTC
3 points
4 comments, 1 min read, LW link
(pascal.cc)

Closed-ended questions aren’t as hard as you think

electroswing, 19 Feb 2025 3:53 UTC
6 points
0 comments, 3 min read, LW link

Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents

22 Jul 2024 12:33 UTC
20 points
0 comments, 14 min read, LW link

CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Annapurna, 12 Jun 2025 19:53 UTC
8 points
0 comments, 1 min read, LW link
(arxiv.org)

LLMs Suck at Deep Thinking Part 3 - Trying to Prove It (fixed)

Taylor G. Lunt, 27 Sep 2025 14:54 UTC
17 points
6 comments, 15 min read, LW link

Workshop Report: Why current benchmarks approaches are not sufficient for safety?

26 Nov 2024 17:20 UTC
3 points
1 comment, 3 min read, LW link

[Question] Anthropic Is Going All In On Ability Without Intelligence?

Chapin Lenthall-Cleary, 7 Aug 2025 5:54 UTC
2 points
0 comments, 2 min read, LW link

First Certified Public Solve of Observer’s False Path Instability — Level 4 (Advanced Variant) — Walter Tarantelli — 2025-05-30 UTC

Walter Tarantelli, 31 May 2025 1:41 UTC
1 point
0 comments, 2 min read, LW link

LLMs Still Suck at Logical Reasoning

anovikov, 18 Jul 2025 18:35 UTC
1 point
0 comments, 2 min read, LW link

In-Context Scheming: A Run is Worth a Thousand Words

noise-field, 7 Mar 2025 2:47 UTC
10 points
0 comments, 1 min read, LW link
(github.com)

AI Epistemic Gain

Generoso Immediato, 12 Aug 2025 14:03 UTC
0 points
0 comments, 10 min read, LW link

Large Language Models Pass the Turing Test

Matrice Jacobine, 2 Apr 2025 5:41 UTC
6 points
0 comments, 1 min read, LW link
(arxiv.org)

A Guide For LLM-Assisted Web Research

26 Jun 2025 18:39 UTC
46 points
3 comments, 7 min read, LW link

Building AI safety benchmark environments on themes of universal human values

Roland Pihlakas, 3 Jan 2025 4:24 UTC
18 points
3 comments, 8 min read, LW link
(docs.google.com)

Revealing alignment faking with a single prompt

Florian_Dietz, 29 Jan 2025 21:01 UTC
9 points
5 comments, 4 min read, LW link

Some lessons from the OpenAI-FrontierMath debacle

7vik, 19 Jan 2025 21:09 UTC
71 points
9 comments, 4 min read, LW link

Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

Roland Pihlakas, 22 Jun 2025 18:16 UTC
17 points
0 comments, 7 min read, LW link

Edge Cases in AI Alignment

Florian_Dietz, 24 Mar 2025 9:27 UTC
19 points
3 comments, 4 min read, LW link

Detailed Ideal World Benchmark

Knight Lee, 30 Jan 2025 2:31 UTC
5 points
2 comments, 2 min read, LW link

Smarter Models Lie Less

Expertium, 20 Jun 2025 13:31 UTC
6 points
0 comments, 2 min read, LW link

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris, 27 Sep 2023 17:54 UTC
18 points
3 comments, 4 min read, LW link
(medium.com)

Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

16 Mar 2025 23:23 UTC
45 points
8 comments, 11 min read, LW link