AI Benchmarking

Tag · Last edit: 16 Jul 2023 14:12 UTC by rybolos

Broken Benchmark: MMLU

awg · 29 Aug 2023 18:09 UTC
23 points
5 comments · 1 min read · LW link
(www.youtube.com)

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

15 Jan 2024 21:21 UTC
33 points
0 comments · 1 min read · LW link

LLM Psychometrics: A Speculative Approach to AI Safety

pskl · 29 Jan 2024 18:38 UTC
3 points
4 comments · 1 min read · LW link
(pascal.cc)

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris · 27 Sep 2023 17:54 UTC
14 points
2 comments · 4 min read · LW link
(medium.com)