AI Benchmarking

Tag

Last edit: 16 Jul 2023 14:12 UTC by rybolos

Broken Benchmark: MMLU

awg

29 Aug 2023 18:09 UTC

23 points

5 comments

1 min read

LW link

(www.youtube.com)

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

Arjun Panickssery and agg

15 Jan 2024 21:21 UTC

33 points

0 comments

1 min read

LW link

LLM Psychometrics: A Speculative Approach to AI Safety

pskl

29 Jan 2024 18:38 UTC

3 points

4 comments

1 min read

LW link

(pascal.cc)

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris

27 Sep 2023 17:54 UTC

14 points

2 comments

4 min read

LW link

(medium.com)
