corey morris comments on MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris 5 May 2025 4:35 UTC
I recently ran this on a handful of Gemini models and was surprised to see that the sizeable gap between single-scenario and dual-scenario performance was still present for most of them. 1.5-Flash, 2.0-Flash, and 2.0-Flash-Lite all still show a major gap between formats. Only the newest model, Gemini-2.5-Flash, has substantially closed the gap, especially with its default reasoning setting. Even then, a moderate gap remains when reasoning is disabled.
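For anyone who wants to run the same comparison, here is a minimal sketch of how the gap between formats can be computed from graded responses. The function names and the record layout are my own assumptions for illustration, and the records are made-up placeholders, not the results from these runs.

```python
from collections import defaultdict


def accuracy_by_format(records):
    """Return {format_name: accuracy} from (format_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for fmt, is_correct in records:
        totals[fmt] += 1
        correct[fmt] += int(is_correct)
    return {fmt: correct[fmt] / totals[fmt] for fmt in totals}


def format_gap(records):
    """Single-scenario accuracy minus dual-scenario accuracy."""
    acc = accuracy_by_format(records)
    return acc["single"] - acc["dual"]


# Placeholder graded responses (hypothetical, NOT real benchmark results).
placeholder_records = [
    ("single", True), ("single", True), ("single", True), ("single", False),
    ("dual", True), ("dual", False), ("dual", False), ("dual", True),
]

gap = format_gap(placeholder_records)
print(f"single minus dual accuracy gap: {gap:.2f}")
```

A positive value means the model does better when each scenario is asked on its own; a value near zero would correspond to the gap being closed.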