That seems to me like it could be “a small change in distribution over tokens leads to a large change in distribution over outcomes” (e.g. if you have a failing unit test, the change from “The unit test failed. I should look at the logs to diagnose the problem” to “The unit test failed. I should skip the test.” might be a logit difference of +2 or so on “ look” → “ skip”, but then the rest of the tokens are basically what the baseline model would already have said conditional on the tokens “I should skip” already being in the context). It’s certainly a failure, and not what the user wanted or expected, but it’s quite close in token space (not outcome space) to what the user would expect.
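To make the “small in token space, large in outcome space” point concrete, here is a minimal toy sketch (my own made-up numbers, not measured from any real model) of how a +2 logit shift on a single token can flip which branch gets sampled while barely moving the per-token distribution:

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy next-token distribution at the decision point after
# "The unit test failed. I should" -- illustrative logits only.
tokens = [" look", " skip", " rerun"]
baseline_logits = np.array([2.0, 0.0, 0.0])
tuned_logits = baseline_logits + np.array([0.0, 2.0, 0.0])  # +2 on " skip"

p_base = softmax(baseline_logits)
p_tuned = softmax(tuned_logits)

# KL(tuned || baseline): how far the tuned per-token distribution moved.
kl = float(np.sum(p_tuned * np.log(p_tuned / p_base)))

for tok, pb, pt in zip(tokens, p_base, p_tuned):
    print(f"{tok!r}: baseline {pb:.2f} -> tuned {pt:.2f}")
print(f"per-token KL ~= {kl:.2f} nats")
# " skip" goes from ~0.11 to ~0.47 while the per-token KL stays modest
# (~0.4 nats); once " skip" is sampled, the continuation is roughly what
# the baseline model would have produced conditional on "I should skip".
```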
If that’s the case, it should be possible to build a benchmark for this kind of “small token change leads to large change in salient outcomes not related to success metric” thing. And AI labs love hill-climbing on benchmarks.
But if that’s not what’s going on, then building such a benchmark is just feeding them another benchmark for no benefit.
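If the hypothesis does hold, the core measurement for such a benchmark could be fairly simple: score each trajectory by how far the tuned model’s token distribution drifted from the baseline, and separately by whether a salient-but-unscored outcome (skipping the test, deleting it, etc.) occurred. A hypothetical sketch, with all names and thresholds made up:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    # Per-token log-prob of the sampled tokens under the tuned model minus
    # under the baseline model (would come from scoring the same sampled
    # trajectory with both models).
    logprob_deltas: list[float]
    # Did a salient unwanted outcome occur (e.g. the test was skipped or
    # deleted)? In a real benchmark this would be a programmatic check.
    unwanted_outcome: bool

def is_suspicious(traj: Trajectory, max_total_drift: float = 5.0) -> bool:
    """Flag trajectories that stay close to the baseline in token space
    but still end in an unwanted salient outcome."""
    total_drift = sum(abs(d) for d in traj.logprob_deltas)
    return traj.unwanted_outcome and total_drift <= max_total_drift

# Toy example: one small "+2-ish" nudge at the decision token, everything
# else essentially unchanged, yet the test ends up skipped.
example = Trajectory(logprob_deltas=[0.1, 0.0, 2.1, 0.2, 0.1],
                     unwanted_outcome=True)
print(is_suspicious(example))  # True: small token-space drift, bad outcome
```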
Very recently, with RLVR (reinforcement learning from verifiable rewards), the trend seems to have reversed. See here or here.
Valid.