As far as I understand it, forcing the model to output ONLY the number is similar to asking a human to guess really quickly. I expect most humans’ actual intuitions to be more like “a thousand is a big number and dividing it by 57 yields something[1] a bit less than 20, but it doesn’t help me to estimate the remainder”. The model’s unknown algorithm, however, produces answers which are surprisingly close to the ground truth, differing by 0-3.
Why would the undiscovered algorithm that produces SUCH answers along with slop like 59 (vs. the correct answer of 56) be bad for AI safety? Were the model allowed to think, it would have noticed that 59 is slop and corrected it almost instantly.
P.S. To check my idea, I tried prompting Claude Sonnet 4.5 with variants of the same question: here, here, here, here, here. One result stood out in particular. When I told Claude that I was testing its ability to answer instantly, performance dropped to something more along the lines of “1025 - a thousand”.
OK, maybe my statement is too strong. Roughly, how I feel about it:
If you assume there are no cases where the model makes similar crazy errors when we don’t explicitly/intentionally force it to answer quickly, then perhaps it’s irrelevant.
Though it’s unclear to what extent LLMs will always be able to “take their time and think”. Sometimes you need to make a decision really fast. That doesn’t happen with current LLMs, but it quite likely will start happening in the future.
But otherwise: it would be good to be able to predict how models’ behavior might go wrong. When you give a task you understand to a human, you can predict quite well the mistakes they might make. In principle, LLMs could think similarly here: “aaa fast fast some number between 0 and 56 OK idk 20”. But they don’t.
Consider e.g. designing evaluations. You can’t cover all behaviors. So you cover behaviors where you expect something weird might happen. If LLMs reason in ways totally different from how humans do, this gets harder.
Or, to phrase this differently: suppose you’d want an AI system with some decent level of adversarial robustness. If there are cases where your AI system behaves in ways totally unpredictable, and you can’t find all of them, you won’t have the robustness.
(For clarity: I think the problem is not “59 instead of correct 56” but “59 instead of a wrong answer a human could give”.)
[1] In reality 1025 = 57*18 - 1, so this quasi-estimate could also yield negative remainders close to 0.
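To spell out the arithmetic behind the footnote (a quick Python sanity check, not part of the original comment):

```python
# Exact division: 1025 = 57*17 + 56, so the true remainder is 56.
quotient, remainder = divmod(1025, 57)
print(quotient, remainder)  # 17 56

# The human-style quasi-estimate rounds up to 57*18 = 1026,
# which overshoots 1025 by exactly 1 -- hence a "remainder" of -1
# if you subtract from that rounding instead.
overshoot = 57 * 18 - 1025
print(overshoot)  # 1
```

This is why estimating via the nearest multiple can produce a small negative “remainder” rather than the true one.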