Interesting. What does the distribution of errors look like for numerical questions? Are the answers often close even when they’re wrong?
This might not be quite the right test. I’m wondering whether transformers learn redundant, approximate encodings of information, with model size improving the approximation. But we could have intermediate numerical steps that are “close” while the final answer is far off, so my test might not be great. I also don’t really know why they’d be incentivised to learn redundant approximate encodings without also being incentivised to learn redundant precise encodings.
I think the distribution of errors for numerical n-hop questions is going to be uninteresting/random most of the time, because the only questions with numerical answers are either “day of month X was born” or “number of counties in US state X”, where there isn’t any real reason for AIs to be close if they are wrong (either about X or about the property of X).
However, I was interested in this question for “adding the result of N 1-hop questions”, so I got Opus to do some additional analysis (for Gemini 3 Pro with 300 filler tokens):
The most interesting result here is that even with 6 addends, the model gets 75% of answers within ±10, even though the answers are pretty big (the median is 288 and many answers are much larger).
Also, for some reason the model is systematically low, especially for 3 and 4 addends. Maybe because it sometimes skips one of the numbers?
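For concreteness, here’s a minimal sketch of the kind of error-distribution check being discussed (fraction of answers within ±10 and mean signed error per addend count). The record layout and the numbers are made-up placeholders, not the actual data or analysis code:

```python
# Sketch of the error-distribution analysis, assuming hypothetical per-question
# records of (number of addends, model's numeric answer, true sum).
from statistics import mean, median

# (n_addends, model_answer, true_sum) -- placeholder data for illustration only
records = [
    (3, 120, 131),
    (3, 98, 98),
    (4, 240, 251),
    (6, 283, 288),
    (6, 410, 402),
]

for n in sorted({r[0] for r in records}):
    errs = [ans - true for k, ans, true in records if k == n]
    trues = [true for k, _, true in records if k == n]
    within_10 = sum(abs(e) <= 10 for e in errs) / len(errs)
    print(
        f"{n} addends: {within_10:.0%} within ±10, "
        f"mean signed error {mean(errs):+.1f} "   # negative => systematically low
        f"(median true answer {median(trues)})"
    )
```

A negative mean signed error for a given addend count would show up here as the “systematically low” pattern mentioned above.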