Here’s something odd that I noticed in one of the examples in the blogpost (https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html).
The question is the one whose statement includes the condition "the variance of the first n natural numbers is 10". The model's output asserts, without showing any intermediate reasoning, that this variance equals (n^2 − 1)/12, which is correct. Since no derivation appears, I think it's safe to assume that the model memorized this formula.
This is not a formula that a random math student would be expected to have memorized. (Anecdotally, I have a mathematics degree and don’t know it.) Because of that, I’d expect that a typical (human) solver would need to derive the formula on the spot. It also strikes me as the sort of knowledge that would be unlikely to matter outside a contest, exam, etc.
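For what it's worth, the formula itself is easy to check. Here's a quick numerical sanity check (my own sketch, not from the blogpost), using the population variance, i.e. dividing by n:

```python
def variance(xs):
    """Population variance: mean squared deviation from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# The variance of 1, 2, ..., n should equal (n^2 - 1) / 12.
for n in range(2, 20):
    xs = list(range(1, n + 1))
    assert abs(variance(xs) - (n ** 2 - 1) / 12) < 1e-9

# In the problem itself, setting (n^2 - 1)/12 = 10 gives n^2 = 121, i.e. n = 11.
```

(The closed form follows from mean = (n+1)/2 and mean of squares = (n+1)(2n+1)/6, but the point stands: a human solver would typically have to work that out on the spot.)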
That all leads me to think that the model might be over-fitting somewhat to contest- and exam-style questions. By that I mean it might be memorizing facts that are useful for answering such questions but not for doing math more broadly.
To be clear, there are other aspects of the model output, here and in other questions, that seem genuinely impressive in terms of reasoning ability. But the headline accuracy rate might be inflated by memorization.