One slightly counterintuitive thing about this paper is how little it improves on the GSM8K dataset, given that it does very well on relatively advanced test sets.
GSM8K (Grade School Math 8K) is a set of word problems suitable for middle-schoolers. It has problems like:
“Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?”
“Randy has 60 mango trees on his farm. He also has 5 less than half as many coconut trees as mango trees. How many trees does Randy have in all on his farm?”
Minerva improves the SOTA on this, but only from 74.5% to 78.5%, a much smaller gain than on the harder test sets.
My innate / naive sense of how hard the MATH problems are would lead me to think you could get > 90% on GSM8K if you could get 50% on MATH. But obviously my gut sense is off.
I’d be really curious to know what’s going on here.
The previous SOTA for MATH (https://arxiv.org/pdf/2009.03300.pdf) is a fine-tuned GPT-2 (1.5B params), whereas the previous SOTA for GSM8K (https://arxiv.org/pdf/2203.11171.pdf) is PaLM (540B params), using a “majority voting” method similar to Minerva’s (query each question ~40 times, take the most common answer).
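For anyone unfamiliar with the “majority voting” idea, here is a rough Python sketch of it as described above: sample the same question many times and keep the most common final answer. The name `sample_answer` is hypothetical; it stands in for whatever sampled model call plus final-answer extraction you have, and is not an API from either paper. The default `k=40` just mirrors the “~40 times” figure mentioned above.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def majority_vote(
    sample_answer: Callable[[str], Hashable],
    question: str,
    k: int = 40,
) -> Tuple[Hashable, float]:
    """Query the sampler k times and return (most common answer, its vote share).

    `sample_answer` is a hypothetical callable: it should return one sampled
    final answer (e.g. a normalized string such as "72") per call.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Draw k independent samples for the same question.
    answers = [sample_answer(question) for _ in range(k)]
    # Count identical final answers and pick the most frequent one.
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / k
```

Note that this only compares final answers for exact equality, so in practice the answers would need to be normalized first (for example, so that "72" and "72.0" count as the same vote). How the papers handle that normalization is not covered here.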