I just tried multiplying 13-digit numbers with o3-mini (high).
My approach was to ask it to explain a basic multiplication algorithm to me, and then carry it out.
On the first try it was lazy and didn’t actually follow the algorithm (it just told me “it would take a long time to actually carry out all the shifts and multiplications...”), and it got the result wrong.
Then I told it to follow the algorithm, even if it is time-consuming, and it did, and the result was correct.
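For concreteness, the algorithm in question is ordinary schoolbook long multiplication: one partial product per digit of the second factor, shifted by that digit’s place value, then summed. A minimal Python sketch of my own (an illustration, not the model’s output):

```python
def schoolbook_multiply(a: int, b: int) -> int:
    """Schoolbook long multiplication, digit by digit with explicit carries."""
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    total = 0
    for i, bd in enumerate(b_digits):
        carry = 0
        partial = 0
        for j, ad in enumerate(a_digits):
            carry, digit = divmod(ad * bd + carry, 10)
            partial += digit * 10 ** j
        partial += carry * 10 ** len(a_digits)  # leftover carry
        total += partial * 10 ** i  # shift by the place value of b's digit
    return total

assert schoolbook_multiply(1234567890123, 9876543210987) == 1234567890123 * 9876543210987
```

Executing this mechanically is tedious but needs no insight, which is exactly what makes it a good test of whether the model will follow a procedure rather than guess.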
So I’m not sure about the take that

“The fact that something that has ingested the entirety of human literature can’t figure out how to generalize multiplication past 13 digits is actually a sign of the fact that it has no understanding of what a multiplication algorithm is.”
The model got lazy, did some part of the calculation “in its head” (i.e. not actually following the algorithm but guesstimating the result, as we would do if asked to perform such a task without pencil and paper), and got the result slightly wrong. But when asked to actually follow the multiplication algorithm it had just explained, it can absolutely do it.
I’d be interested in the CoT that led to the incorrect conclusion. If the model actually believed that its lazy estimation would lead to the correct result, that shows it is overestimating its own capabilities; one could call this a fundamental misunderstanding of multiplication. I know that I’m likely to be wrong when I estimate a result in my head, precisely because I understand something about multiplication. Or one could call it a failure of introspection.
The other possibility is that it simply didn’t care about producing an entirely correct result and got lazy.
Yeah, the same happens if you ask r1 to do it. It reasons that doing it manually would be too time-consuming and starts trying to find clever workarounds, assuming that there must be some hidden pattern in the numbers that lets you streamline the computation.
Which makes perfect sense: reasoning models weren’t trained to be brute-force calculators, they were trained to solve clever math puzzles. So, given the distribution of problems they faced, it’s a perfectly reasonable assumption that doing it the brute-force way is the “wrong” way.
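For contrast, the “clever” end of the spectrum looks like Karatsuba’s trick, which replaces one of the four half-size multiplications with extra additions. A rough Python sketch (my illustration of the kind of shortcut the models reach for, not anything r1 actually produced):

```python
def karatsuba(x: int, y: int) -> int:
    """Karatsuba multiplication: three recursive multiplies instead of four."""
    if x < 10 or y < 10:
        return x * y
    n = max(len(str(x)), len(str(y))) // 2
    x_hi, x_lo = divmod(x, 10 ** n)
    y_hi, y_lo = divmod(y, 10 ** n)
    a = karatsuba(x_hi, y_hi)  # high halves
    c = karatsuba(x_lo, y_lo)  # low halves
    b = karatsuba(x_hi + x_lo, y_hi + y_lo) - a - c  # cross terms via one multiply
    return a * 10 ** (2 * n) + b * 10 ** n + c
```

Note that even this trick still requires carrying out the remaining arithmetic exactly; assuming a hidden pattern that avoids the work altogether is where the models go wrong.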