Which to me seems like a huge improvement to its capabilities
This was actually my position when I started writing this post. My instincts told me that “thinking out loud” was a big enhancement to its capabilities. But then I started thinking about what I saw. I watched it spend tens of trillions of FLOPs to write out, in English, how to do a 3x3 matrix multiplication. It was so colossally inefficient, like building a humanoid robot and teaching it to use an abacus.
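To put a rough number on that inefficiency, here is a minimal sketch that counts the arithmetic operations in a naive 3x3 matrix multiplication. The function name `matmul_with_flop_count` and the example matrices are my own illustration, and I am counting each multiply and each add as one operation.

```python
def matmul_with_flop_count(a, b):
    """Multiply two square matrices naively, counting multiplies and adds."""
    n = len(a)
    flops = 0
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # First term of the dot product: one multiply.
            total = a[i][0] * b[0][j]
            flops += 1
            # Each remaining term costs one multiply and one add.
            for k in range(1, n):
                total += a[i][k] * b[k][j]
                flops += 2
            c[i][j] = total
    return c, flops


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
product, flops = matmul_with_flop_count(a, identity)
print(flops)  # 45 (27 multiplies + 18 adds)
```

Taking "tens of trillions" as at least 10^13, that is hundreds of billions of times more arithmetic than the 45 operations the multiplication itself needs.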
Then again, your analogy to humans is valid. We do a huge amount of processing internally, and then have this incredibly inefficient communication mechanism called writing, which we then use to solve very hard problems!
My instincts point both ways on this, and I have nothing resembling rigorous proof one way or the other, so I’m pretty undecided.
Oh, this is funny. It told me that it ran the code and got the answer [64, 91, 39, 47]. I checked that these values satisfied the problem. But I didn’t check (until reviewing other comments) whether that’s actually what the code output. It’s not. Technically, the code doesn’t output anything; it saves the result to a variable instead. And when I print that variable, I get [64, 6, 96, 75].
Lesson 1: I was not careful enough in checking its output, even when I thought I was being careful.
Lesson 2: It is indeed not running code, even if it tells me it is.
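For anyone who wants to do this kind of check mechanically, here is a minimal sketch of running a snippet and looking at both what it prints and what it stores. The `snippet` string is a hypothetical stand-in (not the code from my session), and `run_and_capture` is a helper name I made up. Only use `exec` on code you are willing to run.

```python
import contextlib
import io

# Hypothetical stand-in for generated code that stores its answer
# in a variable instead of printing it.
snippet = "result = sorted([3, 1, 2])"


def run_and_capture(source):
    """Execute source, returning (captured stdout, resulting namespace)."""
    namespace = {}
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, namespace)
    return buffer.getvalue(), namespace


stdout_text, namespace = run_and_capture(snippet)
print(repr(stdout_text))    # '' because the snippet prints nothing
print(namespace["result"])  # [1, 2, 3]
```

Comparing the claimed answer against both of these, rather than only checking that the claimed answer is valid, is what would have caught the mismatch right away.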