I feel like I get so many of these bonkers inferences that there’s something interesting here to reconcile with the brilliance they show in other moments.
They’re still bad at generalizing out of distribution. Tons of data are shoveled into them, and they are trained to produce reasonably good (or very good) reasoning outputs with this data (etc.), but put them OOD and they break.
Of course, there’s much more juice there to be figured out, but I still think that this is a good, if simplistic, model. (See also: hyperpolation)
An example from my recent experience is getting Claude to use some relatively uncommon CLI tools, with documentation in the repo. It would try running non-existent (but reasonable-looking) commands with non-existent (but reasonable-looking) arguments. It would try like 6 times and only then look at the documentation. Then it would repeat the mistake, using the same non-existent commands, a few vibe-code steps later.
The “use uncommon tools” example is familiar. Last year, I was really amazed by what Claude/Cursor could do in primary coding tasks, then appalled by how poorly that transferred when I asked it to work with Jupyter/iPython notebooks via MCP. We’d be working on a notebook for 30 min, then it would screw up a tool call, conclude the notebook had been deleted, and attempt to create it fresh. This happened repeatedly. It’s just not the kind of mistake a human would make, which gets back to the question: how exactly do these minds work and form models of the world?