I feel like I see so many of these bonkers inferences that there’s something interesting to reconcile here with the brilliance they show in other moments.
They’re still bad at generalizing out of distribution. Tons of data are shoveled into them, and they are trained to produce reasonably good (or very good) reasoning outputs with this data (etc.), but put them OOD and they break.
Of course, there’s much more juice there to be figured out, but I still think that this is a good, if simplistic, model. (See also: hyperpolation)
An example from my recent experience is getting Claude to use some relatively uncommon CLI tools, with documentation in the repo. It would try running non-existent (but reasonable-looking) commands with non-existent (but reasonable-looking) arguments. It would try something like six times and only then look into the documentation. A few vibe-code steps later it would repeat the mistake, reaching for the same non-existent commands.
The “use uncommon tools” example is familiar. Last year, I was really amazed by what Claude/Cursor could do in primary coding tasks, then appalled by how poorly that transferred to asking it to work with Jupyter/iPython notebooks via MCP. We’d have been working on a notebook for 30 min, then it would screw up the tool call, conclude the notebook had been deleted, and attempt to create it fresh. This happened repeatedly. It’s just not the kind of mistake a human would make, which gets back to the question of how exactly these minds work and form models of the world.
We’re also bad OOD and many of our supposed advantages over them boil down to our distribution differences (embodiment and first-person-first data). I agree we’re much better OOD than them but not so much that I think there’s no comparison. As usual I’m skipping over my ideas for ways to improve them.
We’re also bad OOD and many of our supposed advantages over them boil down to our distribution differences (embodiment and first-person-first data).
Kind of and yeah?
I agree we’re much better OOD than them but not so much that I think there’s no comparison.
I wouldn’t say “there’s no comparison”[1], but I do think it looks like a “qualitative” difference. What exactly it is would require a more involved explication of the concept, which might be infohazardous.
Not really my way of speaking about this sort of stuff / I’m not sure what you mean by this.