“But the models feel increasingly smarter!”:
It seems to me that “vibe checks” for how smart a model feels are easily gamed by giving the model a better personality.
My guess is that this accounts for most of why Sonnet 3.5.1 was so beloved: its personality was made much more appealing, compared to e.g. OpenAI’s corporate drones.
The recent upgrade to GPT-4o seems to confirm this: OpenAI appears to have merely given it a better personality, yet people were reporting that it “feels much smarter”.
Deep Research was this for me, at first. Some of its summaries were just pleasant to read; they felt so information-dense and intelligent! Not like typical AI slop at all! But then it turned out that most of it was just AI slop underneath anyway, and now my slop-recognition function has adjusted and the effect is gone.
My recent experiences with LLMs
I’ve been continuing to try to use LLMs to help with my research every few months, at least, to check whether they’re up to the task yet. From 2022 to 2024, the answer was a definitive “no”: informal research conversations would not go anywhere interesting, and asking LLMs to solve well-defined mathematical problems related to my work would yield fake math that only wasted my time as I hunted for the errors.
Claude 3.7 changed that. It was the first model I could dump a bunch of notes into, ask questions, and get reasonably accurate answers from. The value wasn’t so much about giving me new ideas: its attempted proofs were still BS, and its philosophical ideas still amateurish. The value was in helping me get back up to speed with my notes much faster than I could by skimming them directly.
Ordinarily, for a large project with a lot of notes, I might need anywhere from a day to a week to load everything back into my head before I can “pick up where I left off” and start making meaningful progress again. Claude 3.7 helped me dramatically reduce that loading time, because I could ask it questions about what is in the notes. Its answers wouldn’t be perfect (it still often “rounds down” the ideas to something a bit more clichéd), but they’d be good enough to remind me of what I had written & make it easier to search for the relevant passage.
Relatedly, I started taking a lot more notes. I’ve kept extensive notes on whatever I’m thinking about since high school, but now my notes are significantly more useful, so it makes sense to write down even more thoughts.
Then, Claude 4 came out. Coincidentally, I had a research idea I wanted to work on around the same time—a rough intuition that I wanted to turn into a proof. I put my notes into Claude 4 and started iterating on my ideas, with roughly the following workflow:
1. Dump all my notes into a Claude project, including the latest attempted proof sketch.
2. Ask Claude to complete the project based on the notes, first writing definitions, then assumptions, then the theorem statement, then the proof. (This includes giving Claude some advice about where to go next, beyond what’s in the notes: which ideas currently seem most promising to me? If I’ve hit a snag, I’ll try to describe the snag to Claude in as much detail as I can, a process which is often more useful than Claude’s response.) A rough sketch of scripting these first two steps appears after this list.
3. Claude will go off the rails at some point, but the parts which are already clear in my notes will usually be fine, and the first one or two steps beyond that might be fine as well, perhaps even containing good ideas.
4. Continue revising my personal notes, either based on some good ideas Claude might have had, or reacting to the badness of Claude’s ideas (seeing someone do it wrong can often be a helpful cue for doing it right).
5. This “gets me going”, and often I’ll write for a while with no AI assistance.
6. When I hit a snag, or just get a bit bored/distracted and think Claude might be able to infer the next part correctly, return to step 1.
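For concreteness, here’s a minimal sketch of what scripting steps 1–2 could look like with the Anthropic Python SDK. (I actually did this through the Claude app’s Projects feature rather than the API, so the file layout, prompt wording, guidance text, and model ID below are illustrative assumptions, not my actual setup.)

```python
# Sketch of steps 1-2: dump notes into the context, then ask Claude to continue
# the write-up (definitions -> assumptions -> theorem -> proof).
# Assumptions: notes live as markdown files in ./notes/, and the model ID below
# is a placeholder -- substitute whichever Opus 4 model ID you have access to.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Step 1: gather all notes, including the latest attempted proof sketch.
notes = "\n\n---\n\n".join(
    f"# {p.name}\n{p.read_text()}" for p in sorted(Path("notes").glob("*.md"))
)

# Hypothetical example of the step-2 "where to go next" advice.
guidance = (
    "The telescoping-sum idea currently seems most promising to me; "
    "I'm stuck on how to handle the base case."
)

# Step 2: ask Claude to complete the project based on the notes.
response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID
    max_tokens=4000,
    system="You are helping turn rough research notes into a rigorous proof.",
    messages=[
        {
            "role": "user",
            "content": (
                f"Here are my research notes, including the latest proof sketch:\n\n{notes}\n\n"
                f"Extra guidance on where to go next: {guidance}\n\n"
                "Please continue the project: first write out the definitions, then the "
                "assumptions, then the theorem statement, then attempt the proof."
            ),
        }
    ],
)

print(response.content[0].text)
```

Nothing fancier is needed, since the value comes from reading the output critically (steps 3–4) rather than trusting it.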
This overall process went quite well. I haven’t engaged in systematic testing, but I think it went much better than it would have gone with previous versions of Claude. (Perhaps I simply had a clearer task and better notes than in past attempts; I think there’s some of this. However, I believe I needed Claude 4 Opus to get this process to work so well, & Sonnet (whether 4 or 3.7) wasn’t up to the task.)
Occasionally, when I was especially stuck on something & Claude wasn’t helping me get unstuck, I would consult o3-pro. This was a mixed bag: o3-pro is even better than Claude at constructing BS proofs which look good on a quick read, invoking the right sorts of mathematical machinery, but which, on a close read, contain some critical error that amounts to assuming the conclusion. When questioned on the iffy steps, it can pull this trick recursively (to a greater extent than Claude). As a result, o3-pro can waste a lot of my time on a dead-end approach. However, it can also sometimes produce exactly the insight I need (or close enough to get me there) in cases where Claude is just making dumb mistakes.
In the context of the OP, my main point here is that I don’t think any of this can be explained as “personality improvements”. My sense is, rather, that this is somewhat similar to the improvements we’ve seen in image generation over time. Image (and video) generation can now do a few human figures well, but as you increase the number of figures, you’ll start to see the sorts of errors that AI images used to be famous for. There’s a sort of “resolution” which has gotten better and better, but there’s still always a “conceptual resolution limit”. LLMs can multiply two-digit numbers where once they could only multiply one-digit numbers, but ask for too many digits and you’ll quickly hit a limit.
Eisegesis is a better explanation for the sort of benefit I’m seeing, but eisegesis alone cannot explain the number of correct LaTeX formulas the AI generated for me. The reason eisegesis is working (where it wasn’t before) is that the “conceptual resolution” of the LLM has gotten fine enough to land somewhere close.
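To make the “conceptual resolution limit” a bit more concrete: the multiplication version of it is directly measurable. A rough sketch of such a probe, again assuming the Anthropic Python SDK with a placeholder model ID (the digit and trial counts are arbitrary):

```python
# Crude probe of the multiplication "resolution limit": accuracy vs. digit count.
import random

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def multiply_via_llm(a: int, b: int) -> str:
    """Ask the model for a single product and return its raw text answer."""
    resp = client.messages.create(
        model="claude-opus-4-20250514",  # assumed model ID
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": f"Compute {a} * {b}. Reply with only the final integer.",
        }],
    )
    return resp.content[0].text.strip()

for digits in (2, 4, 8, 16):
    trials, correct = 20, 0
    for _ in range(trials):
        a = random.randrange(10 ** (digits - 1), 10 ** digits)
        b = random.randrange(10 ** (digits - 1), 10 ** digits)
        answer = multiply_via_llm(a, b).replace(",", "")
        correct += answer == str(a * b)
    print(f"{digits}-digit factors: {correct}/{trials} correct")
```

The claim above predicts near-perfect accuracy at small digit counts that falls off sharply somewhere; where it falls off is the “resolution”.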
Yep, you did explicitly state that you expect LLMs to keep getting better along some dimensions; however, the quote I was responding to seemed too extreme in isolation. I agree that the vibe-bias is a thing (I’m susceptible to “sounding smart” too); I guess part of what I wanted to get across is that it really depends on how you’re testing these things. If you have a serious use-case involving real cognitive labor & you keep going back to that use-case when a new model is released, you’ll be much harder to fool with vibes.
& notably, it’s extremely slow to improve on some tasks (e.g. multiplication of long numbers, even though multiplication is quadratic in the number of tokens & transformers are also quadratic in the number of tokens).
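To spell out the scaling comparison in that parenthetical (a rough accounting, assuming schoolbook multiplication and vanilla quadratic attention): schoolbook multiplication of two $n$-digit numbers costs $\Theta(n^2)$ single-digit operations, while a transformer attending over the $\approx 2n$ digit tokens of the prompt pays $\Theta\big((2n)^2\big) = \Theta(n^2)$ compute per layer. So a single forward pass scales, in compute, the same way the algorithm itself does; the slow improvement presumably isn’t a matter of raw compute scaling.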
I somewhat think that conceptual resolution is still increasing about as quickly; it’s just that there are rapidly diminishing returns to conceptual resolution, because the distribution of tasks spans many orders of magnitude in conceptual-resolution-space. LLMs have adequate conceptual resolution for a lot of tasks, now, so even if conceptual resolution doubles, this just doesn’t “pop” like it did before.
(Meanwhile, my personal task-distribution has a conceptual resolution closer to the current frontier, so I am feeling very rapid improvement at the moment.)
Humans have almost-arbitrary conceptual resolution when needed (e.g. we can accurately multiply very long numbers if we need to), so many of the remaining tasks not conquered by current LLMs (e.g. professional-level math research) probably involve much higher conceptual resolution.