I’m directionally in agreement with you (in terms of AI 2027’s methodology, I think your assessment is spot on), but some of these predictions feel ill-formed:
Large language models are still very stupid and make basic mistakes a 5-year-old would never make, as is true in 2025. Yet they are increasingly praised in the media for doing well on the SAT, math olympiad, etc., as in 2025.
I feel like this has already changed. The little tricks for breaking GPT-3 (“The ball is red, the cat is blue. Since the ball is blue, what color is the cat?”) all fail on GPT-5. I admittedly don’t spend much time playing with LLMs, but most of what I throw at them consists of moderately challenging problems that’d take an average CS grad from a decent school 10-60 minutes, and it reasons through the problems that come up about as well as a human would. Just yesterday, I tortured Google Gemini through a browser security problem that I am absolutely certain does not look like anything it’s seen before. It did cycle through three plausible-but-wrong solution modes initially, but with some moderately forceful nudges in the right direction, it produced a working one. About what I’d expect of a decent-but-not-brilliant human programmer fresh out of college.
If you’re referring to things like letter-counting, I’d argue that that’s not a fair comparison, on the basis that the input format would also make it incredibly difficult for a human to do those things. If you’re a reinforcement learning researcher/engineer/enthusiast, you know what I mean here—imagine looking at nothing but a pile of your input vectors, and trying to reason through their connection to the outputs with none of your human priors about what they mean, and what’s important to look at. Input featurization makes all the difference.
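To make the featurization point concrete, here’s a rough Python sketch using the tiktoken library (the word and the encoding name are just illustrative choices; any BPE tokenizer makes the same point): the model is handed a short list of integer token IDs, not letters.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one example encoding

word = "strawberry"
ids = enc.encode(word)  # what the model actually receives: integer token IDs
pieces = [enc.decode_single_token_bytes(i).decode("utf-8", errors="replace") for i in ids]

print(ids)     # a handful of integers
print(pieces)  # multi-character chunks, not individual letters
```

Counting letters from that input is closer to counting letters in a word you’ve only ever seen as a couple of opaque dictionary indices.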
AI still can’t tell novel funny jokes, write clever prose, generate great business ideas, invent new in-demand products, or generate important scientific breakthroughs, except by accident.
“By accident”, when working with a stochastic black-box model, doesn’t mean much. Aside from that, I’ve already seen moderately clever prose from LLMs tasked to emulate human writers. Not quite enjoyable to read, but, insofar as there’s an objective binary standard for “clever”, I’d say the bar is passed.
More meaningfully, “<AI will not> generate important scientific breakthroughs, except by accident” is hard to defend, but also hard to critique, because I said the exact same thing a few days ago. A mathematician rubber duck debugging his way to a solution for an open mathematical problem with an LLM has happened, but we can’t really say how much or how little of a shared thought process came from the LLM. Making a connection between papers that the human wouldn’t have made, for instance, or surfacing the concepts that need to be put together even if it can’t quite articulate how, can be the core of a solution.
AI still can’t drive a damned car well enough that if I bought a car I wouldn’t have to.
Aren’t we already there? I think the issue is mainly regulatory, at this point.
As for the architecture section, a lot of the objections feel like already-solved problems, or problems that never were. The lack of recurrence and the lack of depth, in practice, seem to be solved by attention and autoregression. It’s unintuitive, but that unintuitiveness was expressed in the original paper’s title, after all. You address this counterargument, to your credit, but remember that transformers don’t just have one multi-head attention (MHA) step. At every layer, the transformer ‘remembers’ (not exactly, given the new tokens that come afterwards, but perhaps closely enough) what it was thinking about each previously output token in relation to every other token, and those per-layer representations remain available as it produces each new token.
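If it helps to see that concretely, here’s a toy PyTorch sketch of the mechanism I’m gesturing at (untrained, toy dimensions, no MLP blocks or positional encodings, so nothing like a real model): a stack of causally masked attention layers in which every layer lets each position re-read the previous layer’s representation of every earlier token.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, seq_len = 64, 4, 6, 10

# A stack of attention layers, rather than a single MHA step.
layers = nn.ModuleList(
    [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
)

x = torch.randn(1, seq_len, d_model)  # stand-in token embeddings
# Boolean mask: True = "not allowed to attend", i.e. no peeking at future positions.
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

h = x
for attn in layers:
    # Each position attends over every non-future position's *current* hidden state,
    # so what one layer computed about a token is still readable by the next layer.
    out, _ = attn(h, h, h, attn_mask=causal)
    h = h + out  # residual connection keeps earlier computation around, too

print(h.shape)  # torch.Size([1, 10, 64])
```

The point isn’t that this is equivalent to recurrence; it’s that per-token, per-layer state from earlier in the sequence doesn’t vanish, and is still there for later positions to attend to.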
I’ll go on record saying that I don’t think neuralese is as much of a leap forwards from what we presently have, in terms of expressiveness, as it might appear.
AI girlfriends/boyfriends/companions do explode in popularity (the market at least doubles compared to now). It becomes common for particularly lonely young people to spend time interacting with them. This is driven as much by the loneliness epidemic as by AI progress, and the meaningful improvements that lead to this are in AI voice natural-ness and memory, not intelligence.
I’m in agreement, but I think multi-modality will be the big thing, here. The funniest possible outcome is that Elon figures out that he can double-dip on language-vision-audio-action models by getting them to operate Tesla robots in proof-of-concept demos and also getting them to generate natural poses for 3D waifu models that respond to the user’s words, tone, and general demeanor.
I’m happy to confirm that, as of January 2026, large language models still regularly make farcical mistakes when I use them on novel, real-world problems.
Maybe once a day for me, an LLM makes an extremely elementary mistake. I usually avoid giving examples because they’re coding-related and might not make sense if you don’t code (and because it’s easy to nitpick any example), but just yesterday, I was trying to replace a hardcoded SVG in my code with a HeroIcons React component.
The HeroIcons component matched almost exactly, except the stroke width was too thin on the new component. I wasn’t familiar with HeroIcons React components, so I asked Claude Opus 4.5 to make them match exactly in terms of thickness etc. Claude swore up and down there was no way to change the stroke of a HeroIcons component, and told me I’d have to “just use the inline SVG”, “accept the difference”, or simply “match by adjusting the inline SVG”. That is, in order to make them match, I should change not the new HeroIcon component, but the old SVG! That’s totally incoherent. After telling Claude that was nonsense, it went on to give me an overly verbose version of the correct answer (which was to use the stroke-2 TailwindCSS class).
A human being responding this way would be fireably incompetent. I think we tolerate it from LLMs because we’ve gotten used to it, and because they’re really fast, so if a query doesn’t work out, whatever, make a new one or solve your problem without LLMs. But yes, they are still very stupid, and say stupid stuff to me on a daily basis.
“By accident”, when working with a stochastic black-box model, doesn’t mean much. Aside from that, I’ve already seen moderately clever prose from LLMs tasked to emulate human writers. Not quite enjoyable to read, but, insofar as there’s an objective binary standard for “clever”, I’d say the bar is passed.
I would love to see an example. I saw people saying something similar about AI poetry once, but then I read the poetry and it was trash. The people just didn’t have any taste when it came to poetry, and thought any vague language that rhymed basically constituted an amazing poem.
A mathematician rubber duck debugging his way to a solution for an open mathematical problem with an LLM has happened, but we can’t really say how much or how little of a shared thought process came from the LLM.
I personally won’t give the AI any credit unless it does it by itself. After all, I said AI won’t generate important scientific breakthroughs, not that it couldn’t be used in some way to help generate a breakthrough.
AI still can’t drive a damned car well enough that if I bought a car I wouldn’t have to.
Aren’t we already there? I think the issue is mainly regulatory, at this point.
Are we? Even in a blizzard at night? I’m Canadian, so that matters for me.
As for the stuff about the architecture, I thought about it more, and developed more fleshed-out ideas about what I think is limited about the architecture in later posts. I think the fact that LLMs don’t update their weights during thinking (i.e., no online learning) is pretty limiting. But I’m essentially expanding on my intuition, and I have less to bring to the table in terms of analyzing the architecture than I do in saying “this doesn’t work as well as people say it does right now, this hasn’t improved as much as people say it has, and the benchmarks are a lie.”
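To spell out what I mean by “don’t update their weights during thinking”, here’s a toy PyTorch sketch (a linear layer standing in for the model; the shapes and loss are made up purely for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)    # stand-in for a trained LLM
x = torch.randn(1, 8)      # stand-in for the current context

# How "thinking" works today: many forward passes, weights frozen the whole time.
with torch.no_grad():
    for _ in range(16):    # e.g. a long chain-of-thought loop
        x = model(x)

# What online learning would add: a weight update from feedback gathered mid-task.
target = torch.randn(1, 8)  # stand-in for some learning signal
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
opt.step()                  # the model now carries over what it just "learned"
```

Today’s deployed models only ever do the first half; anything they “figure out” mid-conversation evaporates when the context does.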
The HeroIcons component matched almost exactly, except the stroke width was too thin on the new component. I wasn’t familiar with HeroIcons React components, so I asked Claude Opus 4.5 to make them match exactly in terms of thickness etc. Claude swore up and down there was no way to change the stroke of a HeroIcons component, and told me I’d have to “just use the inline SVG”, “accept the difference”, or simply “match by adjusting the inline SVG”.
Maybe I’m just tired, but the wording of this felt a little bit unclear. I had to read it over again to discern that you were discarding the old component and wanted the new one to ‘match’ by looking like what it replaced, rather than adding a new component somewhere while leaving the old one in place elsewhere in your branding, with the desire that things remain visually consistent. I wasn’t there, of course, but, depending on the phrasing, I could see a human making this mistake.
I would love to see an example. I saw people saying something similar about AI poetry once, but then I read the poetry and it was trash. The people just didn’t have any taste when it came to poetry, and thought any vague language that rhymed basically constituted an amazing poem.
I saw one a while back, but I unfortunately don’t remember where it was. To summarize, it had some fairly standard but seemingly original jokes, a coherent plot, and a theme that ran throughout and came together at the end. I think Claude was used to generate it. Not something I’d read for fun, but I didn’t see anything missing from it on a qualitative level. I was left thinking that a smarter model, ideally without that ubiquitous tone, which evokes the Corporate Memphis art style and comes standard in the post-training of every LLM currently on the market, could write something legitimately good.
I personally won’t give the AI any credit unless it does it by itself. After all, I said AI won’t generate important scientific breakthroughs, not that it couldn’t be used in some way to help generate a breakthrough.
The challenge here is that there isn’t a universally agreed-upon atomic unit for a research contribution (alas, salami slicing will continue to exist). If I write out some lemmas which constitute X percent of the work needed to solve an open problem, give them to a friend, and he pieces them together to write the proof that was needed, did he make a contribution by himself?
I’m not a mathematics researcher, so I can’t say exactly how impressive this is, but Terence Tao thinks that the “meaningful autonomous progress” box has been ticked.
Are we? Even in a blizzard at night? I’m Canadian, so that matters for me.
There are videos, but at the end of the day, performance is ambiguous enough that two people can look at them and draw different conclusions. Human driver quality comes from a wide distribution; I’d wager that self-driving cars can do better in a Canadian blizzard than a kid who just got his license in CA, but would falter in situations that an Alaskan roughneck with 40 years of experience could handle readily.