There are an awful lot of “promising new architectures” being thrown around. Few have demonstrated any notable results whatsoever. Fewer still have demonstrated the ability to compete with transformer LLMs on the kinds of tasks transformer LLMs are well suited for.
It’s basically just Mamba SSM and diffusion models, and they aren’t “better LLMs”. They seem like sidegrades to transformer LLMs at best.
HRMs, for example, seem to do incredibly, suspiciously well on certain kinds of puzzles, but I have yet to see them do anything in the language domain, or in math, coding, etc. Are HRMs generalists, like transformers? No evidence of that yet.
> Concretely, these are the developments I am predicting within the next six months (i.e. before Feb 1st 2026) with ~75% probability:

Basically, off the top of my head: I’d put 10% on that. Too short a timeframe.
SSMs are really quite similar to transformers. As with all the “sub-quadratic” transformer variants, the expectation is at best that they will do the same thing as transformers, only more efficiently.
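To make the “same thing, only more efficiently” point concrete, here is a minimal numpy sketch (mine, purely illustrative; real Mamba adds input-dependent gating and a hardware-aware parallel scan). Both layers map a length-n sequence to a length-n sequence, but attention scores all n² pairs while the SSM runs a linear-time recurrence over a fixed-size state:

```python
import numpy as np

def attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention: O(n^2) in sequence length n."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (n, n) pairwise scores
    mask = np.tril(np.ones(scores.shape, dtype=bool))  # position i sees <= i
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v     # (n, d)

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence, O(n): h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    y = np.empty((x.shape[0], C.shape[0]))
    for t in range(x.shape[0]):       # single pass, constant-size state
        h = A @ h + B @ x[t]
        y[t] = C @ h
    return y                          # (n, d)

rng = np.random.default_rng(0)
n, d, s = 8, 4, 16                    # toy sizes: seq len, model dim, state dim
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
A, B, C = 0.9 * np.eye(s), rng.standard_normal((s, d)), rng.standard_normal((d, s))
print(attention(x, Wq, Wk, Wv).shape, ssm_scan(x, A, B, C).shape)  # (8, 4) (8, 4)
```

The interface is identical, which is why the best case is “transformer-equivalent, but cheaper” rather than a new kind of ability.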
HRMs, continuous thought machines, and KANs, on the other hand, contain new and different ideas that make a discontinuous jump in abilities at least conceivable. So I think one should distinguish between those two types of “promising new architectures”.
My view is that these new ideas accumulate, and at some point somebody will be able to put them together in a new way to build actual AGI.
But the authors of these papers are not stupid. If there were straightforward applicability to language modelling, they would already have done that. If there were line of sight to GPT-4-level abilities in six months, they probably wouldn’t publish the paper.
KANs seem obviously of limited utility to me...?
I think it is a cool idea and has its applications, but you are right that it seems very unlikely to contribute to AGI in any way. There was nonetheless excitement about integrating KANs into transformers, which was easy to do but just didn’t improve anything.
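For what it’s worth, the “easy to do” part looks roughly like this: the transformer block’s two-matrix MLP gets swapped for a layer in which every input-output edge carries its own learnable univariate function. A hypothetical numpy sketch, with a fixed radial-basis expansion standing in for the B-splines of the actual KAN paper (all names and shapes here are mine):

```python
import numpy as np

class KANLayer:
    """Simplified KAN-style layer: y_j = sum_i phi_ij(x_i), where each edge
    function phi_ij is a learned mix of K fixed radial basis functions.
    (The actual KAN paper uses B-splines plus a SiLU term; this is a stand-in.)
    """
    def __init__(self, d_in, d_out, K=8, rng=None):
        rng = rng or np.random.default_rng(0)
        self.centers = np.linspace(-2.0, 2.0, K)                 # shared basis grid
        self.coef = 0.1 * rng.standard_normal((d_in, d_out, K))  # per-edge weights

    def __call__(self, x):                                    # x: (n, d_in)
        basis = np.exp(-(x[..., None] - self.centers) ** 2)   # (n, d_in, K)
        return np.einsum('nik,iok->no', basis, self.coef)     # (n, d_out)

# Drop-in for the usual d_model -> d_hidden -> d_model MLP block:
d_model, d_hidden = 16, 32
up, down = KANLayer(d_model, d_hidden), KANLayer(d_hidden, d_model)
x = np.random.default_rng(1).standard_normal((4, d_model))
print(down(up(x)).shape)  # (4, 16), same interface as the MLP it replaces
```

Same shapes in, same shapes out, so the swap itself is trivial; whether the per-edge functions buy anything at transformer scale is exactly the part that didn’t pan out.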
Ah, but is it only a point-in-time sidegrade, with a faster capability curve ahead? At the scale we are working at now, even a marginal efficiency improvement threatens to considerably accelerate at least the conventional concerns (power concentration, job loss, etc.).
It’s my impression that a lot of the “promising new architectures” are indeed promising. IMO a lot of them could compete with transformers if you invest in them. It just isn’t worth the risk while the transformer gold-mine is still open. Why do you disagree?
I disagree because I have yet to see any of those “promising new architectures” outperform even something like GPT-2 345M, weight for weight, at similar tasks. Or show similar performance with a radical reduction in dataset size. Or anything of the sort.
I don’t doubt that a better architecture than the LLM is possible. But if we’re talking AGI, then we need an actual general architecture. Not a narrow AI that destroys a specific benchmark, but a more general-purpose AI that happens to do reasonably well on a variety of benchmarks it wasn’t purposefully trained for.
We aren’t exactly swimming in that kind of thing.