Interesting post. It was my understanding that Google’s counterpart to Kimi Linear and Deepseek Sparse for efficient long-horizon attention was built around the Titans architecture, which is publicly available. Is there something that leads you to believe that this is not the case, i.e. that Google abandoned this route and went with something different?
In any case, it makes me happy to see companies evolving in different directions with their architectures. The period in which everyone spent large sums of money doing more or less the same thing in more or less the same way felt very inefficient.
It’s a bit of a deepity, but also a game-theoretic conclusion, that “if DeepMind releases a paper, it is either something groundbreaking or something they will never use in production”. The TITANS paper is about a year old now, and the MIRAS paper about 9 months old. You would think that some other frontier lab would have implemented it by now if it worked that well. I suspect a piece is missing here, or maybe the time between a pre-training run and deployment is just way longer than I think it is, and all the frontier labs are looking at this.
To my understanding, TITANS requires you to do a backward pass during inference. That is probably a scaling disaster on the inference side as well, though maybe less so, since they do say it can be done efficiently and in parallel. It’s unclear to me!
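For what it’s worth, the “backward pass at inference” part is less exotic than it sounds. Here is a minimal numpy sketch of how I read the Titans memory rule; this is an assumption on my part, not Google’s implementation. The paper’s memory is a small MLP with input-dependent rates, whereas here it is a single linear map, and `memory_step`, the constants, and the toy mapping are all made up for illustration:

```python
import numpy as np

# Toy sketch of a Titans-style test-time memory update (my reading of the
# paper, NOT the actual implementation; all hyperparameters are made up).

rng = np.random.default_rng(0)
d = 8                     # key/value dimension (arbitrary)
M = np.zeros((d, d))      # memory parameters, updated during "inference"
S = np.zeros((d, d))      # momentum over past "surprise" (gradients)

theta, eta, alpha = 0.02, 0.5, 0.001  # lr, momentum, forgetting (illustrative)

def memory_step(M, S, k, v):
    """One per-token update: the backward pass is just the gradient of
    the write loss ||M k - v||^2 with respect to M."""
    err = M @ k - v                  # how "surprising" this token is
    grad = 2.0 * np.outer(err, k)    # d/dM of the squared error
    S = eta * S - theta * grad       # accumulate surprise with momentum
    M = (1.0 - alpha) * M + S        # weight decay acts as forgetting
    return M, S

# Stream (key, value) pairs, as tokens would arrive at inference time;
# the toy target is a fixed mapping: v is k reversed.
for _ in range(500):
    k = rng.standard_normal(d)
    M, S = memory_step(M, S, k, k[::-1].copy())

k = rng.standard_normal(d)
print(np.linalg.norm(M @ k - k[::-1]))  # should be small once memory fits the mapping
```

Each incoming token triggers one gradient step on the memory, which is why a naive implementation pays a backward pass per token; as I understand it, the paper’s parallelization claim is about batching these updates over chunks so they reduce to matmuls.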
I mean, you may just be right, and TITANS+MIRAS could be in the latter category. On the other hand, Gemma 3 (which we know does not use TITANS) probably benefits from a lot of RL environments, yet it absolutely sucks at this task, so it is also possible that they are using it in production.
I guess, like all things, we will know for sure once the open Chinese labs start doing it.
Something I see a lot in high-end ML is papers that are very simple conceptually but very tricky to get working properly. I’d imagine that having the dozen or so people who know all the hyperparameter tweaks needed to get public algorithm X running accounts for a lot of the ‘secret sauce’. Lots of people were wondering how DALL-E 2 worked, for instance, yet diffusion for high-quality image synthesis had already been around for a while at the time.
Also, off the top of my head, I can’t think of an instance where a lab had an entire secret algorithm locked up that was the basis for their lead. Feels like it’s always been public papers getting used to their full potential.
What a timeline, eh?