For me, the legible reason Mamba initially appeared important (among other RNN/SSM architectures) is its scaling laws, see Figure 4 in the paper. It's as good as LLaMA's recipe, about 5 times more training-compute efficient than GPT-3, RWKV, and Hyena, and about 2 times more than RetNet and H3.
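To make the "N times more compute efficient" framing concrete: if each architecture's loss follows a power law in training compute, the multiplier is just the horizontal gap between the curves at a fixed loss. A minimal sketch with made-up coefficients (the real fits are what Figure 4 plots; these numbers are illustrative only):

```python
# Sketch: reading a compute-efficiency multiplier off power-law scaling fits.
# Assume each architecture's loss follows L(C) = a * C**(-b) in training
# compute C; the coefficients below are hypothetical, not the paper's fits.

def compute_to_reach(loss, a, b):
    """FLOPs needed to reach a given loss under L(C) = a * C**(-b)."""
    return (a / loss) ** (1.0 / b)

# Hypothetical fits: same exponent, different offsets,
# chosen so the gap works out to ~5x.
mamba = dict(a=2.0, b=0.05)
gpt3_recipe = dict(a=2.0 * 5 ** 0.05, b=0.05)

target_loss = 1.7
multiplier = compute_to_reach(target_loss, **gpt3_recipe) / compute_to_reach(
    target_loss, **mamba
)
print(f"compute multiplier at equal loss: {multiplier:.1f}x")  # ~5.0x
```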
But consider the observations in the StripedHyena post (section "Hybridization"): a mixture of Transformer and Hyena blocks has better training efficiency than either architecture alone. In particular, a hybrid 75% Hyena / 25% Transformer network is about 2 times more training efficient than pure Transformer, which is in turn about 2 times more training efficient than pure Hyena. The post links to earlier experiments to the same effect, so this isn't an isolated claim. Comparing pure architectures for training efficiency might therefore be the wrong question to ask.
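For concreteness, "75% Hyena / 25% Transformer" just means one block type in four through the depth. A minimal sketch of one plausible layout policy, spreading the attention blocks evenly (the actual interleaving pattern StripedHyena uses may differ):

```python
# Sketch: one way to lay out a 75% Hyena / 25% attention hybrid.
# This is an assumed, illustrative policy, not StripedHyena's actual layout.

def hybrid_layout(n_layers: int, attn_fraction: float = 0.25) -> list[str]:
    """Return a block type per layer, with attention spread evenly."""
    n_attn = round(n_layers * attn_fraction)
    stride = n_layers / max(n_attn, 1)
    attn_positions = {round(i * stride) for i in range(n_attn)}
    return ["attention" if i in attn_positions else "hyena"
            for i in range(n_layers)]

layout = hybrid_layout(16)
print(layout)                                    # 4 attention, 12 hyena
print(layout.count("attention") / len(layout))   # 0.25
```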
The fact that hybridization works better than pure architectures (architectures consisting of a single core type of block, let's say) is exactly the point that Nathan Labenz makes in the podcast and that I repeat at the beginning of the post.
(Ah, I actually forgot to repeat this point, apart from noting that Doyle predicted this in his architecture theory.)
Experimental results are a more legible and reliable form of evidence than philosophy-level arguments. When they're available, they're a reason to start paying attention to the philosophy in a way the philosophy itself isn't.
Incidentally, a hybrid Mamba/MHA model doesn't work significantly better than pure Mamba, at least as reported in Appendix E.2.2 of the paper (beware the left/right confusion in Figure 9). The hybridization effect is much more visible with Hyena; then again, the StripedHyena post studies hybridization in more detail, so it's unclear whether the question was explored as thoroughly for Mamba.