State tracking could be the next reasoning-tier breakthrough in frontier model capabilities. I believe the evidence below points strongly in that direction.
State space models already power the fastest available voice models, such as Cartesia’s Sonic (time-to-first-audio advertised as under 40ms). There are examples of SSMs such as Mamba, RWKV, and Titans outperforming transformers in research settings.
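For readers unfamiliar with the architecture class, the shared core of these models is a linear recurrence over a hidden state. The sketch below is a minimal, illustrative state-space layer in NumPy with hand-picked toy dynamics; it is not any published model's parameterization (Mamba and friends add input-dependent selection and parallel scan algorithms on top of this skeleton):

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Minimal linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    The state h carries information forward in O(1) memory per step, unlike
    attention's growing KV cache -- the property behind SSMs' streaming latency."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:  # sequential scan; real models use parallel scans/convolutions
        h = A @ h + B * x   # update hidden state
        ys.append(C @ h)    # read out
    return np.array(ys)

# Toy example: scalar input stream, 2-dimensional state.
A = np.array([[0.9, 0.0],
              [0.1, 0.8]])   # decay + transfer dynamics (made up for illustration)
B = np.array([1.0, 0.0])     # input enters the first state channel
C = np.array([0.0, 1.0])     # output reads the second state channel
out = ssm_scan(A, B, C, [1.0, 0.0, 0.0])  # impulse response of the toy system
```

The constant-size state update, rather than a cache that grows with context, is what enables time-to-first-audio figures like the one quoted above.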
Flagship LLMs are also bad at state tracking, even with RL for summarization. One of the less elegant workarounds is to force an explicit schema onto the top of every message. Tracker is the second most popular extension for SillyTavern, as measured by upvotes and comments on forum posts in the SillyTavern Discord server. The top spot by this metric goes to stepped thinking, though note its October 2024 release date: well after the zero-shot CoT paper by Kojima et al. (2022), and about one month after OpenAI's public release of o1-preview. Although Tracker was released one month after stepped thinking (i.e., before Structured Outputs but after JSON mode), it has overtaken memory extensions that were released earlier. This could reflect biases in the distribution of human raters, who may reward polished UI/UX for narrow workflows over pure effectiveness at consistently maintaining persistent tracking data across long contexts.
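A minimal sketch of the schema-injection workaround, assuming a hypothetical schema and helper function (the field names and `[STATE]` delimiters here are invented for illustration, not Tracker's actual implementation):

```python
import json

STATE_SCHEMA = {  # hypothetical fields; a real tracker's schema differs
    "location": "unknown",
    "time_of_day": "unknown",
    "inventory": [],
}

def inject_state(message: str, state: dict) -> str:
    """Prepend an explicit state block so the model re-reads the current
    state every turn instead of having to track it implicitly in context."""
    header = "[STATE]\n" + json.dumps(state, indent=2) + "\n[/STATE]\n"
    return header + message

prompt = inject_state("The party enters the tavern.",
                      {**STATE_SCHEMA, "location": "tavern"})
```

The inelegance is visible in the sketch itself: every turn re-spends tokens restating what a model with native state tracking would simply remember.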
Scaffolding has repeatedly proven useful at lower capability levels, only to be obviated by a more capable model that performs the previously assisted task natively, without external tools. For example, the stepped thinking extension is redundant if you are already using a reasoning model. Likewise, web search queries risk polluting the context with low-quality spam or intentionally poisoned data. Scoping to a trusted list of verified sources is not enough, since external documentation may not be task-relevant; we often find it preferable to ask humans to write in their own words. This is one reason retrieval-augmented generation (RAG) often hurts performance; I am confident that RAG is doomed.
I am aware of only one published work, by Ensign and Garriga-Alonso (2024), applying circuits-based interpretability tooling (positional Edge Attribution Patching) to Mamba; it finds that layer 39 (of 56 layers total) is important, though per Belrose et al. (2024) the middle layers are best for steering. I am unsure whether SSMs are fundamentally more or less interpretable than transformers; I personally lean weakly towards more, though I could be wrong.
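The core move behind patching-based attribution, caching activations from a clean run and splicing them into a corrupted run to see which layers matter, can be illustrated on a toy model. Everything below (the random 3-layer MLP, the norm-based effect metric) is a deliberately crude stand-in, far simpler than positional EAP:

```python
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 4)) for _ in range(3)]  # toy 3-layer "model"

def forward(x, patch_layer=None, patch_act=None):
    """Run the toy model, optionally overwriting one layer's activation
    with a cached one -- the basic operation of activation patching."""
    acts, h = [], x
    for i, W in enumerate(Ws):
        h = np.tanh(W @ h)
        if i == patch_layer:
            h = patch_act  # splice in the cached activation
        acts.append(h)
    return h, acts

clean, corrupt = np.ones(4), -np.ones(4)
_, clean_acts = forward(clean)
baseline, _ = forward(corrupt)

effects = []
for i in range(3):
    patched, _ = forward(corrupt, patch_layer=i, patch_act=clean_acts[i])
    # how much restoring this layer's clean activation moves the corrupted output
    effects.append(np.linalg.norm(patched - baseline))
```

A layer whose restored activation moves the output a lot is "important" in this crude sense; methods like EAP refine this into per-edge, gradient-approximated attributions rather than brute-force layer sweeps.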