Modern foundation models are almost universally based on the Transformer architecture and its core attention module. While self-attention is highly effective because it routes information densely within a context window, it suffers from two fundamental drawbacks: an inability to model anything outside a finite window and quadratic scaling with respect to the sequence length.
This computational inefficiency makes it difficult for Transformers to handle extremely long sequences. To address this, a new class of models called selective state space models has emerged, with the Mamba architecture being a leading example.
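To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head self-attention (not any library's implementation): the score matrix has one entry per pair of positions, so it is L × L, and both time and memory grow quadratically in the sequence length L.

```python
import numpy as np

def self_attention(x):
    """Naive single-head self-attention over a sequence x of shape (L, d).
    For illustration, x itself serves as queries, keys, and values
    (a real layer would apply learned projections first)."""
    L, d = x.shape
    scores = x @ x.T / np.sqrt(d)                    # shape (L, L): the quadratic cost
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ x                               # shape (L, d)

x = np.random.randn(128, 16)
out = self_attention(x)
print(out.shape)  # (128, 16)
```

Doubling L quadruples the size of `scores`, which is why very long contexts become impractical for vanilla attention.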
Mamba is a fully recurrent model that achieves Transformer-quality performance while scaling linearly in sequence length. A key weakness of prior subquadratic models was their inability to perform content-based reasoning. Mamba overcomes this by using a selection mechanism that allows its parameters to be functions of the input. Mechanistically, this allows the model to selectively focus on or ignore particular inputs, effectively filtering out irrelevant noise and remembering relevant information indefinitely. This property is essential for discrete data modalities such as text and DNA.
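The selection mechanism can be illustrated with a toy one-channel recurrence (a simplified sketch, not the paper's code; all names here are hypothetical). The point is that the step size Δ and the input/output matrices B and C are computed from each token x_t, so each token controls how strongly it is written into, and read out of, the hidden state:

```python
import numpy as np

def selective_ssm(x, A, w_delta, W_B, W_C):
    """Toy selective state space recurrence for one channel.
    Unlike a time-invariant SSM, delta, B and C depend on the current
    input x_t -- this is the 'selection' mechanism: a token with a tiny
    delta is effectively ignored, one with a large delta is stored."""
    n = A.shape[0]                                 # state dimension
    h = np.zeros(n)
    ys = np.empty_like(x)
    for t, xt in enumerate(x):
        delta = np.log1p(np.exp(w_delta * xt))     # softplus: positive, input-dependent step
        A_bar = np.exp(delta * A)                  # discretised diagonal state matrix
        B_bar = delta * (W_B * xt)                 # input-dependent input matrix
        h = A_bar * h + B_bar * xt                 # recurrence: h_t = A_bar h_{t-1} + B_bar x_t
        ys[t] = (W_C * xt) @ h                     # input-dependent readout C
    return ys

rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(4))               # negative diagonal keeps the state stable
y = selective_ssm(rng.standard_normal(32), A, 0.5,
                  rng.standard_normal(4), rng.standard_normal(4))
print(y.shape)  # (32,)
```

Because A is negative and Δ is input-dependent, the decay factor `exp(delta * A)` varies per token: near 1 the state is preserved (remembering), near 0 it is reset (forgetting), which is the content-based gating the prose describes.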
To make Mamba efficient on modern hardware, researchers designed a hardware-aware parallel algorithm that avoids materialising the full expanded state in slow GPU memory. Instead, it leverages kernel fusion and parallel scans to compute the recurrence more quickly than previous methods, achieving up to five times higher throughput than Transformers. This efficiency allows Mamba to make use of extremely long contexts, with performance improving on real data up to sequences of one million tokens in length. For example, Mamba outperforms prior state-of-the-art models in modelling audio waveforms and DNA sequences.
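The key to parallelising a recurrence is that composing steps of h_t = a_t·h_{t-1} + b_t is associative, so it can be evaluated with a log-depth scan instead of a sequential loop. Here is a simplified NumPy sketch of that idea for a scalar recurrence (the real Mamba kernel additionally fuses the discretisation into the scan and keeps the state in fast on-chip memory, which this sketch does not attempt):

```python
import numpy as np

def sequential(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t, one step at a time (O(L) depth)."""
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.array(out)

def combine(p, q):
    """Associative operator on (a, b) pairs: applying segment p then
    segment q is again an affine map h -> (a_p*a_q)*h + (a_q*b_p + b_q)."""
    return (p[0] * q[0], q[0] * p[1] + q[1])

def parallel_scan(a, b):
    """Inclusive scan by recursive doubling (O(log L) depth; each level's
    elementwise ops could run in parallel on a GPU -- simulated here)."""
    a, b = a.copy(), b.copy()
    n, shift = len(a), 1
    while shift < n:
        # Combine each element with the partial result `shift` steps back;
        # positions with no predecessor use the identity (a=1, b=0).
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = combine((a_prev, b_prev), (a, b))
        shift *= 2
    return b

rng = np.random.default_rng(1)
a, b = rng.uniform(0.5, 1.0, 8), rng.standard_normal(8)
print(np.allclose(parallel_scan(a, b), sequential(a, b)))  # True
```

Both functions produce the same hidden-state trajectory, but the scan needs only logarithmically many parallel steps, which is what makes the recurrence competitive on GPUs.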
In language modelling, Mamba is the first linear-time model to truly match or exceed the performance of strong Transformer recipes, such as those based on Llama. Its ability to reset its state at sequence boundaries also makes it ideal for handling packed documents or multi-episode reinforcement learning tasks. As we look toward emerging modalities requiring extremely long context, such as genomics and high-resolution video, selective state space models like Mamba are strong candidates to become the general backbone for the next generation of foundation models.