Recent advances in sequence modeling have introduced a new State Space Model (SSM) architecture that surpasses traditional Transformers by addressing their O(L^2) complexity limitation for sequences of length L. By integrating delta-rule updates with the representational power of gated convolutions, the new architecture achieves O(L) complexity, making it a strong baseline for sequence modeling tasks. It not only matches but exceeds the performance and speed of Transformers, even at relatively short sequence lengths, thanks to mildly optimized Triton kernels. This development is significant because it offers a more efficient and scalable way to process long sequences in natural language processing and other domains.
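To make the linear-complexity claim concrete, below is a minimal sketch of a generic gated delta-rule recurrence, written as a sequential scan in plain Python/NumPy. This is an illustration of the general technique, not the article's exact architecture or kernels: the shapes, the decay gate `alpha`, the write strength `beta`, and the function name are all assumptions for the sake of the example. The point is that the state update at each step costs a fixed amount of work, so the whole pass over a sequence of length L is O(L) rather than O(L^2).

```python
import numpy as np

def gated_delta_rule_scan(q, k, v, beta, alpha):
    """Sequential O(L) sketch of a gated delta-rule recurrence (illustrative only).

    Hypothetical shapes for a single head:
      q, k: (L, d_k), v: (L, d_v), beta, alpha: (L,)
    The state S has shape (d_v, d_k); each step costs O(d_k * d_v),
    independent of L, so the full scan is linear in sequence length.
    """
    L, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    outputs = np.empty((L, d_v))
    for t in range(L):
        k_t, v_t, q_t = k[t], v[t], q[t]
        # Delta-rule ("error-correcting") update: erase the old association
        # along k_t, write the new (v_t, k_t) pair, with a decay gate alpha[t]
        # on the previous state (the gate is an assumption of this sketch).
        S = alpha[t] * (S - beta[t] * np.outer(S @ k_t, k_t)) + beta[t] * np.outer(v_t, k_t)
        outputs[t] = S @ q_t
    return outputs

# Example usage with random inputs (L = 8, d_k = d_v = 4).
rng = np.random.default_rng(0)
L, d = 8, 4
out = gated_delta_rule_scan(rng.normal(size=(L, d)), rng.normal(size=(L, d)),
                            rng.normal(size=(L, d)),
                            rng.uniform(0, 1, size=L), rng.uniform(0.9, 1.0, size=L))
print(out.shape)  # (8, 4)
```

In practice, architectures of this kind replace the Python loop with chunked, hardware-friendly kernels (e.g., Triton), but the asymptotic cost stays linear in L.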
Read Full Article: New SSM Architecture Exceeds Transformer Baseline