Triton kernels
-
Software FP8 for GPUs: 3x Speedup on Memory Operations
Read Full Article: Software FP8 for GPUs: 3x Speedup on Memory Operations
A workaround has been developed to enable FP8 support on GPUs that lack native hardware support, such as the RTX 3050. The method packs several lower-precision FP8 values into 32-bit words using bitwise operations inside Triton kernels, yielding a roughly threefold speedup on memory-bound operations such as GEMV and FlashAttention. The approach is compatible with a wide range of GPUs, including the RTX 30/20 series and older models. Although still at an early stage, it is functional and open to community feedback. This matters because it delivers a substantial performance boost to users with older or less capable GPUs, without requiring hardware upgrades.
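The core bit-packing idea can be sketched in plain NumPy: four 8-bit (FP8-encoded) values fit in one 32-bit word, so a kernel can load a quarter as many words from memory and unpack in registers. This is only an illustrative sketch of the technique, not the article's actual Triton implementation, and the function names are hypothetical.

```python
import numpy as np

def pack_fp8x4(vals_u8: np.ndarray) -> np.ndarray:
    """Pack groups of four 8-bit (FP8-encoded) values into one uint32 word.
    Sketch of the bitwise idea; the article's version runs inside Triton
    kernels on the GPU."""
    v = vals_u8.reshape(-1, 4).astype(np.uint32)
    # Lane i occupies bits [8*i, 8*i + 8).
    return v[:, 0] | (v[:, 1] << np.uint32(8)) | (v[:, 2] << np.uint32(16)) | (v[:, 3] << np.uint32(24))

def unpack_fp8x4(packed: np.ndarray) -> np.ndarray:
    """Inverse: recover the four 8-bit lanes from each uint32 word."""
    out = np.empty((packed.size, 4), dtype=np.uint8)
    for i in range(4):
        out[:, i] = (packed >> np.uint32(8 * i)) & np.uint32(0xFF)
    return out.reshape(-1)
```

Because GEMV and attention are memory-bandwidth-bound, moving 4x fewer bytes per element is what produces the speedup, even though the unpacking adds a few extra integer instructions.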
-
New SSM Architecture Exceeds Transformer Baseline
Read Full Article: New SSM Architecture Exceeds Transformer Baseline
Recent advances in sequence modeling have introduced a new State Space Model (SSM) architecture that surpasses traditional Transformers by addressing their O(L^2) complexity in the sequence length L. By combining delta-rule updates with the representational power of gated convolutions, the architecture achieves O(L) complexity, making it a strong baseline for sequence modeling tasks. It not only matches but exceeds Transformer performance and speed even at relatively short sequence lengths, using only mildly optimized Triton kernels. This is significant because it offers a more efficient and scalable way to process long sequences in natural language processing and other domains.
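To make the linear-time claim concrete, here is a minimal NumPy sketch of a sequential delta-rule state update, S_t = S_{t-1}(I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t. This is a generic delta-rule recurrence under assumed shapes, not the article's specific architecture (which also uses gated convolutions and fused Triton kernels); all names are illustrative.

```python
import numpy as np

def delta_rule_scan(q, k, v, beta):
    """Sequential delta-rule recurrence over a length-L sequence.

    Shapes: q, k -> (L, d_k); v -> (L, d_v); beta -> (L,).
    Cost is O(L * d_k * d_v): linear in L, unlike O(L^2) attention.
    """
    L, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))          # fixed-size state, independent of L
    out = np.empty((L, d_v))
    for t in range(L):
        # Erase the old association stored along key k_t, then write
        # the new (k_t -> v_t) association, both scaled by beta_t.
        S = S - beta[t] * np.outer(S @ k[t], k[t]) + beta[t] * np.outer(v[t], k[t])
        out[t] = S @ q[t]             # read the state with query q_t
    return out
```

With beta = 1 and orthonormal keys, each query k_t exactly retrieves its stored value v_t, which is the associative-memory view of the delta rule; the O(L) cost comes from the state S having a fixed size regardless of sequence length.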
