Context Length

  • Accelerating Inference with Skip Softmax in TensorRT-LLM


    Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT-LLM

    Skip Softmax is a technique for accelerating long-context inference in large language models (LLMs) by optimizing the attention computation. It dynamically prunes attention blocks that contribute negligibly to the output, reducing computation time without any retraining. The method is compatible with existing models and leverages NVIDIA's Hopper and Blackwell GPUs, delivering up to 1.4x speedups in both time-to-first-token and time-per-output-token while maintaining accuracy. Because attention is the dominant bottleneck at long context lengths, this makes deploying LLMs at scale in long-context scenarios markedly faster and more efficient; a simplified sketch of the block-pruning idea appears below.

    Read Full Article: Accelerating Inference with Skip Softmax in TensorRT-LLM
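
    The article describes the technique but does not reproduce the kernel, so the NumPy sketch below is only an illustration of the general idea, not the TensorRT-LLM implementation: attention logits are computed block by block, and any key/value block whose best logit falls more than a margin below the running row maximum is skipped, since its post-softmax weight would be negligible. The function name and the `block_size` and `threshold` values are hypothetical.

    ```python
    import numpy as np

    def skip_softmax_attention(q, k, v, block_size=64, threshold=-8.0):
        """Block-pruned attention sketch (illustrative, not the TensorRT-LLM kernel)."""
        d = q.shape[-1]
        out = np.zeros_like(q)
        for qs in range(0, q.shape[0], block_size):
            q_blk = q[qs:qs + block_size]
            # First pass: per-block logits and the running row maximum.
            blocks = []
            row_max = np.full((q_blk.shape[0], 1), -np.inf)
            for ks in range(0, k.shape[0], block_size):
                logits = q_blk @ k[ks:ks + block_size].T / np.sqrt(d)
                row_max = np.maximum(row_max, logits.max(axis=-1, keepdims=True))
                blocks.append((ks, logits))
            # Second pass: accumulate only the blocks that survive pruning.
            num = np.zeros_like(q_blk)
            den = np.zeros((q_blk.shape[0], 1))
            for ks, logits in blocks:
                if np.all(logits.max(axis=-1, keepdims=True) - row_max < threshold):
                    continue  # whole block carries negligible softmax mass: skip it
                w = np.exp(logits - row_max)
                num += w @ v[ks:ks + block_size]
                den += w.sum(axis=-1, keepdims=True)
            out[qs:qs + block_size] = num / den
        return out
    ```

    Deciding per block rather than per score is what keeps the pruning cheap: the skip test is one comparison per block, and a skipped block avoids both the exponentials and the value multiplication entirely.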

  • Adapting RoPE for Long Contexts


    Rotary Position Embeddings for Long Context Length

    Rotary Position Embeddings (RoPE) encode token positions in a sequence and improve on traditional sinusoidal embeddings by capturing relative rather than absolute positions. To adapt RoPE to longer context lengths, as in models like Llama 3.1, a scaling strategy modifies the frequency components: a scaling factor is applied to the low-frequency components to improve long-range stability, while the high-frequency components are preserved to keep local context intact. Reallocating the RoPE scaling budget in this way lets a model capture dependencies across a wide range of token distances and handle both short and long contexts effectively, which is increasingly important for natural language processing tasks that require understanding long sequences; a sketch of this frequency rescaling appears below.

    Read Full Article: Adapting RoPE for Long Contexts
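
    The scaling strategy can be sketched as follows. This is a minimal Python illustration modeled on the frequency-rescaling scheme popularized by Llama 3.1; the parameter names and default values (`scale_factor=8.0`, `low_freq_factor=1.0`, `high_freq_factor=4.0`, `original_context_len=8192`) are that model's published settings and are assumptions here rather than values taken from the article.

    ```python
    import math

    def scale_rope_inv_freqs(inv_freqs, scale_factor=8.0, low_freq_factor=1.0,
                             high_freq_factor=4.0, original_context_len=8192):
        """Rescale RoPE inverse frequencies for longer contexts.

        High-frequency (short-wavelength) components are kept to preserve local
        ordering; low-frequency (long-wavelength) components are divided by
        scale_factor for long-range stability; wavelengths in between are
        interpolated smoothly between the two regimes.
        """
        high_freq_wavelen = original_context_len / high_freq_factor
        low_freq_wavelen = original_context_len / low_freq_factor
        scaled = []
        for inv_freq in inv_freqs:
            wavelen = 2 * math.pi / inv_freq
            if wavelen < high_freq_wavelen:        # high frequency: keep local detail
                scaled.append(inv_freq)
            elif wavelen > low_freq_wavelen:       # low frequency: scale for long range
                scaled.append(inv_freq / scale_factor)
            else:                                  # smooth transition band
                smooth = (original_context_len / wavelen - low_freq_factor) / (
                    high_freq_factor - low_freq_factor)
                scaled.append((1 - smooth) * inv_freq / scale_factor + smooth * inv_freq)
        return scaled

    # Example: standard RoPE inverse frequencies for head dimension 128, base 10000.
    inv_freqs = [10000 ** (-2 * i / 128) for i in range(64)]
    long_context_inv_freqs = scale_rope_inv_freqs(inv_freqs)
    ```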