Context Length

  • Accelerating Inference with Skip Softmax in TensorRT-LLM


    Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT-LLM

    Skip Softmax is a technique for accelerating long-context inference in large language models (LLMs) by optimizing the attention computation. It dynamically prunes attention blocks that contribute negligibly to the output, reducing computation time without any retraining. The method is compatible with existing models and leverages NVIDIA's Hopper and Blackwell GPUs, delivering up to 1.4x speedups in both time-to-first-token and time-per-output-token while maintaining accuracy. Because attention is the dominant bottleneck at long context lengths, this makes deploying LLMs at scale in long-context scenarios markedly faster and more efficient; a simplified sketch of the block-pruning idea appears below.

    Read Full Article: Accelerating Inference with Skip Softmax in TensorRT-LLM
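
    The article describes the technique but does not reproduce the kernel, so the NumPy sketch below is only an illustration of the general idea, not the TensorRT-LLM implementation: attention logits are computed block by block, and any key/value block whose best logit falls more than a margin below the running row maximum is skipped, since its post-softmax weight would be negligible. The function name and the `block_size` and `threshold` values are hypothetical.

    ```python
    import numpy as np

    def skip_softmax_attention(q, k, v, block_size=64, threshold=-8.0):
        """Block-pruned attention sketch (illustrative, not the TensorRT-LLM kernel)."""
        d = q.shape[-1]
        out = np.zeros_like(q)
        for qs in range(0, q.shape[0], block_size):
            q_blk = q[qs:qs + block_size]
            # First pass: per-block logits and the running row maximum.
            blocks = []
            row_max = np.full((q_blk.shape[0], 1), -np.inf)
            for ks in range(0, k.shape[0], block_size):
                logits = q_blk @ k[ks:ks + block_size].T / np.sqrt(d)
                row_max = np.maximum(row_max, logits.max(axis=-1, keepdims=True))
                blocks.append((ks, logits))
            # Second pass: accumulate only the blocks that survive pruning.
            num = np.zeros_like(q_blk)
            den = np.zeros((q_blk.shape[0], 1))
            for ks, logits in blocks:
                if np.all(logits.max(axis=-1, keepdims=True) - row_max < threshold):
                    continue  # whole block carries negligible softmax mass: skip it
                w = np.exp(logits - row_max)
                num += w @ v[ks:ks + block_size]
                den += w.sum(axis=-1, keepdims=True)
            out[qs:qs + block_size] = num / den
        return out
    ```

    Deciding per block rather than per score is what keeps the pruning cheap: the skip test is one comparison per block, and a skipped block avoids both the exponentials and the value multiplication entirely.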

  • Adapting RoPE for Long Contexts


    Rotary Position Embeddings for Long Context Length

    Rotary Position Embeddings (RoPE) encode token positions in a sequence and improve on traditional sinusoidal embeddings by capturing relative rather than absolute positions. To adapt RoPE to longer context lengths, as in models like Llama 3.1, a scaling strategy modifies the frequency components: a scaling factor is applied to the low-frequency components to improve long-range stability, while the high-frequency components are preserved to keep local context intact. Reallocating the RoPE scaling budget in this way lets a model capture dependencies across a wide range of token distances and handle both short and long contexts effectively, which is increasingly important for natural language processing tasks that require understanding long sequences; a sketch of this frequency rescaling appears below.

    Read Full Article: Adapting RoPE for Long Contexts
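
    The scaling strategy can be sketched as follows. This is a minimal Python illustration modeled on the frequency-rescaling scheme popularized by Llama 3.1; the parameter names and default values (`scale_factor=8.0`, `low_freq_factor=1.0`, `high_freq_factor=4.0`, `original_context_len=8192`) are that model's published settings and are assumptions here rather than values taken from the article.

    ```python
    import math

    def scale_rope_inv_freqs(inv_freqs, scale_factor=8.0, low_freq_factor=1.0,
                             high_freq_factor=4.0, original_context_len=8192):
        """Rescale RoPE inverse frequencies for longer contexts.

        High-frequency (short-wavelength) components are kept to preserve local
        ordering; low-frequency (long-wavelength) components are divided by
        scale_factor for long-range stability; wavelengths in between are
        interpolated smoothly between the two regimes.
        """
        high_freq_wavelen = original_context_len / high_freq_factor
        low_freq_wavelen = original_context_len / low_freq_factor
        scaled = []
        for inv_freq in inv_freqs:
            wavelen = 2 * math.pi / inv_freq
            if wavelen < high_freq_wavelen:        # high frequency: keep local detail
                scaled.append(inv_freq)
            elif wavelen > low_freq_wavelen:       # low frequency: scale for long range
                scaled.append(inv_freq / scale_factor)
            else:                                  # smooth transition band
                smooth = (original_context_len / wavelen - low_freq_factor) / (
                    high_freq_factor - low_freq_factor)
                scaled.append((1 - smooth) * inv_freq / scale_factor + smooth * inv_freq)
        return scaled

    # Example: standard RoPE inverse frequencies for head dimension 128, base 10000.
    inv_freqs = [10000 ** (-2 * i / 128) for i in range(64)]
    long_context_inv_freqs = scale_rope_inv_freqs(inv_freqs)
    ```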