TensorRT-LLM
-
NVIDIA’s Blackwell Boosts AI Inference Performance
NVIDIA's Blackwell architecture is delivering significant performance improvements for AI inference, particularly for sparse mixture-of-experts (MoE) models such as DeepSeek-R1. By optimizing the entire technology stack, including GPUs, CPUs, networking, and software, NVIDIA raises token throughput per watt, reducing costs and extracting more value from existing infrastructure. Recent updates to the NVIDIA inference software stack, including TensorRT-LLM, have increased throughput by up to 2.8x, leveraging innovations such as the NVFP4 data format and multi-token prediction (MTP). These advances enable NVIDIA platforms such as the GB200 NVL72 and HGX B200 to deliver industry-leading performance, serving large AI models efficiently and improving user experiences. This matters because it allows AI platforms to serve more users at lower cost, driving broader adoption and innovation in AI applications.
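To make the NVFP4 idea concrete, the sketch below quantizes a tensor in 16-element micro-blocks, scaling each block so its largest magnitude lands on the largest FP4 (E2M1) value and snapping every element to the nearest representable value. This is a plain NumPy illustration under assumed names (quantize_nvfp4_like, dequantize_nvfp4_like are invented for this example), not NVIDIA's implementation: the real format also stores the block scale in FP8 (E4M3), applies a second per-tensor scale, and packs two 4-bit codes per byte, all of which is simplified here.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_like(x, block_size=16):
    """Quantize a 1-D array with NVFP4-style micro-block scaling (sketch).

    Each block of 16 elements gets its own scale so that the block's largest
    magnitude maps onto the largest FP4 value (6.0).
    Returns (values snapped to the FP4 grid, per-block scales).
    """
    assert x.size % block_size == 0
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scales = np.where(scales == 0.0, 1.0, scales)   # avoid divide-by-zero
    scaled = blocks / scales
    # Snap each scaled element to the nearest representable FP4 magnitude,
    # keeping its sign.
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q_vals = np.sign(scaled) * E2M1_GRID[nearest]
    return q_vals, scales

def dequantize_nvfp4_like(q_vals, scales):
    """Undo the block scaling to recover an approximation of the input."""
    return (q_vals * scales).reshape(-1)

# Round-trip example: per-element error is bounded by the block's scale.
x = np.random.default_rng(0).standard_normal(64).astype(np.float32)
q_vals, scales = quantize_nvfp4_like(x)
print(np.max(np.abs(x - dequantize_nvfp4_like(q_vals, scales))))
```

The per-block scale is what keeps a 4-bit grid usable: outliers in one block no longer force the rest of the tensor into a coarse range, which is the main reason block-scaled FP4 preserves accuracy better than a single tensor-wide scale.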
-
Accelerating Inference with Skip Softmax in TensorRT-LLM
Skip Softmax is a technique for accelerating long-context inference in large language models (LLMs) by optimizing the attention computation. It dynamically prunes attention blocks that contribute negligibly to the output, reducing computation time without any retraining. The method is compatible with existing models and targets NVIDIA Hopper and Blackwell GPUs, delivering speedups of up to 1.4x in both time-to-first-token and time-per-output-token. Skip Softmax maintains accuracy while providing substantial efficiency gains, making it a valuable tool for machine learning engineers working with long-context scenarios. This matters because it addresses the critical bottleneck of attention computation, enabling faster and more efficient deployment of LLMs at scale.
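To illustrate the kind of block pruning described above, the sketch below implements single-query attention that skips the softmax and value accumulation for key/value blocks whose largest attention weight is negligible. It is a minimal NumPy sketch of the general idea, with an invented function name and threshold, not the fused TensorRT-LLM kernel, which makes this kind of decision per tile inside the GPU attention kernel.

```python
import numpy as np

def skip_softmax_attention(q, K, V, block_size=128, threshold=1e-4):
    """Single-query attention that skips K/V blocks whose softmax weights
    are all negligible. Illustrative sketch; names are invented here.

    q: (d,) query vector; K: (n, d) keys; V: (n, d) values.
    threshold: a block is skipped when even its largest unnormalized
               softmax weight, exp(max_score - global_max), falls below it.
    """
    d = q.shape[0]
    scores = K @ q / np.sqrt(d)      # QK^T logits are still computed
    global_max = scores.max()        # anchor for a numerically stable softmax

    acc = np.zeros(V.shape[1])       # running weighted sum of values
    denom = 0.0                      # running softmax denominator
    for start in range(0, K.shape[0], block_size):
        blk = slice(start, start + block_size)
        blk_scores = scores[blk]
        # If even the largest weight in this block is tiny, the whole block
        # changes the output negligibly: skip its softmax and value matmul.
        if np.exp(blk_scores.max() - global_max) < threshold:
            continue
        w = np.exp(blk_scores - global_max)
        acc += w @ V[blk]
        denom += w.sum()
    return acc / denom

# Example: a long context where only the first block carries relevant keys,
# so every other block is pruned.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((8192, 64))
V = rng.standard_normal((8192, 64))
K[:128] += 4.0 * q                   # make the first block strongly relevant
print(skip_softmax_attention(q, K, V).shape)   # (64,)
```

The savings come from skipping the exponentials and the block's value matmul, not the score computation itself, which matches the claim that accuracy is preserved: only blocks whose softmax weights are provably tiny are dropped.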
