In an important development for AI inference, NVIDIA has unveiled the TensorRT-LLM Multi-Block Attention feature, which dramatically improves throughput on the NVIDIA HGX H200 platform. According to NVIDIA, this innovation boosts throughput by more than 3x for long sequence lengths, meeting the increasing demands of modern generative AI models.
Advances in generative artificial intelligence
The rapid development of generative AI models, exemplified by the Llama 2 and Llama 3.1 series, has led to models with much larger context windows. For example, Llama 3.1 models support context lengths of up to 128,000 tokens. This scaling enables AI models to perform complex cognitive tasks across large-scale datasets, but it also presents unique challenges in AI inference environments.
Challenges in AI inference
AI inference, especially with long sequence lengths, faces obstacles such as low-latency requirements and the need for small batch sizes. Traditional GPU deployment methods often underutilize the streaming multiprocessors (SMs) of NVIDIA GPUs, especially during the decoding phase of inference. This underutilization hurts overall system throughput, as only a small fraction of a GPU's SMs are kept busy, leaving many resources idle.
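As a rough illustration of why the decoding phase leaves SMs idle (the numbers below are assumptions for the sketch, not figures from NVIDIA): a conventional decode kernel that assigns one thread block per batch item and attention head can only occupy a handful of SMs when the batch is small.

```python
# Back-of-envelope sketch (all numbers are assumptions, not from the article):
# one thread block per (batch item, attention head) cannot fill a large GPU
# when serving with small batches.
sms_per_gpu = 132       # approximate SM count of an H100/H200-class GPU (assumed)
batch_size = 1          # low-latency, small-batch serving
heads_per_gpu = 8       # attention heads left on each GPU after tensor parallelism (assumed)

active_blocks = batch_size * heads_per_gpu
print(f"Thread blocks in flight: {active_blocks} of {sms_per_gpu} SMs "
      f"(~{100 * active_blocks / sms_per_gpu:.0f}% of the GPU busy at best)")
```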
The Multi-Block Attention solution
NVIDIA's TensorRT-LLM Multi-Block Attention addresses these challenges by maximizing GPU resource utilization. It divides the attention computation into smaller blocks and distributes them across all available SMs. This not only alleviates memory bandwidth limitations but also improves throughput by keeping the GPU's SMs busy during the decoding phase.
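One common way to realize this kind of split is a "split-KV" decomposition: each block computes a partial softmax over its own slice of the key/value cache, and a final reduction merges the partial results. The NumPy sketch below illustrates the arithmetic for a single decoded token; it is a simplified illustration under assumed shapes, not NVIDIA's kernel implementation, and the function name decode_attention_multiblock is made up for this example.

```python
import numpy as np

def decode_attention_multiblock(q, K, V, num_blocks):
    """Decode-phase attention for one query token, with the KV cache split into
    blocks so the work can be spread across many SMs. Each block produces a
    partial (unnormalized) softmax result; a final reduction merges them."""
    scale = 1.0 / np.sqrt(q.shape[0])
    partial_out, partial_max, partial_sum = [], [], []

    for K_blk, V_blk in zip(np.array_split(K, num_blocks),
                            np.array_split(V, num_blocks)):
        scores = (K_blk @ q) * scale      # attention logits for this KV slice
        m = scores.max()                  # block-local max (numerical stability)
        w = np.exp(scores - m)            # unnormalized attention weights
        partial_out.append(w @ V_blk)     # unnormalized weighted sum of values
        partial_max.append(m)
        partial_sum.append(w.sum())

    # Merge step: rescale each block's partial result by its max relative to the
    # global max, so the combined output equals single-pass softmax attention.
    g = max(partial_max)
    numer = sum(np.exp(m - g) * o for m, o in zip(partial_max, partial_out))
    denom = sum(np.exp(m - g) * s for m, s in zip(partial_max, partial_sum))
    return numer / denom

# Sanity check against single-pass attention on a long synthetic context.
rng = np.random.default_rng(0)
seq_len, d = 8192, 128
q = rng.standard_normal(d)
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))

scores = (K @ q) / np.sqrt(d)
weights = np.exp(scores - scores.max())
reference = (weights @ V) / weights.sum()
assert np.allclose(decode_attention_multiblock(q, K, V, num_blocks=8), reference)
```

Because each block only touches its own slice of the cache, the blocks can run concurrently on different SMs, and the cost of the extra reduction is small compared with the memory traffic it parallelizes.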
Performance on NVIDIA HGX H200
Enabling Multi-Block Attention on the NVIDIA HGX H200 showed remarkable results. It allows the system to generate up to 3.5x more tokens per second for long-sequence queries in low-latency scenarios. Even when model parallelism is used, halving the GPU resources available per model instance, a 3x performance increase is observed without affecting time to first token.
Implications and future expectations
This advance in AI inference technology allows existing systems to support greater context lengths without additional hardware investment. TensorRT-LLM Multi-Block Attention is enabled by default, providing a significant performance boost for AI models with extensive context requirements. This development underscores NVIDIA's commitment to advancing AI inference capabilities, enabling more efficient processing of complex AI models.
Image source: Shutterstock