The compute demands for large language model (LLM) inference are growing rapidly, fueled by the combination of larger model sizes, real-time latency requirements, and, most recently, AI reasoning. At the same time, as AI adoption grows, the ability of an AI factory to serve as many users as possible while maintaining good per-user experiences is key to maximizing the value it generates. Achieving high inference throughput and low inference latency on the latest models requires excellence across the entire technology stack, spanning silicon, network systems, and software.
MLPerf Inference v5.0 is the latest in a long-running benchmark suite that measures inference throughput across a range of different models and use cases. First introduced in 2019, MLPerf Inference has been continually updated with new models and scenarios to ensure that it remains a useful tool for measuring the inference performance of AI computing platforms.
This round adds three new benchmarks:
- Llama 3.1 405B: A 405-billion-parameter dense LLM. For the server scenario, the benchmark sets latency requirements of 6 seconds for time to first token (TTFT) and 175 ms for time per output token (TPOT).
- Llama 2 70B Interactive: A 70-billion-parameter dense LLM. This workload uses the same Llama 2 70B model first introduced in MLPerf Inference v4.0, but with more stringent latency constraints of 450 ms TTFT and 40 ms TPOT, or 25 tokens per second per user (see the conversion sketch after this list).
- Relational Graph Attention Network (R-GAT): A graph neural network (GNN) benchmark. GNNs are applied in a wide range of domains, including social network analysis, drug discovery, fraud detection, and molecular chemistry.
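To make these latency targets concrete, the short sketch below converts a TPOT limit into the minimum per-user token rate and estimates the end-to-end time to stream a response of a given length. The helper functions are hypothetical and are not part of the MLPerf harness; they simply illustrate the arithmetic behind the constraints above.

```python
def per_user_rate(tpot_ms: float) -> float:
    """Minimum sustained tokens/sec per user implied by a TPOT limit."""
    return 1000.0 / tpot_ms

def response_latency_s(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Worst-case seconds to stream `output_tokens` tokens within the limits."""
    return (ttft_ms + tpot_ms * (output_tokens - 1)) / 1000.0

# Llama 2 70B Interactive: 450 ms TTFT, 40 ms TPOT
print(per_user_rate(40))                   # 25.0 tokens/sec per user
print(response_latency_s(450, 40, 500))    # ~20.4 s for a 500-token response

# Llama 3.1 405B: 6 s TTFT, 175 ms TPOT
print(per_user_rate(175))                  # ~5.7 tokens/sec per user
```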
These new benchmarks join the many returning benchmarks covering a diverse set of models and use cases: ResNet-50, RetinaNet, 3D U-Net, DLRMv2, GPT-J, Stable Diffusion XL, Llama 2 70B, and Mixtral 8x7B.
NVIDIA submitted results on every benchmark in the data center category, delivering outstanding performance across the board, including new results on the newly added Llama 3.1 405B, Llama 2 70B Interactive, and GNN tests. This round, NVIDIA also submitted many results on the Blackwell architecture, using both the NVIDIA GB200 NVL72 and the NVIDIA DGX B200, which delivered substantial speedups over the prior-generation NVIDIA Hopper architecture. Hopper also continued to deliver excellent performance three years after its introduction, driven by ongoing software enhancements that keep increasing performance for that GPU family.
In this post, we take a closer look at the performance results and provide additional detail on the full-stack innovations that made them possible.
Blackwell sets the new performance standard in MLPerf
The NVIDIA Blackwell architecture, introduced at NVIDIA GTC 2024, is in full production, with availability from major cloud service providers and a broad range of server makers. Blackwell incorporates many architectural innovations, including a second-generation Transformer Engine, fifth-generation NVLink, and FP4 and FP6 precisions, that enable dramatically higher performance for both training and inference.
The Blackwell platform is available in multiple system form factors to serve a wide range of data center deployment requirements. NVIDIA submitted results using both the GB200 NVL72, a rack-scale system featuring 36 Grace CPUs and 72 Blackwell GPUs fully connected with NVLink and NVSwitch, and the DGX B200, which incorporates eight Blackwell GPUs connected with NVLink and NVSwitch.
Additionally, in this round, Blackwell submissions on Llama 3.1 405B, Llama 2 70B Interactive, Llama 2 70B, and Mixtral 8x7B made use of the second-generation Transformer Engine with FP4 Tensor Cores, NVIDIA TensorRT-LLM software for efficient model execution, and TensorRT Model Optimizer for FP4 quantization. The combination of these technologies enabled the use of FP4 precision, which offers twice the peak throughput on Blackwell compared to FP8, while meeting benchmark accuracy requirements.
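FP4 here refers to a 4-bit floating-point format (E2M1) used with per-block scale factors. As a rough illustration of the idea only, the NumPy sketch below quantizes a weight tensor to the FP4 value grid with one scale per block of 16 values and then dequantizes it. It is a simplified, hypothetical example, not the TensorRT Model Optimizer implementation, which also handles calibration, scale encoding, and accuracy recovery.

```python
import numpy as np

# Representable magnitudes of the E2M1 (FP4) format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x: np.ndarray, block: int = 16):
    """Quantize a 1-D tensor to FP4 with one scale per block of `block` values."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID.max()  # map block max to 6.0
    scales = np.where(scales == 0, 1.0, scales)
    scaled = x / scales
    # Snap each scaled value to the nearest representable FP4 magnitude, keeping the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate full-precision values from FP4 codes and block scales."""
    return (q * scales).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_fp4_blockwise(w)
print(np.abs(w - dequantize(q, s)).mean())  # small mean quantization error
```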
On the Llama 3.1 405B benchmark, GB200 NVL72 delivered up to 3.4x higher per-GPU performance compared to an eight-GPU NVIDIA H200 Tensor Core GPU system.

MLPerf Inference v5.0 results retrieved from http://www.mlcommons.org on April 2, 2025, from the following entries: 5.0-0058, 5.0-0060. Per-GPU performance is not a primary metric of MLPerf Inference v5.0 and is derived by dividing reported throughput by accelerator count. The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See http://www.mlcommons.org for more information.
At the system level, GB200 NVL72 increases performance by up to 30x through a combination of much higher per-GPU performance and 9x more GPUs in the system, all connected in a single NVLink domain with NVLink and NVLink Switch.
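As a rough, back-of-the-envelope check (not an MLPerf metric), the system-level figure follows from multiplying the per-GPU gain by the ratio of GPU counts:

```python
per_gpu_gain = 3.4        # GB200 NVL72 vs. H200, per GPU, Llama 3.1 405B
gpu_count_ratio = 72 / 8  # 9x more GPUs in one NVLink domain
print(per_gpu_gain * gpu_count_ratio)  # ~30.6x, in line with the up-to-30x system speedup
```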

MLPerf Inference v5.0 results retrieved from http://www.mlcommons.org on April 2, 2025, from the following entries: 5.0-0058, 5.0-0060. The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See http://www.mlcommons.org for more information.
Additionally, NVIDIA ran the Llama 2 70B benchmark from MLPerf Inference v4.1 on GB200 NVL72, achieving an unverified result of 869,203 tokens/second.
On the Llama 2 70B Interactive benchmark, the eight-GPU B200 system achieved 3.1x higher throughput compared to the NVIDIA submission using eight H200 GPUs.

MLPerf Inference v5.0, Closed, Data Center. Results retrieved from www.mlcommons.org on April 2, 2025. Results from the following entries: 5.0-0056, 5.0-0060. The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See http://www.mlcommons.org for more information.
B200 also delivered significant speedups on Llama 2 70B, Mixtral 8x7B, and Stable Diffusion XL.
| Benchmark | 8x Blackwell GPU (Server) | 8x Blackwell GPU (Offline) | 8x H200 GPU (Server) | 8x H200 GPU (Offline) | Blackwell Speedup (Server) | Blackwell Speedup (Offline) |
|---|---|---|---|---|---|---|
| Llama 2 70B (tokens/sec) | 98,443 | 98,858 | 33,072 | 34,988 | 3x | 2.8x |
| Mixtral 8x7B (tokens/sec) | 126,845 | 128,148 | 61,802 | 62,630 | 2.1x | 2.1x |
| Stable Diffusion XL (queries/sec Server, samples/sec Offline) | 28.44 | 30.38 | 18.30 | 18.99 | 1.6x | 1.6x |
MLPerf Inference v5.0, Closed, Data Center. Results retrieved from www.mlcommons.org on April 2, 2025. Results from the following entries: 5.0-0056, 5.0-0060. The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See http://www.mlcommons.org for more information.
Hopper continues to deliver outstanding performance
The Hopper platform, first introduced in March of 2022, continued to deliver outstanding inference performance on every benchmark in MLPerf Inference v5.0, including on the newly added Llama 3.1 405B and Llama 2 70B Interactive benchmarks.
As cloud service providers and enterprises seek to maximize the useful lives of their accelerated infrastructure investments, the ability of a platform to support new AI models and use cases is critical. At the same time, the revenue an AI factory can generate depends directly on its inference throughput: by increasing throughput for a given model on the same infrastructure with new software, token generation costs decrease and AI revenue generation potential increases.
On the Llama 2 70B benchmark, software optimizations have increased NVIDIA H100 Tensor Core GPU throughput by up to 1.5x over the last year. Those optimizations include GEMM and attention kernel optimizations, advanced kernel fusions, chunked prefill, and more. Additionally, pipeline parallelism improvements in TensorRT-LLM played an important role, helping to increase Llama 2 70B throughput on H100.
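Chunked prefill splits a long prompt into fixed-size chunks so that prefill work for a new request can be interleaved with decode steps of requests already in flight, rather than stalling those decodes behind one long prefill. The sketch below is a simplified, hypothetical scheduler that illustrates the idea; the request names, chunk size, and scheduling policy are made up, and TensorRT-LLM's actual in-flight batching scheduler is far more sophisticated.

```python
from collections import deque

def schedule_chunked_prefill(prompts, chunk_size=512, decoding=None):
    """Yield per-iteration batches mixing one prefill chunk with all pending decodes."""
    decoding = deque(decoding or [])                            # requests already generating tokens
    pending = deque((rid, 0, total) for rid, total in prompts)  # (request, tokens done, prompt length)
    while pending:
        batch = [("decode", rid) for rid in decoding]           # in-flight decodes are not starved
        rid, done, total = pending.popleft()
        step = min(chunk_size, total - done)
        batch.append(("prefill", rid, done, done + step))       # process one chunk of the prompt
        if done + step < total:
            pending.appendleft((rid, done + step, total))
        else:
            decoding.append(rid)                                # prompt finished; request starts decoding
        yield batch

# A 2,048-token prompt for request "c" arrives while "a" and "b" are decoding.
for i, batch in enumerate(schedule_chunked_prefill([("c", 2048)], decoding=["a", "b"])):
    print(i, batch)
```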
The Hopper architecture features an NVLink Switch, which allows each GPU to communicate with any other GPU at full bandwidth, irrespective of the number of GPUs that are communicating. This gives developers the flexibility to select the optimal parallelism mapping to maximize throughput for a given latency constraint. NVLink Switch communication can also be overlapped with GEMM computation at a fine-grained level, helping to increase Llama 3.1 405B throughput on H200 NVL8.
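The sketch below illustrates that overlap idea in a simplified, hypothetical form: a GEMM is split into output tiles, and while the result of one tile is being "communicated" (here a stub standing in for an NVLink Switch all-reduce), the next tile's math proceeds. It uses NumPy and a thread pool purely for illustration and is not how TensorRT-LLM implements the overlap.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def all_reduce_stub(partial):
    """Stand-in for an NVLink Switch all-reduce of one output tile."""
    return partial  # a real implementation would sum partials across GPUs

def overlapped_gemm(a: np.ndarray, b: np.ndarray, tiles: int = 4):
    """Compute a @ b tile by tile, overlapping each tile's 'communication'
    with the next tile's computation."""
    cols = np.array_split(np.arange(b.shape[1]), tiles)
    results, pending = [None] * tiles, None
    with ThreadPoolExecutor(max_workers=1) as comm:
        for i, c in enumerate(cols):
            partial = a @ b[:, c]                  # compute tile i
            if pending is not None:
                results[i - 1] = pending.result()  # tile i-1 comm finished in the background
            pending = comm.submit(all_reduce_stub, partial)
        results[tiles - 1] = pending.result()
    return np.hstack(results)

a, b = np.random.randn(128, 256), np.random.randn(256, 512)
assert np.allclose(overlapped_gemm(a, b), a @ b)
```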
The result of these ongoing optimizations is that Hopper achieved excellent performance on MLPerf’s latest and most challenging workloads, Llama 3.1 405B and Llama 2 70B Interactive.
The NVIDIA platform was also the only one to submit results on the Mixtral 8x7B benchmark, which uses a mixture-of-experts (MoE) model architecture, with Hopper performance increasing over the prior round. Performance on the GPT-J benchmark also increased again, bringing the cumulative improvement on Hopper since the benchmark was first introduced to 2.9x in the offline scenario and 3.8x in the server scenario.
Wrapping up
The NVIDIA Hopper platform delivers leadership performance both in training, as shown in the most recent round of MLPerf Training, and in inference, as these MLPerf Inference results show. Hopper remains an industry-leading platform three years after it was first launched, and with continued full-stack optimization it keeps delivering performance increases on existing AI use cases while supporting new ones, extending its useful life.
NVIDIA Blackwell sets the new standard for performance and energy efficiency, the key drivers of AI factory revenue and profitability. By delivering large performance gains on existing workloads and enabling even greater gains for more demanding scenarios, including the latest reasoning models, Blackwell is enabling the next wave of AI innovation.
NVIDIA is also scaling AI reasoning with Dynamo, which runs on both Hopper and Blackwell GPUs.
Acknowledgments
The work of many NVIDIA employees made these outstanding results happen. We would like to acknowledge the tireless efforts of Kefeng Duan, Shengliang Xu, Yilin Zhang, Robert Overman, Shobhit Verma, Viraat Chandra, Zihao Kong, Tin-Yin Lai, and Alice Cheng, among many others.
Results obtained using NVIDIA MLPerf v4.1 code with TensorRT-LLM 0.18.0.dev. Unverified MLPerf v4.1 Inference Closed Llama 2 70B offline. Result not verified by MLCommons Association. Unverified results have not been through an MLPerf review and may use measurement methodologies and/or workload implementations that are inconsistent with the MLPerf specification for verified results. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
Related resources
- GTC session: Enable Blackwell Inference With TensorRT Model Optimizer
- GTC session: How Math Libraries Can Help Accelerate Your Applications on Blackwell GPUs
- GTC session: Unlock Deep Learning Performance on Blackwell With cuDNN
- NGC Containers: NVIDIA MLPerf Inference
- SDK: Triton Inference Server