AI Inference

The days when raw speed was the only metric that mattered are behind us. What matters now is throughput, efficiency, and economics at scale. As AI evolves from one-shot answers to multi-step reasoning, each query generates far more tokens, driving up the demand for inference and reshaping its underlying economics. Metrics such as tokens per watt, cost per million tokens, and tokens per second per user are crucial alongside raw throughput. For power-limited AI factories, NVIDIA's continuous software improvements translate into higher token revenue over time, underscoring the importance of our technological advancements.
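
As a back-of-the-envelope illustration of these economics metrics, here is a minimal sketch with invented inputs (not NVIDIA measurements) that derives tokens per second per user, tokens per watt, and cost per million tokens from a hypothetical serving run:

```python
# Minimal sketch of the inference-economics metrics named above.
# All inputs are hypothetical placeholders, not measured NVIDIA data.

def inference_economics(total_tokens: float, wall_seconds: float,
                        concurrent_users: int, avg_power_watts: float,
                        gpu_cost_per_hour: float) -> dict:
    tokens_per_sec = total_tokens / wall_seconds
    return {
        "tokens/sec/user": tokens_per_sec / concurrent_users,
        "tokens/watt": tokens_per_sec / avg_power_watts,
        # Cost of the GPU time consumed, scaled to one million tokens.
        "cost per 1M tokens": gpu_cost_per_hour * (wall_seconds / 3600)
                              / total_tokens * 1e6,
    }

print(inference_economics(total_tokens=5_000_000, wall_seconds=600,
                          concurrent_users=512, avg_power_watts=1000,
                          gpu_cost_per_hour=3.0))
```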

Pareto curves illustrate how NVIDIA Blackwell provides the best balance across the full spectrum of production priorities, including cost, energy efficiency, throughput, and responsiveness. Optimizing systems for a single scenario can limit deployment flexibility, leading to inefficiencies at other points on the curve. NVIDIA’s full-stack design approach ensures efficiency and value across multiple real-life production scenarios. Blackwell’s leadership stems from its extreme hardware-software co-design, embodying a full-stack architecture built for speed, efficiency, and scalability.
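
To make the Pareto-curve idea concrete, here is a minimal sketch (with invented data points) that filters serving configurations down to the non-dominated set over per-GPU throughput and per-user speed; real Pareto analyses sweep batch size, parallelism, and latency targets in the same way:

```python
# Sketch: extract the Pareto frontier from (throughput, tokens/sec/user)
# measurements of different serving configurations. Data is illustrative.

points = [  # (tokens/sec per GPU, tokens/sec per user)
    (12000, 20), (9000, 45), (5000, 90), (11000, 18), (4000, 60),
]

def pareto_frontier(pts):
    """Keep points not dominated in both throughput and interactivity."""
    frontier = []
    for p in pts:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in pts)
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

print(pareto_frontier(points))  # -> [(5000, 90), (9000, 45), (12000, 20)]
```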

Learn how Mixture of Experts powers the most intelligent frontier AI models and runs 10x faster on NVIDIA Blackwell NVL72 in this blog.

MLPerf Inference v5.1 Performance Benchmarks

Offline Scenario, Closed Division

| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset |
|---|---|---|---|---|---|---|
| DeepSeek R1 | 420,659 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| DeepSeek R1 | 289,712 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| DeepSeek R1 | 33,379 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| Llama3.1 405B | 16,104 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 14,774 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,660 tokens/sec | 8x B200 | Dell PowerEdge XE9685L | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 553 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 51,737 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 102,909 tokens/sec | 8x B200 | ThinkSystem SR680a V3 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 35,317 tokens/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 146,960 tokens/sec | 8x B200 | ThinkSystem SR780a V3 | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B | 66,037 tokens/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Whisper | 22,273 samples/sec | 4x GB200 | BM.GPU.GB200.4 | NVIDIA GB200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Whisper | 45,333 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Whisper | 34,451 samples/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Stable Diffusion XL | 33 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| Stable Diffusion XL | 19 samples/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| RGAT | 651,230 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP32 (72.86%) | IGBH |
| RetinaNet | 14,997 samples/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 (0.3755 mAP) | OpenImages (800x800) |
| DLRMv2 | 647,861 samples/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (AUC=80.31%) | Synthetic Multihot Criteo Dataset |
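
A note on the accuracy column: a target such as "99% of FP16 (exact match 81.9132%)" means a submission must reach at least 99% of the FP16 reference score. A small sketch, with the reference value back-calculated for illustration rather than taken from MLCommons files:

```python
# Sketch of how an MLPerf accuracy target like "99% of FP16
# (exact match 81.9132%)" is derived. The FP16 reference score below
# is back-calculated for illustration, not quoted from MLCommons.

fp16_reference = 82.7406          # hypothetical FP16 exact-match score (%)
target = 0.99 * fp16_reference    # submissions must score >= this
print(f"accuracy target >= {target:.4f}%")  # 81.9132%
```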

Server Scenario, Closed Division

| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|
| DeepSeek R1 | 209,328 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 167,578 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 18,592 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| Llama3.1 405B | 12,248 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 11,614 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,280 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 296 tokens/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B Interactive | 9,921 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B Interactive | 771 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B Interactive | 203 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 49,360 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 101,611 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 34,194 tokens/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 29,746 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 62,851 tokens/sec | 8x B200 | G894-SD1 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 23,080 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 128,794 tokens/sec | 8x B200 | Dell PowerEdge XE9685L | NVIDIA B200 | 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | TTFT/TPOT: 2000 ms/100 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B | 64,915 tokens/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | TTFT/TPOT: 2000 ms/100 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B Interactive | 122,269 tokens/sec | 8x B200 | AS-4126GS-NBR-LCC | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B Interactive | 54,118 tokens/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Stable Diffusion XL | 29 queries/sec | 8x B200 | Supermicro SYS-422GA-NBRT-LCC | NVIDIA B200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| Stable Diffusion XL | 18 queries/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| RetinaNet | 14,406 queries/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99% of FP32 (0.3755 mAP) | 100 ms | OpenImages (800x800) |
| DLRMv2 | 591,162 queries/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99% of FP32 (AUC=80.31%) | 60 ms | Synthetic Multihot Criteo Dataset |

MLPerf™ v5.1 Inference Closed: DeepSeek R1 99% of FP16, Llama3.1 405B 99% of FP16, Llama2 70B Interactive 99.9% of FP32, Llama2 70B 99.9% of FP32, Stable Diffusion XL, Whisper, RetinaNet, RGAT, DLRM 99% of FP32 accuracy target: 5.1-0007, 5.1-0009, 5.1-0026, 5.1-0028, 5.1-0046, 5.1-0049, 5.1-0060, 5.1-0061, 5.1-0062, 5.1-0069, 5.1-0070, 5.1-0071, 5.1-0072, 5.1-0073, 5.1-0075, 5.1-0077, 5.1-0079, 5.1-0086. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Llama3.1 8B Max Sequence Length = 2,048
Llama2 70B Max Sequence Length = 1,024
For various MLPerf™ scenario data, click here
For MLPerf™ latency constraints, click here
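
The server-scenario constraints above bound time to first token (TTFT) and time per output token (TPOT). MLPerf enforces these limits across the whole query stream, so the per-request check below is only a simplified sketch, with invented timestamps:

```python
# Sketch: check one request's token arrival times (seconds since the
# request was issued) against a TTFT/TPOT budget such as the
# 2000 ms/80 ms bound used for DeepSeek R1. Times are invented.

def meets_latency_constraint(token_times, ttft_ms, tpot_ms):
    ttft = token_times[0] * 1000.0
    # Average decode time per output token after the first token.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1) * 1000.0
    return ttft <= ttft_ms and tpot <= tpot_ms

times = [1.2] + [1.2 + 0.05 * i for i in range(1, 100)]  # 50 ms per token
print(meets_latency_constraint(times, ttft_ms=2000, tpot_ms=80))  # True
```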

LLM Inference Performance of NVIDIA Data Center Products

B200 Inference Performance - Max Throughput

| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | 1 | 8 | 128 | 2048 | 66,057 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 128 | 4096 | 39,496 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 128 | 7,329 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 5000 | 500 | 8,190 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 500 | 2000 | 57,117 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 1000 | 42,391 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 2000 | 34,105 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 2048 | 26,854 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 20000 | 2000 | 4,453 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 128 | 2048 | 37,844 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 128 | 4096 | 24,953 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 2048 | 128 | 6,251 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 5000 | 500 | 6,142 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 500 | 2000 | 27,817 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 1000 | 1000 | 25,828 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 1000 | 2000 | 22,051 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 2048 | 2048 | 17,554 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 20000 | 2000 | 2,944 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 128 | 2048 | 112,676 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 128 | 4096 | 68,170 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 128 | 18,088 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 1000 | 79,617 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 2000 | 63,766 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 2048 | 52,195 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 20000 | 2000 | 12,678 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 128 | 2048 | 14,699 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 128 | 4096 | 8,932 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 2048 | 128 | 3,137 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 5000 | 500 | 2,937 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 500 | 2000 | 11,977 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 1000 | 1000 | 10,591 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 1000 | 2000 | 9,356 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 2048 | 2048 | 7,152 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 20000 | 2000 | 1,644 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 128 | 2048 | 9,922 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 128 | 4096 | 6,831 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 128 | 1,339 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 5000 | 500 | 1,459 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 500 | 2000 | 7,762 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 1000 | 7,007 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 2000 | 6,737 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 2048 | 4,783 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 20000 | 2000 | 665 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 128 | 2048 | 8,020 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 128 | 4096 | 6,345 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 2048 | 128 | 749 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 5000 | 500 | 1,048 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 500 | 2000 | 6,244 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 1000 | 1000 | 5,209 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 1000 | 2000 | 4,933 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 2048 | 2048 | 4,212 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 20000 | 2000 | 672 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read the Llama v3.1 405B blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)
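
As a worked example of the throughput definition in the note above (tokens/s = total generated tokens / total latency, first token included), with invented numbers:

```python
# tokens/s = total generated tokens / total latency, TTFT included.
# Numbers below are invented for illustration.

requests = 256           # concurrent requests in the batch (hypothetical)
output_len = 2048        # output tokens generated per request
total_latency_s = 130.0  # wall clock from first prompt to last token

throughput = requests * output_len / total_latency_s
print(f"{throughput:,.0f} output tokens/sec")  # ~4,033 output tokens/sec
```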

RTX PRO 6000 Blackwell Server Edition Inference Performance - Max Throughput

| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v4 Scout | 4 | 1 | 128 | 128 | 17,857 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 128 | 2048 | 9,491 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 2 | 2 | 128 | 4096 | 6,281 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 2048 | 128 | 3,391 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 5000 | 500 | 2,496 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 500 | 2000 | 9,253 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 1000 | 1000 | 8,121 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 1000 | 2000 | 6,980 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 2048 | 2048 | 4,939 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 128 | 2048 | 4,776 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 128 | 4096 | 2,960 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 500 | 2000 | 4,026 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 1000 | 1000 | 3,658 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 1000 | 2000 | 3,106 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 2048 | 2048 | 2,243 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 20000 | 2000 | 312 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 128 | 128 | 4,866 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 128 | 2048 | 3,132 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 2048 | 128 | 588 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 5000 | 500 | 616 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 500 | 2000 | 2,468 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 1000 | 1000 | 2,460 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 1000 | 2000 | 2,009 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 2048 | 2048 | 1,485 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 22,757 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 7,585 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 2,653 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 2,283 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 10,612 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 8,000 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 5,423 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 756 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read the Llama v3.1 405B blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)

H200 Inference Performance - Max Throughput

| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | 1 | 8 | 128 | 2048 | 42,821 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 128 | 4096 | 26,852 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 128 | 3,331 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 5000 | 500 | 3,623 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 500 | 2000 | 28,026 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 1000 | 23,789 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 2000 | 22,061 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 2048 | 16,672 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 20000 | 2000 | 1,876 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 128 | 2048 | 40,572 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 128 | 4096 | 24,616 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 128 | 7,307 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 5000 | 500 | 8,456 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 500 | 2000 | 37,835 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 1000 | 31,782 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 2000 | 34,734 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 2048 | 20,957 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 20000 | 2000 | 4,106 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 128 | 2048 | 34,316 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 128 | 4096 | 21,332 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 2048 | 128 | 3,699 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 5000 | 500 | 4,605 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 500 | 2000 | 24,630 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 1000 | 1000 | 21,636 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 1000 | 2000 | 18,499 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 2048 | 2048 | 14,949 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 20000 | 2000 | 2,105 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 128 | 2048 | 4,336 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 128 | 4096 | 2,872 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 128 | 442 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 5000 | 500 | 566 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 500 | 2000 | 3,666 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 1000 | 2,909 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 2000 | 2,994 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 2048 | 2,003 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 20000 | 2000 | 283 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,661 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 4096 | 5,167 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 2048 | 128 | 456 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 5000 | 500 | 650 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 500 | 2000 | 4,724 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,330 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 3,722 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,948 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 505 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 26,221 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 18,027 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,538 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,902 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 20,770 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 17,744 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 16,828 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 12,194 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,804 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read the Llama v3.1 405B blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)

H100 Inference Performance - Max Throughput

| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.3 70B | 1 | 2 | 128 | 2048 | 6,651 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 128 | 4096 | 4,199 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 2048 | 128 | 762 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 5000 | 500 | 898 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 500 | 2000 | 5,222 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 1000 | 1000 | 4,205 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 1000 | 2000 | 4,146 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 2048 | 2048 | 3,082 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 20000 | 2000 | 437 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 128 | 2048 | 4,340 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 128 | 4096 | 3,116 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 2048 | 128 | 453 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 5000 | 500 | 610 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 500 | 2000 | 3,994 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 2,919 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 2,895 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,296 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 345 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 22,714 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 14,325 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,450 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,459 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,660 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 15,220 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 13,899 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 9,305 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,351 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |

TP: Tensor Parallelism
PP: Pipeline Parallelism

L40S Inference Performance - Max Throughput

| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v4 Scout | 2 | 2 | 128 | 2048 | 1,105 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 128 | 4096 | 707 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 4 | 1 | 2048 | 128 | 561 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 4 | 1 | 5000 | 500 | 307 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 500 | 2000 | 1,093 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 1000 | 1000 | 920 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 1000 | 2000 | 884 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 2048 | 2048 | 615 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 128 | 2048 | 1,694 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 2 | 2 | 128 | 4096 | 972 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 500 | 2000 | 1,413 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 1000 | 1000 | 1,498 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 1000 | 2000 | 1,084 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 2048 | 2048 | 773 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 8,471 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 2,888 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,017 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 863 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,032 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 3,134 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,148 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 280 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |

TP: Tensor Parallelism
PP: Pipeline Parallelism

Inference Performance of NVIDIA Data Center Products

B200 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 6.8 images/sec | - | 225.55 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Stable Diffusion XL | 1 | 2.85 images/sec | - | 522.86 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| ResNet-50v1.5 | 2048 | 118,265 images/sec | 121 images/sec/watt | 17.32 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| BEVFusion Head | 1 | 2869.15 images/sec | 6 sequences/sec/watt | 0.35 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Flux Image Generator | 1 | 0.48 images/sec | - | 2079.78 | 1x B200 | DGX B200 | 25.08-py3 | FP4 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF Swin Base | 128 | 4,572 samples/sec | 5 samples/sec/watt | 28 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF Swin Large | 128 | 2,820 samples/sec | 3 samples/sec/watt | 45.4 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF ViT Base | 1024 | 8,839 samples/sec | 9 samples/sec/watt | 115.85 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF ViT Large | 2048 | 3,127 samples/sec | 3 samples/sec/watt | 655.02 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Yolo v10 M | 1 | 849.29 sequences/sec | 1 sequences/sec/watt | 1.18 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Yolo v11 M | 1 | 1043.32 samples/sec | 1 samples/sec/watt | 0.96 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
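
For these fixed-batch benchmarks, throughput and latency appear to be linked by the batch size (throughput ≈ batch size / latency). As a hedged sketch, the snippet below reproduces the ResNet-50 v1.5 row from the table above under that assumption:

```python
# Assumption: for a fixed batch, throughput ~= batch_size / latency.
# Reproducing the B200 ResNet-50 v1.5 row under that assumption:

batch_size = 2048
latency_ms = 17.32
throughput = batch_size / (latency_ms / 1000.0)
print(f"{throughput:,.0f} images/sec")  # ~118,245, close to the
                                        # 118,265 images/sec reported
```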

H200 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 3.92 images/sec | - | 330 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Stable Diffusion XL | 1 | 1.6 images/sec | - | 750.22 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| ResNet-50v1.5 | 2048 | 81,317 images/sec | 117 images/sec/watt | 25.19 | 1x H200 | DGX H200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| BEVFusion Head | 1 | 2005.18 sequences/sec | 6 sequences/sec/watt | 0.5 | 1x H200 | DGX H200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Flux Image Generator | 1 | 0.21 images/sec | - | 4813.58 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF Swin Base | 128 | 2,976 samples/sec | 4 samples/sec/watt | 43 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF Swin Large | 128 | 1,803 samples/sec | 3 samples/sec/watt | 70.98 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF ViT Base | 2048 | 4,930 samples/sec | 7 samples/sec/watt | 415.4 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF ViT Large | 2048 | 1,684 samples/sec | 2 samples/sec/watt | 1215.82 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Yolo v10 M | 1 | 432.01 images/sec | 1 images/sec/watt | 2.31 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Yolo v11 M | 8 | 509.23 images/sec | 1 images/sec/watt | 1.96 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384

GH200 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 2048 | 78,875 images/sec | 119 images/sec/watt | 25.97 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| BEVFusion Head | 1 | 2013.77 images/sec | 6 images/sec/watt | 0.5 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF Swin Base | 128 | 2,886 samples/sec | 4 samples/sec/watt | 44.35 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF Swin Large | 128 | 1,733 samples/sec | 3 samples/sec/watt | 73.87 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF ViT Base | 2048 | 4,710 samples/sec | 7 samples/sec/watt | 434.79 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF ViT Large | 2048 | 1,626 samples/sec | 2 samples/sec/watt | 1259.68 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| Yolo v10 M | 1 | 433.57 images/sec | 1 images/sec/watt | 2.31 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| Yolo v11 M | 1 | 504.17 images/sec | 1 images/sec/watt | 1.98 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384

H100 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 3.83 images/sec | - | 340.56 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Stable Diffusion XL | 1 | 1.6 images/sec | - | 774.71 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| ResNet-50v1.5 | 2048 | 75,476 images/sec | 110 images/sec/watt | 27.13 | 1x H100 | DGX H100 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| BEVFusion Head | 1 | 1998.95 images/sec | 6 images/sec/watt | 0.5 | 1x H100 | DGX H100 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Flux Image Generator | 1 | 0.21 images/sec | - | 4747.1 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF Swin Base | 128 | 2,852 samples/sec | 4 samples/sec/watt | 44.88 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF Swin Large | 128 | 1,792 samples/sec | 3 samples/sec/watt | 71.44 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF ViT Base | 2048 | 4,988 samples/sec | 7 samples/sec/watt | 410.58 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF ViT Large | 2048 | 5,418 samples/sec | 8 samples/sec/watt | 377.97 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Yolo v10 M | 1 | 407.43 images/sec | 1 images/sec/watt | 2.45 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Yolo v11 M | 1 | 476 images/sec | 1 images/sec/watt | 2.1 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384

L40S Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 1.65 images/sec | - | 607.21 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| Stable Diffusion XL | 1 | 0.6 images/sec | - | 1676.69 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| ResNet-50v1.5 | 2048 | 23,555 images/sec | 68 images/sec/watt | 86.94 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| BEVFusion Head | 1 | 1944.21 images/sec | 7 images/sec/watt | 0.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF Swin Base | 32 | 1,376 samples/sec | 4 samples/sec/watt | 23.26 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF Swin Large | 32 | 705 samples/sec | 2 samples/sec/watt | 45.42 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF ViT Base | 1024 | 1,655 samples/sec | 5 samples/sec/watt | 618.88 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF ViT Large | 2048 | 570 samples/sec | 2 samples/sec/watt | 3591.09 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| Yolo v10 M | 1 | 273.25 samples/sec | 1 samples/sec/watt | 3.66 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| Yolo v11 M | 1 | 308 images/sec | 1 images/sec/watt | 3.25 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384

View More Performance Data

Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing whether AI systems are ready to be deployed in the field and deliver meaningful results.

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.