NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick

The newest generation of the popular Llama AI models is here with Llama 4 Scout and Llama 4 Maverick. Accelerated by NVIDIA open-source software, they can achieve over 40K output tokens per second on NVIDIA Blackwell B200 GPUs, and are available to try as NVIDIA NIM microservices.

The Llama 4 models are natively multimodal and multilingual, built on a mixture-of-experts (MoE) architecture. They deliver a variety of multimodal capabilities, driving advances in scale, speed, and efficiency that enable you to build more personalized experiences.

Llama 4 Scout is a 109B-parameter model with 17B parameters active per token across 16 experts. It supports a 10M-token context window and has been optimized and quantized to int4 to run on a single NVIDIA H100 GPU. This enables a variety of use cases, including multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over vast codebases.

Llama 4 Maverick is a 400B-parameter model with 17B parameters active per token across 128 experts, and it supports a 1M-token context window. The model delivers high-performance image and text understanding.

Optimized for NVIDIA TensorRT-LLM 

NVIDIA optimized both Llama 4 Scout and Llama 4 Maverick models for NVIDIA TensorRT-LLM. TensorRT-LLM is an open-source library used to accelerate LLM inference performance for the latest foundation models on NVIDIA GPUs.
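As an illustration, here is a minimal sketch of running Llama 4 Scout through the TensorRT-LLM Python LLM API. The model ID shown is an assumption; check the published checkpoint name and your TensorRT-LLM release before running.

# Minimal TensorRT-LLM inference sketch (assumes tensorrt_llm with the
# high-level LLM API installed and a GPU with enough memory for the model).
from tensorrt_llm import LLM, SamplingParams

# Hypothetical Hugging Face model ID for Llama 4 Scout.
llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")

prompts = ["Summarize the key trade-offs of mixture-of-experts models."]
sampling_params = SamplingParams(temperature=0.8, max_tokens=256)

# generate() builds (or loads) the engine and runs batched inference.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)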

You can use NVIDIA TensorRT Model Optimizer, a library that applies the latest algorithmic optimizations and quantization techniques to bfloat16 models, to accelerate inference with Blackwell FP4 Tensor Core performance without impacting model accuracy.
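As a rough sketch, post-training quantization with Model Optimizer follows the pattern below. The FP4 config name and the calibration details are assumptions that vary by modelopt version; real workflows calibrate on a representative dataset.

import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the bfloat16 checkpoint (hypothetical model ID).
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Run a small calibration batch through the model so quantizer
    # scales can be collected; substitute representative data here.
    inputs = tokenizer("calibration sample text", return_tensors="pt").to(m.device)
    m(**inputs)

# NVFP4_DEFAULT_CFG targets Blackwell FP4 Tensor Cores; the config name
# is an assumption and may differ across Model Optimizer releases.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)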

On the Blackwell B200 GPU, TensorRT-LLM delivers a throughput of over 40K tokens per second with an NVIDIA-optimized FP8 version of Llama 4 Scout as well as over 30K tokens per second on Llama 4 Maverick.

Figure 1. Llama 4 Scout throughput in output tokens per second on NVIDIA GPUs: 12,432 on the H200 versus 42,400 on the B200, a 3.4x increase

Blackwell delivers massive performance leaps through architectural innovations, including a second-generation Transformer Engine, fifth-generation NVLink, and support for FP8, FP6, and FP4 precisions, which enable higher performance for both training and inference. For Llama 4, these advancements provide 3.4x faster throughput and 2.6x better cost per token compared to NVIDIA H200.

The latest Llama 4 optimizations are available on the open source NVIDIA/TensorRT-LLM GitHub repository.

Ongoing collaboration between Meta and NVIDIA

NVIDIA and Meta have a long track record of collaborating to advance open models. NVIDIA is an active open-source contributor, helping you work efficiently, address your toughest challenges, improve performance, and lower cost.

Open-source models also promote AI transparency and let users broadly share work on AI safety and resilience. These open models, combined with NVIDIA accelerated computing, equip developers, researchers, and businesses to innovate responsibly across a wide variety of applications.

Post-train Llama models for higher accuracy

Fine-tuning the Llama models is seamless with NVIDIA NeMo, an end-to-end framework built for customizing large language models (LLMs) with your enterprise data.

Start by curating high-quality pretraining or fine-tuning datasets using NeMo Curator, which helps extract, filter, and deduplicate structured and unstructured data at scale. Then, use NeMo to fine-tune the Llama models efficiently, with support for LoRA and other parameter-efficient fine-tuning (PEFT) techniques as well as full-parameter tuning; a sketch of the LoRA technique follows below.
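NeMo ships its own fine-tuning recipes; as a framework-agnostic illustration of the LoRA technique itself, here is a minimal sketch using the Hugging Face peft library instead (the model ID is an assumption).

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint name; substitute the published Llama 4 ID.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct", torch_dtype="bfloat16"
)

# LoRA inserts small trainable low-rank adapters into the attention
# projections while the base weights stay frozen.
lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters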

Once fine-tuning is complete, you can evaluate model performance using NeMo Evaluator, which supports both industry benchmarks and custom test sets tailored to your specific use case.

With NeMo, enterprises gain a powerful, flexible workflow to adapt Llama models for production-ready AI applications.

Simplifying deployments with NVIDIA NIM

To ensure that enterprises can adopt them quickly, the Llama 4 models will be packaged as NVIDIA NIM microservices, making them easy to deploy on any GPU-accelerated infrastructure while preserving flexibility, data privacy, and enterprise-grade security.

NIM also simplifies deployment through support for industry-standard APIs, so you can get up and running quickly. Whether you’re using LLMs, vision models, or multimodal AI, NIM abstracts away the complexity of infrastructure and enables seamless scaling across clouds, data centers, and edge environments.

Get started today

Try the Llama 4 NIM microservices to experiment with your own data and build a proof of concept by integrating the NVIDIA-hosted API endpoint into your application.
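Because NIM exposes OpenAI-compatible endpoints, a proof of concept can be as small as the sketch below. The hosted base URL and the Llama 4 model identifier are assumptions; verify both against the NVIDIA API catalog.

from openai import OpenAI

# NVIDIA-hosted endpoint (assumed URL; check build.nvidia.com for the
# current base URL and the exact Llama 4 model identifier).
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",  # replace with your key
)

completion = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct",  # hypothetical model ID
    messages=[{"role": "user", "content": "Summarize the benefits of MoE models."}],
    max_tokens=256,
    temperature=0.7,
)
print(completion.choices[0].message.content)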