New gpt-oss models from NVIDIA and OpenAI hit a record 1.5M tokens per second

Optimized for advanced reasoning tasks, gpt-oss-120b and gpt-oss-20b offer developers unprecedented access to cutting-edge AI tools.

OpenAI and NVIDIA have unveiled two cutting-edge open-weight large language models (LLMs) — gpt-oss-120b and gpt-oss-20b — designed to bring advanced reasoning capabilities into the hands of developers, researchers, startups, and enterprises worldwide.

These models mark a major step forward in open AI development, offering state-of-the-art performance, broad flexibility, and efficiency across a wide range of deployment environments.

Trained on NVIDIA H100 GPUs and optimized for deployment across NVIDIA's vast CUDA ecosystem, the models run best on Blackwell-powered GB200 NVL72 systems, where they achieve inference speeds of up to 1.5 million tokens per second.

Blackwell at the core

Both models are released under the Apache 2.0 license, allowing full commercial and research use.

“OpenAI showed the world what could be built on NVIDIA AI — and now they’re advancing innovation in open-source software,” said Jensen Huang, founder and CEO of NVIDIA.

“The gpt-oss models let developers everywhere build on that state-of-the-art open-source foundation, strengthening U.S. technology leadership in AI — all on the world’s largest AI compute infrastructure.”

The gpt-oss-120b model achieves near-parity with OpenAI’s o4-mini on core reasoning benchmarks and can run on a single 80 GB GPU, while the smaller gpt-oss-20b matches the performance of o3-mini and is optimized to run on edge devices with just 16 GB of memory.
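A back-of-envelope calculation shows why these memory figures are plausible. The sketch below assumes the weights are stored at roughly 4 bits per parameter (low-bit quantization was reported for the gpt-oss release, though the article itself does not state a precision), and ignores activation and KV-cache memory:

```python
# Rough weight-memory estimate; the ~4 bits/parameter figure is an
# assumption about the released checkpoints, not stated in the article.
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(weight_memory_gb(117, 4))  # gpt-oss-120b: 58.5 GB -> fits a single 80 GB GPU
print(weight_memory_gb(21, 4))   # gpt-oss-20b:  10.5 GB -> fits a 16 GB edge device
```

At higher precisions (e.g. 16-bit weights) the 120b model would not fit on one 80 GB GPU, which is why the quantized release matters for single-GPU deployment.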

Both models perform strongly in chain-of-thought (CoT) reasoning, tool use, and structured outputs, and are ideal for low-latency, real-time tasks.

Framework flexibility for developers

The models are fully compatible with leading frameworks like FlashInfer, Hugging Face, llama.cpp, Ollama, and vLLM, alongside NVIDIA’s TensorRT-LLM stack.

This flexibility enables developers to use their preferred tools while benefiting from NVIDIA’s end-to-end optimization.
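Most of these frameworks expose an OpenAI-compatible HTTP endpoint, so a client-side request looks the same regardless of backend. The sketch below only builds the JSON request body; the model identifier and endpoint are illustrative assumptions, and no network call is made:

```python
import json

# Hypothetical request to a locally served gpt-oss endpoint (e.g. one
# launched with vLLM or llama.cpp); model name and URL are assumptions.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [
        {"role": "user", "content": "Summarize MoE routing in one sentence."}
    ],
    "max_tokens": 128,
}

body = json.dumps(payload)  # this string would be POSTed to ENDPOINT
```

Because the request shape is standardized, swapping vLLM for TensorRT-LLM or Ollama typically means changing only the endpoint URL and model name.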

Architecturally, both models use a Mixture-of-Experts (MoE) approach. gpt-oss-120b contains 117 billion parameters, with only 5.1 billion active per token, while gpt-oss-20b uses 3.6 billion active parameters out of a total 21 billion.
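The parameter counts above imply that only a small slice of each model is active for any given token, which is the source of MoE's inference efficiency. A quick check using the article's own numbers:

```python
def active_fraction(active_billion: float, total_billion: float) -> float:
    """Fraction of total parameters activated per token in an MoE model."""
    return active_billion / total_billion

# Figures from the article: 5.1B of 117B, and 3.6B of 21B.
print(f"gpt-oss-120b: {active_fraction(5.1, 117):.1%} active per token")  # ~4.4%
print(f"gpt-oss-20b:  {active_fraction(3.6, 21):.1%} active per token")   # ~17.1%
```

So the 120b model touches under 5% of its weights per token, giving it the capacity of a 117B-parameter model at roughly the per-token compute cost of a ~5B dense model.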

Both support 128k context lengths, employ Rotary Positional Embeddings, and feature advanced attention techniques that balance power and memory efficiency.
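Rotary Positional Embeddings encode a token's position by rotating pairs of query/key features through a position-dependent angle. The minimal sketch below shows the core rotation for a single feature pair; the real models apply this across all attention heads, and the exact base and any long-context scaling used by gpt-oss are assumptions here:

```python
import math

def rope_rotate(pair, position, dim_index, head_dim, base=10000.0):
    """Rotate one (even, odd) feature pair by its RoPE angle.

    Minimal sketch of Rotary Positional Embeddings; base=10000 is the
    common default, not confirmed for gpt-oss specifically.
    """
    x, y = pair
    theta = position * base ** (-2.0 * dim_index / head_dim)
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Position 0 leaves the pair unchanged; later positions rotate it.
print(rope_rotate((1.0, 0.0), position=0, dim_index=0, head_dim=64))  # (1.0, 0.0)
print(rope_rotate((1.0, 0.0), position=1, dim_index=0, head_dim=64))
```

Because relative rotation between two positions depends only on their distance, attention scores become a function of relative offset, which is what lets such models generalize over long (here, 128k-token) contexts.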

In benchmark testing, gpt-oss-120b outperformed several proprietary models, including OpenAI’s o1 and o4-mini, on tasks related to healthcare (HealthBench), mathematics (AIME 2024 and 2025), and coding (Codeforces).

The smaller gpt-oss-20b performed comparably, even with significantly lighter infrastructure demands.

The models were trained using a mix of supervised fine-tuning, reinforcement learning, and techniques from OpenAI’s top-tier proprietary systems.

They support variable reasoning effort settings (low, medium, high), allowing developers to balance performance with latency.
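In the released gpt-oss chat format, the effort level is reportedly selected via the system message (e.g. a line like "Reasoning: high"); the exact phrasing below follows the model card convention and is an assumption, not something stated in this article:

```python
def build_messages(question: str, effort: str = "medium") -> list[dict]:
    """Build a chat message list selecting a gpt-oss reasoning effort.

    The "Reasoning: <level>" system-prompt convention is an assumption
    based on the published model card, not on this article.
    """
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"effort must be low/medium/high, got {effort!r}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": question},
    ]

msgs = build_messages("What is 17 * 24?", effort="high")
```

Low effort trades chain-of-thought depth for latency, so a deployment might default to "low" for interactive chat and switch to "high" for offline analytical jobs.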

To ensure safety, the models were evaluated using OpenAI’s Preparedness Framework and adversarial fine-tuning tests. Independent experts reviewed the methodology, helping establish safety standards comparable to the company’s closed frontier models.

OpenAI and NVIDIA have also partnered with major deployment platforms like Azure, AWS, Vercel, and Databricks, and hardware leaders including AMD, Cerebras, and Groq. Microsoft is enabling local inference of gpt-oss-20b on Windows devices via ONNX Runtime.

https://interestingengineering.com/innovation/openai-nvidia-open-weight-ai-models