As robotics and autonomous vehicles advance, accelerating development of physical AI—which enables autonomous machines to perceive, understand, and perform complex actions in the physical world—has become essential. At the center of these systems are world foundation models (WFMs)—AI models that simulate physical states through physics-aware videos, enabling machines to make accurate decisions and interact seamlessly with their surroundings.
NVIDIA Cosmos is a platform that helps developers build custom world models for physical AI systems at scale. It offers open world foundation models and tools for every stage of development, from data curation to training to customization.
This post explains Cosmos and its key features that accelerate physical AI development.
Accelerating world model development with NVIDIA Cosmos
Building physical AI is challenging, demanding precise simulations and real-world behavior understanding and prediction. A key tool for overcoming these challenges is a world model, which predicts future environmental states based on past observations and current inputs. These models are invaluable for physical AI builders, enabling them to simulate, train, and refine systems in controlled environments.
However, developing effective world models requires vast amounts of data, computational power, and real-world testing, which can introduce significant safety risks, logistical hurdles, and prohibitive costs. To address these challenges, developers often turn to synthetic data generated from 3D simulations to train models. While synthetic data is a powerful tool, creating it is resource-intensive and may fall short of accurately reflecting real-world physics, particularly in complex or edge-case scenarios.
The end-to-end NVIDIA Cosmos platform accelerates world model development for physical AI systems. Built on CUDA, Cosmos combines state-of-the-art world foundation models, video tokenizers, and AI-accelerated data processing pipelines.
Developers can accelerate world model development by fine-tuning Cosmos world foundation models or building new ones from the ground up. In addition to Cosmos world foundation models, the platform also includes:
- NVIDIA NeMo Curator for efficient video data curation
- Cosmos Tokenizer for efficient, compact, and high-fidelity video tokenization
- Cosmos world foundation models pretrained for robotics and autonomous driving applications
- NVIDIA NeMo Framework for model training and optimization
Pretrained world foundation models for physical AI
Cosmos world foundation models are pretrained large generative AI models trained on 9,000 trillion tokens—including 20 million hours of data from autonomous driving, robotics, synthetic environments, and other related domains. These models create realistic synthetic videos of environments and interactions, providing a scalable foundation for training complex systems, from simulating humanoid robots performing advanced actions to developing end-to-end autonomous driving models.
These models use two architectures: autoregressive and diffusion. Both approaches use the transformer architecture for its scalability and effectiveness in handling complex temporal dependencies.
Autoregressive model
The Cosmos autoregressive model is designed for video generation, predicting the next token based on input text and past video frames. It uses a transformer decoder architecture, with key modifications for world model development.
- 3D RoPE (Rotary Position Embeddings) encodes spatial and temporal dimensions separately, ensuring precise video sequence representation.
- Cross-attention layers enable text inputs, providing better control over world generation.
- QK-normalization enhances training stability.
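To make the 3D RoPE idea concrete, here is a minimal sketch in plain Python. It assumes the channel dimension is split into three equal chunks, one each for the temporal, height, and width axes, and that each chunk gets a standard 1D rotary embedding. The equal split, the base of 10000, and the function names are illustrative assumptions, not the actual Cosmos implementation.

```python
import math

def rotate_pairs(vec, position, base=10000.0):
    """Apply a standard 1D rotary embedding to an even-length vector."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = position * base ** (-i / half)  # per-pair rotation angle
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def rope_3d(vec, t, h, w):
    """Rotate three equal channel chunks by the t, h, and w coordinates (assumed split)."""
    chunk = len(vec) // 3
    assert chunk * 3 == len(vec) and chunk % 2 == 0
    return (rotate_pairs(vec[:chunk], t)
            + rotate_pairs(vec[chunk:2 * chunk], h)
            + rotate_pairs(vec[2 * chunk:], w))

def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))
```

Because each chunk is a pure rotation, the attention logit between a query and a key depends only on their relative offset along each axis, which is why the spatial and temporal dimensions can be encoded separately.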
Pretraining of this model is progressive, starting with predicting up to 17 future frames from a single input frame, then extending to 34 frames, and eventually up to 121 frames (or 50,000 tokens). Text inputs are introduced to combine descriptions with video frames, and the model is fine-tuned with high-quality data for robust performance. This structured approach enables the model to generate videos of varying lengths and complexities, with or without text inputs.
Diffusion models
Diffusion models are popular for generating images, videos, and audio due to their ability to deconstruct training data and reconstruct it based on user input, producing high-quality, realistic outputs.
Diffusion models operate in two phases:
- Forward diffusion process: Training data is progressively corrupted by adding Gaussian noise over multiple steps, effectively transforming it into pure noise.
- Reverse diffusion process: The model learns to reverse this noise step by step, recovering the original data by denoising the corrupted input.
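The two phases can be sketched with a generic variance-preserving (DDPM-style) formulation. This is an illustration only: the post does not state which noise schedule or parameterization Cosmos uses, and `alpha_bar` (the fraction of signal retained at a given step) and the function names are assumptions.

```python
import math
import random

def forward_diffuse(x0, alpha_bar, rng):
    """Corrupt x0 in one shot: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    xt = [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]
    return xt, noise

def recover_x0(xt, predicted_noise, alpha_bar):
    """Invert the forward step given a noise estimate, which a trained denoiser would supply."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(x - noise_scale * n) / signal_scale for x, n in zip(xt, predicted_noise)]
```

During training, the model is asked to predict `noise` from `xt`. When the prediction is exact, `recover_x0` returns the original data, which is the sense in which the reverse process undoes the corruption.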
Once trained, diffusion models generate new data by sampling random Gaussian noise and passing it through the learned denoising process. Cosmos diffusion models also include several key updates tailored for physical AI development.
- 3D patchification processes video into smaller patches, simplifying spatio-temporal sequence representation.
- Hybrid positional embeddings handle spatial and temporal dimensions, supporting videos with varying resolutions and frame rates.
- Cross-attention layers incorporate text inputs, enabling better control over video generation based on descriptions.
- Adaptive layer normalization with LoRA reduces model size by 36%, maintaining high performance with fewer resources.
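As an illustration of 3D patchification, the sketch below cuts a single-channel video, stored as nested lists indexed `[t][h][w]`, into non-overlapping spatio-temporal blocks and flattens each block into one token vector. The patch sizes and the function name are hypothetical, and a real model would operate on multi-channel latent tensors rather than Python lists.

```python
def patchify_3d(video, pt, ph, pw):
    """Split video[t][h][w] into flattened patches of size pt x ph x pw."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    patches = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                # One patch gathers all values in the pt x ph x pw block.
                patches.append([video[t][h][w]
                                for t in range(t0, t0 + pt)
                                for h in range(h0, h0 + ph)
                                for w in range(w0, w0 + pw)])
    return patches
```

The resulting sequence has `(T / pt) * (H / ph) * (W / pw)` tokens, so larger patches shorten the sequence the transformer has to attend over.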
Model sizes for varied needs
Developers can choose from the following three model sizes to meet performance, quality, and deployment needs.
- Nano: Optimized for real-time, low-latency inference and edge deployment.
- Super: Designed as performant baseline models.
- Ultra: Focused on maximum quality and fidelity, ideal for distilling custom models.
Strengths and limitations
Cosmos world foundation models generate low-resolution, real-world-accurate synthetic video data, essential for training robotics and autonomous vehicle systems. While they lack artistic flair, their outputs closely replicate the physical world, making them ideal for precise object permanence and realistic scenarios in physical AI model training.
Guardrails for safe use of Cosmos world foundation models
AI models need guardrails to ensure reliability by mitigating hallucinations, preventing harmful outputs, safeguarding privacy, and aligning with AI standards for safe and controlled deployment. Cosmos ensures the safe use of its world foundation models through a customizable, two-stage guardrail system aligned with NVIDIA’s commitment to trustworthy AI.
Cosmos Guardrails operates in two stages: Pre-guard and Post-guard.
Pre-guard
This stage involves text prompt-based safety measures using two layers:
- Keyword Blocking: A blocklist checker scans prompts for unsafe keywords, using lemmatization to detect variations and blocking non-English terms or spelling errors.
- Aegis Guardrail: The NVIDIA fine-tuned Aegis AI Content Safety model detects and blocks semantically unsafe prompts, including categories like violence, harassment, and profanity. Unsafe prompts halt video generation and return an error message.
Post-guard
The Post-guard stage ensures the safety of generated videos through:
- Video Content Safety Classifier: A multiclass classifier evaluates every video frame for safety. If any frame is flagged as unsafe, the entire video is rejected.
- Face Blur Filter: All human faces in generated videos are blurred using the RetinaFace model to protect privacy and reduce biases based on age, gender, or race.
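The control flow of the two stages can be sketched as follows. Everything here is a hypothetical stand-in: the blocklist term, the toy `naive_lemma` function (real lemmatization is more involved), and the `generate` and `frame_is_safe` callables are placeholders, and the Aegis model and face blurring are omitted.

```python
BLOCKLIST = {"weapon"}  # hypothetical placeholder term

def naive_lemma(word):
    """Toy stand-in for lemmatization: lowercase, strip punctuation and a plural 's'."""
    word = word.lower().strip(".,!?")
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def pre_guard(prompt):
    """Return True when no blocklisted term appears in the prompt."""
    return not any(naive_lemma(w) in BLOCKLIST for w in prompt.split())

def post_guard(frames, frame_is_safe):
    """Return True only when every frame passes the safety classifier."""
    return all(frame_is_safe(f) for f in frames)

def generate_with_guardrails(prompt, generate, frame_is_safe):
    """Run pre-guard, then generation, then post-guard; raise on any rejection."""
    if not pre_guard(prompt):
        raise ValueError("prompt blocked by pre-guard")
    frames = generate(prompt)
    if not post_guard(frames, frame_is_safe):
        raise ValueError("video rejected by post-guard")
    return frames
```

The point of the sketch is the ordering: unsafe prompts stop before any video is generated, and a single flagged frame is enough to reject the whole output.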
NVIDIA experts rigorously test with adversarial examples, annotating over 10,000 prompt-video pairs to refine the system and address edge cases.
Evaluating Cosmos world foundation models for 3D consistency and physics alignment
Cosmos benchmarks play a crucial role in assessing the ability of world foundation models to simulate real-world physics accurately and efficiently for physical AI applications. While publicly available benchmarks for video generation focus on fidelity, temporal consistency, and speed of generated videos, Cosmos benchmarks add new dimensions to evaluate generalist models: 3D consistency and physics alignment, ensuring the videos are evaluated based on accuracy required for physical AI systems.
3D consistency
Cosmos models were tested for 3D consistency on static scenes from a curated subset of 500 videos from an open dataset. Text prompts describing the videos were generated to avoid motion-related complexities. Comparisons were made against VideoLDM, a baseline generative model.
Metrics used
- Geometric Consistency: Assessed through epipolar geometry constraints using metrics like Sampson error and camera pose estimation success rate.
- View Synthesis Consistency: Evaluated through metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). These metrics measure the quality of synthesized views from interpolated camera positions.
Lower Sampson error and higher success rates indicate better 3D alignment. Similarly, higher PSNR and SSIM and lower LPIPS indicate better view synthesis quality.
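For readers unfamiliar with the Sampson error, the sketch below computes the standard first-order (squared) Sampson distance for one point correspondence under a fundamental matrix `F`. It is a textbook formulation under the assumption of homogeneous image points `[u, v, 1]`; the post does not say how the benchmark aggregates the value across correspondences or whether it takes a square root.

```python
def mat_vec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(M):
    """Transpose a 3x3 matrix."""
    return [[M[j][i] for j in range(3)] for i in range(3)]

def sampson_error(F, x1, x2):
    """Squared Sampson distance of correspondence x1 <-> x2 under fundamental matrix F."""
    Fx1 = mat_vec(F, x1)
    Ftx2 = mat_vec(transpose(F), x2)
    residual = sum(a * b for a, b in zip(x2, Fx1))  # epipolar constraint x2^T F x1
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return residual ** 2 / denom
```

A correspondence that satisfies the epipolar constraint exactly gives an error of zero, so lower values mean the generated frames are more consistent with a single rigid 3D scene.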
Model | Sampson Error ↓ | Pose Estimation Success Rate (%) ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
VideoLDM | 0.841 | 4.40% | 26.23 | 0.783 | 0.135 |
Cosmos 1.0 Diffusion Text2World 7B | 0.355 | 62.60% | 33.02 | 0.939 | 0.070 |
Cosmos 1.0 Diffusion Video2World 7B | 0.473 | 68.40% | 30.66 | 0.929 | 0.085 |
Cosmos 1.0 Autoregressive 4B | 0.433 | 35.60% | 32.56 | 0.933 | 0.090 |
Cosmos 1.0 Autoregressive Video2World 5B | 0.392 | 27.00% | 32.18 | 0.931 | 0.090 |
Real videos (reference) | 0.431 | 56.40% | 35.38 | 0.962 | 0.054 |
Results
Cosmos world foundation models outperform the baseline in 3D consistency (Table 1), with higher geometric alignment and camera pose success rates. Their synthesized views approach the quality of real videos, supporting their use as world simulators.
Physics alignment
Physics alignment tests how well Cosmos models simulate real-world physics, including motion, gravity, and energy dynamics. Using NVIDIA PhysX and NVIDIA Isaac Sim, eight controlled scenarios were designed to evaluate properties like gravity, collision, torque, and inertia in virtual environments.
Metrics used
- Pixel-Level Metrics: Peak Signal-to-Noise Ratio (PSNR) measures how closely the pixel values of the model’s output match the reference video. Higher values indicate less noise and better accuracy. Structural Similarity Index Measure (SSIM) assesses the similarity in structure, luminance, and contrast between the generated and ground-truth frames. Higher SSIM values reflect greater visual fidelity.
- Feature-Level Metric: DreamSim measures the similarity between high-level features extracted from both videos. This approach evaluates the semantic consistency of the generated content, focusing on objects and motion rather than individual pixels.
- Object-Level Metric: Intersection-over-Union (IoU) calculates the overlap between the predicted and actual object regions in the video. This is especially useful for tracking specific objects through the simulation to ensure their behavior aligns with physical expectations.
Higher PSNR, SSIM, DreamSim, and IoU values indicate better physics alignment.
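Two of these metrics are simple enough to sketch directly. The code below computes PSNR over flattened pixel lists and IoU over flattened binary masks; it assumes 8-bit pixel values by default and is meant as an illustration of the definitions, not the evaluation code used for the table.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)

def mask_iou(predicted, truth):
    """Intersection-over-union of two equal-length binary masks."""
    intersection = sum(1 for p, t in zip(predicted, truth) if p and t)
    union = sum(1 for p, t in zip(predicted, truth) if p or t)
    return intersection / union if union else 1.0
```

In a physics benchmark, `mask_iou` would be applied per frame to the mask of a tracked object and then averaged, so an object that drifts from its simulated trajectory lowers the score even if the frame looks plausible.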
Model | Conditioning | PSNR ↑ | SSIM ↑ | DreamSim ↑ | Avg. IoU ↑ |
Cosmos 1.0 Diffusion Video2World 7B | prompt + 1 frame | 17.34 | 0.54 | 0.84 | 0.332 |
Cosmos 1.0 Diffusion Video2World 7B | prompt + 9 frames | 21.06 | 0.69 | 0.86 | 0.592 |
Cosmos 1.0 Diffusion Video2World 14B | prompt + 1 frame | 16.81 | 0.52 | 0.84 | 0.338 |
Cosmos 1.0 Diffusion Video2World 14B | prompt + 9 frames | 20.21 | 0.64 | 0.86 | 0.598 |
Cosmos 1.0 Autoregressive 4B | 1 frame | 17.91 | 0.49 | 0.83 | 0.394 |
Cosmos 1.0 Autoregressive 4B | 9 frames | 18.13 | 0.48 | 0.86 | 0.481 |
Cosmos 1.0 Autoregressive Video2World 5B | prompt + 1 frame | 17.67 | 0.48 | 0.82 | 0.376 |
Cosmos 1.0 Autoregressive Video2World 5B | prompt + 9 frames | 18.29 | 0.48 | 0.86 | 0.481 |
Cosmos 1.0 Autoregressive 12B | 1 frame | 17.94 | 0.49 | 0.83 | 0.395 |
Cosmos 1.0 Autoregressive 12B | 9 frames | 18.22 | 0.49 | 0.87 | 0.487 |
Cosmos 1.0 Autoregressive Video2World 13B | prompt + 1 frame | 18.00 | 0.49 | 0.83 | 0.397 |
Cosmos 1.0 Autoregressive Video2World 13B | prompt + 9 frames | 18.26 | 0.48 | 0.87 | 0.482 |
Results
Cosmos world foundation models show strong adherence to physical laws (Table 2), particularly with increased conditioning data. Post-training on a camera-conditioning dataset achieves a twofold increase in pose estimation success rate compared to baseline models. However, challenges like object impermanence (where objects vanish or appear unexpectedly) and implausible behaviors (such as violating gravity) highlight areas for improvement.
Customizing for physical AI applications with Cosmos and NVIDIA Omniverse
- Video search and understanding: Simplifies video tagging and search by understanding spatial and temporal patterns, making training data preparation easier.
- Controllable 3D-to-real synthetic data generation: With NVIDIA Omniverse, developers can create 3D scenarios and use Cosmos to generate photorealistic videos that are precisely controlled by 3D scenes for highly tailored synthetic datasets.
- Policy model development and evaluation: World foundation models fine-tuned for action-conditioned video prediction enable scalable, reproducible evaluation of policy models—strategies mapping states to actions—reducing reliance on risky real-world tests or complex simulations for tasks like obstacle navigation or object manipulation.
- Foresight for action selection: Cosmos equips physical AI models with predictive capabilities to assess the outcomes of potential actions.
- Multiverse simulation: Using Cosmos and NVIDIA Omniverse, developers can simulate multiple future outcomes to help AI models evaluate and select the best strategy for achieving their goals, benefiting applications like predictive maintenance and autonomous decision-making.
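The foresight and multiverse ideas above share one pattern: roll each candidate action forward through a world model and keep the one with the best predicted outcome. The sketch below shows that pattern with hypothetical `world_model` and `score` callables; in practice the world model would be a fine-tuned, action-conditioned video model rather than a simple function.

```python
def select_action(state, candidate_actions, world_model, score, horizon=3):
    """Pick the action whose simulated future state scores highest."""
    def rollout(action):
        # Apply the same action for `horizon` steps and score the final state.
        s = state
        for _ in range(horizon):
            s = world_model(s, action)
        return score(s)
    return max(candidate_actions, key=rollout)
```

For example, with a toy one-dimensional world where the state is a position and each action is a step size, `select_action(0, [-1, 1, 2, 3], lambda s, a: s + a, lambda s: -abs(s - 6))` picks the step that lands closest to position 6 after three steps.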
From generalist to customized specialist models
Cosmos introduces a two-stage approach to world model training.
Generalist models: Cosmos world foundation models are built as generalists, trained on extensive datasets that encompass diverse real-world physics and environments. These open models are capable of handling a broad range of scenarios, from natural dynamics to robotic interactions, providing a solid foundation for any physical AI task.
Specialist models: Developers can fine-tune generalist models using smaller, targeted datasets to create specialists tailored for specific applications, such as autonomous driving or humanoid robotics. They can also generate customized synthetic scenarios, such as night scenes with emergency vehicles or high-fidelity industrial robotics environments. This fine-tuning process significantly reduces the required data and training time compared to training models from scratch.
Cosmos accelerates training and fine-tuning with efficient video processing pipelines, a highly performant tokenizer, and advanced training frameworks, enabling developers to address operational needs and edge cases as they advance physical AI.
Accelerated data processing with NVIDIA NeMo Curator
Training models requires curated, high-quality data, and preparing that data is time- and resource-intensive. NVIDIA Cosmos includes a data processing and curation pipeline powered by NVIDIA NeMo Curator and optimized for NVIDIA data center GPUs.
NVIDIA NeMo Curator enables robotics and AV developers to process vast datasets efficiently. For example, 20 million hours of video can be processed in 40 days on NVIDIA Hopper GPUs, or just 14 days on NVIDIA Blackwell GPUs—compared to 3.4 years on unoptimized CPU pipelines.
Key benefits include:
- 89x faster curation: Dramatically reduces processing time
- Scalability: Handles 100+ PB of data seamlessly
- High throughput: Advanced filtering, captioning, and embedding ensure quality without sacrificing speed
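As a quick sanity check, the 89x figure follows from the processing times quoted above, assuming a 365-day year. The variable names are illustrative.

```python
# Processing times for 20 million hours of video, as quoted in this post.
cpu_days = 3.4 * 365      # unoptimized CPU pipeline
hopper_days = 40          # NVIDIA Hopper GPUs
blackwell_days = 14       # NVIDIA Blackwell GPUs

speedup_hopper = cpu_days / hopper_days
speedup_blackwell = cpu_days / blackwell_days

print(f"Hopper speedup over CPU: about {speedup_hopper:.0f}x")
print(f"Blackwell speedup over CPU: about {speedup_blackwell:.0f}x")
```

The Blackwell ratio rounds to 89x, so the headline number corresponds to the Blackwell timing; the Hopper timing works out to roughly 31x.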
High-fidelity compression and reconstruction with Cosmos Tokenizer
After data is curated, it must be tokenized for training. Tokenization breaks down complex data into manageable units, enabling models to process and learn from it more efficiently.
Cosmos tokenizers simplify this process with faster compression and visual reconstruction while preserving quality, reducing costs and complexity. For autoregressive models, the discrete tokenizer compresses data 8x in time and 16×16 in space, processing up to 49 frames at once. For diffusion models, the continuous tokenizer achieves 8x time and 8×8 space compression, handling up to 121 frames.
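To see what these compression factors mean for sequence length, the sketch below computes the size of the latent grid for a clip. It assumes plain ceiling division along each axis and a hypothetical 640×1024 input resolution; the real tokenizers may treat the first frame or padding differently, so the numbers are indicative only.

```python
import math

def latent_grid(frames, height, width, time_factor, space_factor):
    """Latent (T, H, W) grid size under the given compression factors (ceil division assumed)."""
    return (math.ceil(frames / time_factor),
            math.ceil(height / space_factor),
            math.ceil(width / space_factor))

def token_count(grid):
    """Number of latent positions in a (T, H, W) grid."""
    t, h, w = grid
    return t * h * w

# Discrete tokenizer (8x time, 16x16 space) on a 49-frame clip.
discrete = latent_grid(49, 640, 1024, time_factor=8, space_factor=16)
# Continuous tokenizer (8x time, 8x8 space) on a 121-frame clip.
continuous = latent_grid(121, 640, 1024, time_factor=8, space_factor=8)
```

Under these assumptions, the discrete tokenizer reduces 49 frames to a 7×40×64 grid, and the continuous tokenizer reduces 121 frames to a 16×80×128 grid.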
Fine-tuning with NVIDIA NeMo
Developers can fine-tune Cosmos world foundation models using the NVIDIA NeMo Framework. NeMo Framework accelerates model training on GPU-powered systems, whether enhancing an existing model or building a new one, from on-premises data centers to the cloud.
NeMo Framework efficiently loads multimodal data by:
- Sharding terabyte-scale datasets into compressed files to reduce I/O overhead.
- Deterministically saving and loading datasets to avoid repetition and minimize compute waste.
- Reducing network bandwidth when exchanging data using optimized communications.
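The value of deterministic data loading is easiest to see in a small sketch. The code below is not the NeMo Framework API: the function names, the seed-mixing constant, and the round-robin shard assignment are hypothetical, chosen only to show how a fixed seed lets an interrupted job resume without repeating samples.

```python
import random

def shard_of(sample_index, num_shards):
    """Assign a sample to a shard by simple round-robin (illustrative choice)."""
    return sample_index % num_shards

def epoch_order(num_samples, seed, epoch):
    """Return the same shuffled sample order every time for a given seed and epoch."""
    order = list(range(num_samples))
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order

def resume_order(num_samples, seed, epoch, consumed):
    """Samples still to be read after `consumed` samples of this epoch were processed."""
    return epoch_order(num_samples, seed, epoch)[consumed:]
```

Because the order is a pure function of the seed and epoch, a job restarted from a checkpoint only needs to store how many samples it consumed, and no compute is spent re-reading data it has already seen.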
Get started with NVIDIA Cosmos
Cosmos world foundation models are open and available on NGC and Hugging Face. Developers can also run Cosmos world foundation models on the NVIDIA API catalog. Also available on the API catalog are Cosmos tools to enhance text prompts for accuracy, an inbuilt watermarking system that enables easy future identification of AI-generated sequences, and a specialized model to decode video sequences for augmented reality applications. To learn more, watch the demo.
NeMo Curator for accelerated data processing pipelines is available as a managed service and SDK. Developers can now apply for early access. Cosmos tokenizers are open neural networks available on GitHub and Hugging Face.
Get started with NVIDIA Cosmos.
Advancing Physical AI with NVIDIA Cosmos World Foundation Model Platform