As physical AI systems advance, the demand for richly labeled datasets is accelerating beyond what we can manually capture in the real world. World foundation models (WFMs), which are generative AI models trained to simulate, predict, and reason about future world states based on the dynamics of real-world environments, can help overcome this data challenge.
NVIDIA Cosmos is a platform for developing WFMs for physical AI applications such as robotics and autonomous vehicles. Cosmos WFMs include three model types that can be post-trained for specific applications: Cosmos Predict, Cosmos Transfer, and Cosmos Reason.
Cosmos Predict generates “future world states” as videos from image, video, and text prompts. Cosmos Transfer enables developers to perform photoreal style transfers from 2D inputs and text prompts. Cosmos Reason is a reasoning vision language model (VLM) that can curate and annotate the generated data, and can also be post-trained to function as a robot vision-language-action (VLA) model. This data is used to train physical AI and industrial vision AI systems to understand spatial relationships, plan motion trajectories, and perform complex tasks.
This edition of NVIDIA Robotics Research and Development Digest (R2D2) explores Cosmos WFMs and workflows from NVIDIA Research. We dive into how they play an important role in synthetic data generation (SDG) and data curation for physical AI applications:
- Cosmos Predict
  - Single2MultiView for autonomous vehicles
  - Cosmos-Drive-Dreams
  - NVIDIA Isaac GR00T-Dreams
  - DiffusionRenderer
  - Accelerated video generation
- Cosmos Transfer
  - Cosmos Transfer for autonomous vehicles
  - Edge model distillation
- Cosmos Reason

Cosmos Predict: future simulation models from NVIDIA Research for robotics
Cosmos Predict models can be post-trained for physical AI applications, like robotics and autonomous vehicles. Cosmos Predict takes input in the form of text, images, or videos and generates future frames that are coherent and physically accurate. This accelerates SDG for post-training AI models to perform complex physical tasks. Let’s see some examples of post-training.
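Under the hood, Cosmos Predict is a learned video generation model, but the autoregressive rollout pattern it relies on can be sketched with a stub predictor. In this minimal sketch, the linear extrapolation in `predict_next` stands in for real model inference:

```python
import numpy as np

def predict_next(frames):
    # Stub stand-in for a learned world model: linear extrapolation
    # from the last two frames instead of real model inference.
    return frames[-1] + (frames[-1] - frames[-2])

def rollout(conditioning, horizon):
    """Autoregressively generate `horizon` future frames from conditioning frames."""
    frames = list(conditioning)
    for _ in range(horizon):
        frames.append(predict_next(frames))
    return frames[len(conditioning):]

# Two tiny 2x2 "frames" whose pixel values increase by 1 per step.
future = rollout([np.zeros((2, 2)), np.ones((2, 2))], horizon=3)
```

Each generated frame is fed back as conditioning for the next step, which is why long rollouts are so sensitive to the per-step physical accuracy the model provides.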
Cosmos Predict post-training applications
Single2MultiView for autonomous vehicles is a post-trained version of the Cosmos Predict model. It generates multiple, consistent camera perspectives from a single front-view autonomous driving video. The result is synchronized multi-view camera footage for autonomous vehicle (AV) development.
Video 1. Multiple camera views generated from a single video by post-training Cosmos Predict

Inference example with a single-view input video:
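The exact checkpoint and script arguments vary by release, so they are not reproduced here. Conceptually, Single2MultiView fans a single front-view clip out into one generation job per target camera; a minimal sketch, where the camera names and field names are illustrative assumptions rather than the model's actual view set:

```python
def expand_to_views(clip_path, cameras):
    # One generation job per target camera; a real run would also pass
    # the text prompt and camera parameters to the model.
    return [{"input": clip_path, "target_view": cam} for cam in cameras]

# Illustrative camera rig; names are assumptions, not the actual view set.
cameras = ["front_left", "front_right", "rear", "rear_left", "rear_right"]
jobs = expand_to_views("drive_front.mp4", cameras)
```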
Figure 1. Diverse synthetic videos generated using Cosmos-Drive-Dreams
Figure 2. Neural trajectory (left) and real-robot execution (right) of a plant-watering task

Example of post-training GR00T on GR1 data:
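The GR00T-Dreams launch command is release-specific, so here is only a conceptual sketch of the pipeline: a world model imagines (“dreams”) a video from a single image plus an instruction, and actions are then extracted from the imagined frames to form training trajectories. Both functions below are stubs, not the actual GR00T-Dreams API:

```python
def dream_video(image_path, instruction):
    # Stub for the "dream" step: a world model imagines a video of the
    # robot executing the instruction, starting from a single image.
    return {"source": image_path, "instruction": instruction, "frames": 16}

def extract_actions(video, stride=4):
    # Stub for action extraction: in the real pipeline, actions are
    # recovered from the imagined frames to form training trajectories.
    return [f"action_{i}" for i in range(video["frames"] // stride)]

dream = dream_video("kitchen.png", "water the plant")
trajectory = extract_actions(dream)
```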
Figure 3. DiffusionRenderer is a framework for image and video de-lighting and re-lighting, built on Cosmos
Figure 4. Overview of the DiffusionRenderer method

Here is a sample command for video re-lighting. It applies novel lighting to the frames produced by the inverse renderer and generates relit video frames:
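The re-lighting command itself depends on the DiffusionRenderer release, but the forward-rendering step it performs can be illustrated with classic Lambertian shading over recovered G-buffers. This is a deliberately simplified stand-in for the neural forward renderer, not the actual implementation:

```python
import numpy as np

def relight(albedo, normals, light_dir):
    """Lambertian forward rendering: shade each pixel from G-buffers.

    albedo:    (H, W, 3) base color, as recovered by an inverse renderer
    normals:   (H, W, 3) unit surface normals
    light_dir: (3,) unit vector pointing toward the light
    """
    shading = np.clip(normals @ light_dir, 0.0, 1.0)  # per-pixel n.l term
    return albedo * shading[..., None]

albedo = np.full((4, 4, 3), 0.8)
normals = np.zeros((4, 4, 3))
normals[..., 2] = 1.0                       # every surface faces +z
frame = relight(albedo, normals, np.array([0.0, 0.0, 1.0]))
```

Swapping in a different `light_dir` (or an environment-lighting model) while keeping the same G-buffers is exactly what makes de-lighting followed by re-lighting useful for data augmentation.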
Cosmos Transfer: controlled synthetic data generation for robotics and AVs
Cosmos Transfer models generate world simulations based on multiple control inputs like segmentation maps, depth, edge maps, lidar scans, keypoints, and HD maps. These different modalities enable users to control scene composition while generating diverse visual features via user text prompts. The aim is to augment synthetic datasets with large visual diversity and improve overall sim-to-real transfer in robotics and autonomous driving applications.
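As a concrete flavor of one control modality, the sketch below computes a simple finite-difference edge map from a grayscale frame. Production pipelines typically use stronger edge detectors, but the idea of turning a frame into a structural control signal is the same:

```python
import numpy as np

def edge_map(gray):
    # Gradient magnitude via forward differences: a minimal edge-map
    # control signal from a grayscale frame.
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:] = gray[:, 1:] - gray[:, :-1]   # horizontal gradient
    gy[1:, :] = gray[1:, :] - gray[:-1, :]   # vertical gradient
    return np.hypot(gx, gy)

img = np.zeros((4, 6))
img[:, 3:] = 1.0                             # vertical step edge at column 3
edges = edge_map(img)
```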
Cosmos Transfer applications
Let’s now take a look at some workflows that use Cosmos Transfer.
Cosmos Transfer for AVs generates new conditions, such as weather, lighting, and terrain, from a single driving scenario using different text prompts. It uses multimodal controls as inputs to amplify data variation, as in the Cosmos-Drive-Dreams use case. This is helpful when creating AV training datasets because it can scale up data generation from a single video, based on user text prompts.
Figure 5. Cosmos Transfer generates diverse conditions and edge cases from the same input video and different text prompts like ‘a snowy day’ or ‘a nighttime scene’

Example command using Cosmos Transfer to generate an RGB video from a text prompt and an HD map condition video:
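Since the concrete CLI flags are release-specific, here is instead a hypothetical sketch of how the conditioning inputs fit together; the field names are assumptions, not the actual Cosmos Transfer API:

```python
def build_transfer_request(prompt, control_video, control_type, control_weight=1.0):
    # Hypothetical request structure: a text prompt plus one control
    # modality drives the generation. Field names are illustrative only.
    allowed = {"hdmap", "depth", "edge", "segmentation", "lidar", "keypoints"}
    assert control_type in allowed, f"unknown control type: {control_type}"
    return {
        "prompt": prompt,
        "control": {
            "type": control_type,
            "video": control_video,
            "weight": control_weight,   # how strongly the control constrains output
        },
    }

req = build_transfer_request("a snowy day", "scene01_hdmap.mp4", "hdmap")
```

The same control video paired with different prompts ("a snowy day", "a nighttime scene") is what produces the variation shown in Figure 5.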
Cosmos Reason: long-horizon reasoning for physical AI
As a world foundation model focused on reasoning for physical AI, Cosmos Reason understands physical common sense and generates appropriate embodied decisions through long chain-of-thought reasoning. This is useful for curating high-quality training data by using Cosmos Reason as a critic during SDG, as it understands action sequences and real-world constraints. The model has been trained in two stages: supervised fine-tuning (SFT) and reinforcement learning.
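As an illustration of the critic idea, the sketch below filters generated clips by a physical-plausibility score; the stub `critic_score` stands in for an actual Cosmos Reason evaluation of each clip:

```python
def critic_score(clip):
    # Stub: a real critic would run Cosmos Reason over the video and
    # score its physical plausibility; here the score is precomputed.
    return clip["plausibility"]

def curate(clips, threshold=0.7):
    # Keep only generated clips the critic judges physically plausible.
    return [c for c in clips if critic_score(c) >= threshold]

clips = [
    {"id": "a", "plausibility": 0.9},  # motion obeys gravity
    {"id": "b", "plausibility": 0.3},  # arm clips through the table
    {"id": "c", "plausibility": 0.8},
]
kept = curate(clips)
```

Filtering at generation time like this keeps implausible samples out of the training set, which matters more as SDG pipelines scale up.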
Figure 6. Overview of the Cosmos Reason architecture

SFT training can improve the Reason model’s performance on specific tasks. For example, training with the RoboVQA dataset can improve performance on robotics visual question answering use cases. Here is an example command to launch SFT training:
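The launch command varies with the repository version; as a hypothetical illustration of the two-stage recipe, a configuration might look like the following (all keys are assumptions, not the real schema):

```python
# Hypothetical training configuration; keys are illustrative assumptions,
# not the actual Cosmos Reason config schema.
sft_config = {
    "base_model": "cosmos-reason",
    "stage": "sft",            # stage 1: supervised fine-tuning
    "dataset": "robovqa",      # robotics visual question answering data
    "epochs": 2,
    "learning_rate": 1e-5,
}
# Stage 2 reuses the SFT settings but switches to reinforcement learning.
rl_config = {**sft_config, "stage": "rl"}
```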
Getting started
Check out the following resources to learn more:
- Cosmos Predict2: Project Website, GitHub, Hugging Face, Paper
- Cosmos Transfer1: Project Website, GitHub, Hugging Face, Paper
- Cosmos Reason1: Project Website, GitHub, Hugging Face, Paper
- Isaac GR00T-Dreams: GitHub, Paper
- Cosmos-Drive-Dreams: Project Website, GitHub, Paper, Dataset
- DiffusionRenderer: Project Website, GitHub, Paper, Hugging Face

Experience the next era of world foundation models with NVIDIA at SIGGRAPH 2025:
- A special address on Monday, Aug. 11, with NVIDIA AI research leaders Sanja Fidler, Aaron Lefohn, and Ming-Yu Liu, who’ll chart the next frontier in computer graphics and physical AI.
- Hands-on: Learn to use NVIDIA Cosmos, a platform of generative world foundation models, to generate data and scenarios for training physical AI.

This post is part of our NVIDIA Robotics Research and Development Digest (R2D2), which gives developers deeper insight into the latest breakthroughs from NVIDIA Research across physical AI and robotics applications.
Stay up to date by subscribing to the newsletter and following NVIDIA Robotics on YouTube, Discord, and developer forums. To start your robotics journey, enroll in free NVIDIA Robotics Fundamentals courses.
Acknowledgments
For their contributions to the research mentioned in this post, thanks to Niket Agarwal, Arslan Ali, Arsalan Mousavian, Alisson Azzolini, Yogesh Balaji, Hannah Brandon, Tiffany Cai, Tianshi Cao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yin Cui, Ying Cui, Yifan Ding, Daniel Dworakowski, Francesco Ferroni, Sanja Fidler, Dieter Fox, Ruiyuan Gao, Songwei Ge, Rama Govindaraju, Siddharth Gururani, Zekun Hao, Ali Hassani, Ethan He, Fengyuan Hu, Shengyu Huang, Spencer Huang, Michael Isaev, Pooya Jannaty, Brendan Johnson, Alexander Keller, Rizwan Khan, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Elena Lantz, Tobias Lasser, Nayeon Lee, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Zhi-Hao Lin, Zongyu Lin, Ming-Yu Liu, Xian Liu, Xiangyu Lu, Yifan Lu, Alice Luo, Ajay Mandlekar, Hanzi Mao, Andrew Mathau, Seungjun Nah, Avnish Narayan, Yun Ni, Sriprasad Niverty, Despoina Paschalidou, Tobias Pfaff, Wei Ping, Morteza Ramezanali, Fabio Ramos, Fitsum Reda, Zheng Ruiyuan, Amirmojtaba Sabour, Ed Schmerling, Tianchang Shen, Stella Shi, Misha Smelyanskiy, Shuran Song, Bartosz Stefaniak, Steven Sun, Xinglong Sun, Shitao Tang, Przemek Tredak, Wei-Cheng Tseng, Nandita Vijaykumar, Andrew Z. Wang, Guanzhi Wang, Ting-Chun Wang, Zian Wang, Fangyin Wei, Xinyue Wei, Wen Xiao, Stella Xu, Yao Xu, Yinzhen Xu, Dinghao Yang, Xiaodong Yang, Zhuolin Yang, Seonghyeon Ye, Yuchong Ye, Xiaohui Zeng, Yuxuan Zhang, Zhe Zhang, Ruijie Zheng, Yuke Zhu, and Artur Zolkowski.