As physical AI systems advance, the demand for richly labeled datasets is accelerating beyond what we can manually capture in the real world. World foundation models (WFMs), which are generative AI models trained to simulate, predict, and reason about future world states based on the dynamics of real-world environments, can help overcome this data challenge.
NVIDIA Cosmos is a platform for developing WFMs for physical AI applications such as robotics and autonomous vehicles. Cosmos WFMs include three model types that can be post-trained for specific applications: Cosmos Predict, Cosmos Transfer, and Cosmos Reason.
Cosmos Predict generates “future world states” as videos from image, video, and text prompts. Cosmos Transfer enables developers to perform photoreal style transfers from 2D inputs and text prompts. Cosmos Reason is a reasoning vision language model (VLM) that can curate and annotate the generated data, and can also be post-trained to function as a robot vision-language-action (VLA) model. This data is used to train physical AI and industrial vision AI systems to understand spatial relationships, plan motion trajectories, and perform complex tasks.
This edition of NVIDIA Robotics Research and Development Digest (R2D2) explores Cosmos WFMs and workflows from NVIDIA Research. We dive into how they play an important role in synthetic data generation (SDG) and data curation for physical AI applications:
- Cosmos Predict
  - Single2MultiView for autonomous vehicles
  - Cosmos-Drive-Dreams
  - NVIDIA Isaac GR00T-Dreams
  - DiffusionRenderer
  - Accelerated video generation
- Cosmos Transfer
  - Cosmos Transfer for autonomous vehicles
  - Edge model distillation
- Cosmos Reason

Cosmos Predict: future simulation models from NVIDIA Research for robotics
Cosmos Predict models can be post-trained for physical AI applications, like robotics and autonomous vehicles. Cosmos Predict takes input in the form of text, images, or videos and generates future frames that are coherent and physically accurate. This accelerates SDG for post-training AI models to perform complex physical tasks. Let’s see some examples of post-training.
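Under the hood, Cosmos Predict is a learned video generation model, but the autoregressive rollout pattern it relies on can be sketched with a stub predictor. In this minimal sketch, the linear extrapolation in `predict_next` stands in for real model inference:

```python
import numpy as np

def predict_next(frames):
    # Stub stand-in for a learned world model: linear extrapolation
    # from the last two frames instead of real model inference.
    return frames[-1] + (frames[-1] - frames[-2])

def rollout(conditioning, horizon):
    """Autoregressively generate `horizon` future frames from conditioning frames."""
    frames = list(conditioning)
    for _ in range(horizon):
        frames.append(predict_next(frames))
    return frames[len(conditioning):]

# Two tiny 2x2 "frames" whose pixel values increase by 1 per step.
future = rollout([np.zeros((2, 2)), np.ones((2, 2))], horizon=3)
```

Each generated frame is fed back as conditioning for the next step, which is why long rollouts are so sensitive to the per-step physical accuracy the model provides.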
Cosmos Predict post-training applications
Single2MultiView for autonomous vehicles is a post-trained version of the Cosmos Predict model. It generates multiple, consistent camera perspectives from a single front-view autonomous driving video. The result is synchronized multi-view camera footage for autonomous vehicle (AV) development.
Video 1. Multiple camera views generated from a single video by post-training Cosmos Predict

Inference example with a single-view input video:
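The exact checkpoint and script arguments vary by release, so they are not reproduced here. Conceptually, Single2MultiView fans a single front-view clip out into one generation job per target camera; a minimal sketch, where the camera names and field names are illustrative assumptions rather than the model's actual view set:

```python
def expand_to_views(clip_path, cameras):
    # One generation job per target camera; a real run would also pass
    # the text prompt and camera parameters to the model.
    return [{"input": clip_path, "target_view": cam} for cam in cameras]

# Illustrative camera rig; names are assumptions, not the actual view set.
cameras = ["front_left", "front_right", "rear", "rear_left", "rear_right"]
jobs = expand_to_views("drive_front.mp4", cameras)
```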
Figure 1. Diverse synthetic videos generated using Cosmos-Drive-Dreams
Figure 2. Neural trajectory (left) and real-robot execution (right) of a plant-watering task

Example of post-training GR00T on GR1 data:
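The GR00T-Dreams launch command is release-specific, so here is only a conceptual sketch of the pipeline: a world model imagines (“dreams”) a video from a single image plus an instruction, and actions are then extracted from the imagined frames to form training trajectories. Both functions below are stubs, not the actual GR00T-Dreams API:

```python
def dream_video(image_path, instruction):
    # Stub for the "dream" step: a world model imagines a video of the
    # robot executing the instruction, starting from a single image.
    return {"source": image_path, "instruction": instruction, "frames": 16}

def extract_actions(video, stride=4):
    # Stub for action extraction: in the real pipeline, actions are
    # recovered from the imagined frames to form training trajectories.
    return [f"action_{i}" for i in range(video["frames"] // stride)]

dream = dream_video("kitchen.png", "water the plant")
trajectory = extract_actions(dream)
```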
Figure 3. DiffusionRenderer is a framework for image and video de-lighting and re-lighting, built on Cosmos
Figure 4. Overview of the DiffusionRenderer method

Here is a sample command for video re-lighting. It applies novel lighting to the frames produced by the inverse renderer and generates relit video frames:
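The re-lighting command itself depends on the DiffusionRenderer release, but the forward-rendering step it performs can be illustrated with classic Lambertian shading over recovered G-buffers. This is a deliberately simplified stand-in for the neural forward renderer, not the actual implementation:

```python
import numpy as np

def relight(albedo, normals, light_dir):
    """Lambertian forward rendering: shade each pixel from G-buffers.

    albedo:    (H, W, 3) base color, as recovered by an inverse renderer
    normals:   (H, W, 3) unit surface normals
    light_dir: (3,) unit vector pointing toward the light
    """
    shading = np.clip(normals @ light_dir, 0.0, 1.0)  # per-pixel n.l term
    return albedo * shading[..., None]

albedo = np.full((4, 4, 3), 0.8)
normals = np.zeros((4, 4, 3))
normals[..., 2] = 1.0                       # every surface faces +z
frame = relight(albedo, normals, np.array([0.0, 0.0, 1.0]))
```

Swapping in a different `light_dir` (or an environment-lighting model) while keeping the same G-buffers is exactly what makes de-lighting followed by re-lighting useful for data augmentation.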
Cosmos Transfer: controlled synthetic data generation for robotics and AVs
Cosmos Transfer models generate world simulations based on multiple control inputs like segmentation maps, depth, edge maps, lidar scans, keypoints, and HD maps. These different modalities enable users to control scene composition while generating diverse visual features via user text prompts. The aim is to augment synthetic datasets with large visual diversity and improve overall sim-to-real transfer in robotics and autonomous driving applications.
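As a concrete flavor of one control modality, the sketch below computes a simple finite-difference edge map from a grayscale frame. Production pipelines typically use stronger edge detectors, but the idea of turning a frame into a structural control signal is the same:

```python
import numpy as np

def edge_map(gray):
    # Gradient magnitude via forward differences: a minimal edge-map
    # control signal from a grayscale frame.
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:] = gray[:, 1:] - gray[:, :-1]   # horizontal gradient
    gy[1:, :] = gray[1:, :] - gray[:-1, :]   # vertical gradient
    return np.hypot(gx, gy)

img = np.zeros((4, 6))
img[:, 3:] = 1.0                             # vertical step edge at column 3
edges = edge_map(img)
```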
Cosmos Transfer applications
Let’s now take a look at some workflows that use Cosmos Transfer.
Cosmos Transfer for AVs generates new conditions, such as weather, lighting, and terrain, from a single driving scenario using different text prompts. It uses multimodal controls as inputs to amplify data variation, as in the Cosmos-Drive-Dreams use case. This is helpful when creating AV training datasets because it can scale up data generation from a single video, based on user text prompts.
Figure 5. Cosmos Transfer generates diverse conditions and edge cases from the same input video and different text prompts like ‘a snowy day’ or ‘a nighttime scene’

Example command using Cosmos Transfer to generate an RGB video from a text prompt and an HD map condition video:
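Since the concrete CLI flags are release-specific, here is instead a hypothetical sketch of how the conditioning inputs fit together; the field names are assumptions, not the actual Cosmos Transfer API:

```python
def build_transfer_request(prompt, control_video, control_type, control_weight=1.0):
    # Hypothetical request structure: a text prompt plus one control
    # modality drives the generation. Field names are illustrative only.
    allowed = {"hdmap", "depth", "edge", "segmentation", "lidar", "keypoints"}
    assert control_type in allowed, f"unknown control type: {control_type}"
    return {
        "prompt": prompt,
        "control": {
            "type": control_type,
            "video": control_video,
            "weight": control_weight,   # how strongly the control constrains output
        },
    }

req = build_transfer_request("a snowy day", "scene01_hdmap.mp4", "hdmap")
```

The same control video paired with different prompts ("a snowy day", "a nighttime scene") is what produces the variation shown in Figure 5.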
Cosmos Reason: long-horizon reasoning for physical AI
As a world foundation model focused on reasoning for physical AI, Cosmos Reason understands physical common sense and generates appropriate embodied decisions through long chain-of-thought reasoning. This is useful for curating high-quality training data by using Cosmos Reason as a critic during SDG, as it understands action sequences and real-world constraints. The model has been trained in two stages: supervised fine-tuning (SFT) and reinforcement learning.
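As an illustration of the critic idea, the sketch below filters generated clips by a physical-plausibility score; the stub `critic_score` stands in for an actual Cosmos Reason evaluation of each clip:

```python
def critic_score(clip):
    # Stub: a real critic would run Cosmos Reason over the video and
    # score its physical plausibility; here the score is precomputed.
    return clip["plausibility"]

def curate(clips, threshold=0.7):
    # Keep only generated clips the critic judges physically plausible.
    return [c for c in clips if critic_score(c) >= threshold]

clips = [
    {"id": "a", "plausibility": 0.9},  # motion obeys gravity
    {"id": "b", "plausibility": 0.3},  # arm clips through the table
    {"id": "c", "plausibility": 0.8},
]
kept = curate(clips)
```

Filtering at generation time like this keeps implausible samples out of the training set, which matters more as SDG pipelines scale up.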
Figure 6. Overview of the Cosmos Reason architecture

SFT training can improve the Reason model’s performance on specific tasks. For example, training with the RoboVQA dataset can improve performance on robotics visual question answering use cases. Here is an example command to launch SFT training:
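The launch command varies with the repository version; as a hypothetical illustration of the two-stage recipe, a configuration might look like the following (all keys are assumptions, not the real schema):

```python
# Hypothetical training configuration; keys are illustrative assumptions,
# not the actual Cosmos Reason config schema.
sft_config = {
    "base_model": "cosmos-reason",
    "stage": "sft",            # stage 1: supervised fine-tuning
    "dataset": "robovqa",      # robotics visual question answering data
    "epochs": 2,
    "learning_rate": 1e-5,
}
# Stage 2 reuses the SFT settings but switches to reinforcement learning.
rl_config = {**sft_config, "stage": "rl"}
```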
Getting started
Check out the following resources to learn more:
- Cosmos Predict2: Project Website, GitHub, Hugging Face, Paper
- Cosmos Transfer1: Project Website, GitHub, Hugging Face, Paper
- Cosmos Reason1: Project Website, GitHub, Hugging Face, Paper
- Isaac GR00T-Dreams: GitHub, Paper
- Cosmos-Drive-Dreams: Project Website, GitHub, Paper, Dataset
- DiffusionRenderer: Project Website, GitHub, Paper, Hugging Face

Experience the next era of world foundation models with NVIDIA at SIGGRAPH 2025:
- A special address on Monday, Aug. 11, with NVIDIA AI research leaders Sanja Fidler, Aaron Lefohn, and Ming-Yu Liu, who’ll chart the next frontier in computer graphics and physical AI.
- Hands-on: Learn to use NVIDIA Cosmos, a platform of generative world foundation models, to generate data and scenarios for training physical AI.

This post is part of our NVIDIA Robotics Research and Development Digest (R2D2), which gives developers deeper insight into the latest breakthroughs from NVIDIA Research across physical AI and robotics applications.
Stay up to date by subscribing to the newsletter and following NVIDIA Robotics on YouTube, Discord, and developer forums. To start your robotics journey, enroll in free NVIDIA Robotics Fundamentals courses.
Acknowledgments
For their contributions to the research mentioned in this post, thanks to Niket Agarwal, Arslan Ali, Arsalan Mousavian, Alisson Azzolini, Yogesh Balaji, Hannah Brandon, Tiffany Cai, Tianshi Cao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yin Cui, Ying Cui, Yifan Ding, Daniel Dworakowski, Francesco Ferroni, Sanja Fidler, Dieter Fox, Ruiyuan Gao, Songwei Ge, Rama Govindaraju, Siddharth Gururani, Zekun Hao, Ali Hassani, Ethan He, Fengyuan Hu, Shengyu Huang, Spencer Huang, Michael Isaev, Pooya Jannaty, Brendan Johnson, Alexander Keller, Rizwan Khan, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Elena Lantz, Tobias Lasser, Nayeon Lee, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Zhi-Hao Lin, Zongyu Lin, Ming-Yu Liu, Xian Liu, Xiangyu Lu, Yifan Lu, Alice Luo, Ajay Mandlekar, Hanzi Mao, Andrew Mathau, Seungjun Nah, Avnish Narayan, Yun Ni, Sriprasad Niverty, Despoina Paschalidou, Tobias Pfaff, Wei Ping, Morteza Ramezanali, Fabio Ramos, Fitsum Reda, Zheng Ruiyuan, Amirmojtaba Sabour, Ed Schmerling, Tianchang Shen, Stella Shi, Misha Smelyanskiy, Shuran Song, Bartosz Stefaniak, Steven Sun, Xinglong Sun, Shitao Tang, Przemek Tredak, Wei-Cheng Tseng, Nandita Vijaykumar, Andrew Z. Wang, Guanzhi Wang, Ting-Chun Wang, Zian Wang, Fangyin Wei, Xinyue Wei, Wen Xiao, Stella Xu, Yao Xu, Yinzhen Xu, Dinghao Yang, Xiaodong Yang, Zhuolin Yang, Seonghyeon Ye, Yuchong Ye, Xiaohui Zeng, Yuxuan Zhang, Zhe Zhang, Ruijie Zheng, Yuke Zhu, and Artur Zolkowski.