Building a Synthetic Motion Generation Pipeline for Humanoid Robot Learning

General-purpose humanoid robots are designed to adapt quickly to existing human-centric urban and industrial work spaces, tackling tedious, repetitive, or physically demanding tasks. These mobile robots naturally excel in human-centric environments, making them increasingly valuable from factory floors to healthcare facilities.

Imitation learning, a subset of robot learning, enables humanoids to acquire new skills by observing and mimicking expert human demonstrations, whether from real videos of humans, from teleoperation demonstrations, or from simulated data. Imitation learning uses labeled datasets and is advantageous for teaching robots complex actions in diverse environments that are difficult to define programmatically.

While recording a demonstration may be simpler than specifying a reward function, creating a perfect demonstration can be challenging, and robots may struggle with unforeseen scenarios. Collecting extensive, high-quality datasets in the real world is tedious, time-consuming, and prohibitively expensive. However, synthetic data generated from physically accurate simulations can accelerate the data-gathering process.

The NVIDIA Isaac GR00T blueprint for synthetic manipulation motion generation is a reference workflow built on NVIDIA Omniverse and NVIDIA Cosmos. It creates exponentially large amounts of synthetic motion trajectories for robot manipulation from a small number of human demonstrations.

Using the first components available for the blueprint, NVIDIA was able to generate 780K synthetic trajectories (the equivalent of 6.5K hours, or 9 continuous months, of human demonstration data) in just 11 hours. Then, by combining the synthetic data with real data, NVIDIA improved GR00T N1 performance by 40% compared to using only real data.

Video 1. Streamline Data Collection with NVIDIA Isaac GR00T 

In this post, we describe how to use a spatial computing device, such as the Apple Vision Pro, or another capture device, such as a SpaceMouse, to portal into a simulated robot’s digital twin and record motion demonstrations by teleoperating the simulated robot. These recordings are then used to generate a larger set of physically accurate synthetic motion trajectories. The blueprint further augments the dataset by producing an exponentially large, photorealistic, and diverse set of training data. We then post-train a robot policy model using this data.

Blueprint overview

A diagram shows components such as GR00T-Teleop, GR00T-Mimic, and GR00T-Gen.
Figure 1. NVIDIA Isaac GR00T blueprint architecture

Key components of the workflow include the following:

  • GR00T-Teleop: Coming soon, but you can use the sample data provided in the blueprint now.
    • NVIDIA CloudXR Runtime: Streams Isaac Lab simulations to an Apple Vision Pro and receives control data for humanoid teleoperation.
    • Isaac XR Teleop sample app for Apple Vision Pro: Enables the user to interact immersively with Isaac Lab simulations streamed from CloudXR Runtime and sends back control data for humanoid teleoperation.
  • GR00T-Mimic: Uses the recorded demonstrations as input to generate additional synthetic motion trajectories in Isaac Lab. The first release of this blueprint is for single-arm manipulation only. Support for bi-manual humanoid robot manipulation is coming soon.
  • GR00T-Gen: Adds diversity by randomizing the background, lighting, and other variables in the scene, and augments the generated images through NVIDIA Cosmos Transfer.
  • Isaac Lab: An open-source, unified framework for robot learning, built on top of NVIDIA Isaac Sim, used to train robot policies.

A diagram shows CloudXR connecting the Apple Vision Pro to a GPU-accelerated system that runs Isaac Lab and Isaac Sim for humanoid teleoperation.
Figure 2. Teleoperation architecture

The workflow begins with data collection, where a high-fidelity device like the Apple Vision Pro is used to capture human movements and actions in a simulated environment. The Apple Vision Pro streams hand tracking data to a simulation platform such as Isaac Lab, which simultaneously streams an immersive view of the robot’s environment back to the device. This setup enables the intuitive and interactive control of the robot, facilitating the collection of high-quality teleoperation data.

The robot simulation in Isaac Lab is streamed to Apple Vision Pro, enabling you to visualize the robot’s environment. By moving your hands, you can intuitively control the robot to perform various tasks. This setup facilitates an immersive and interactive teleoperation experience.
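
As a rough illustration of what this control loop looks like, the sketch below converts tracked hand motion into delta end-effector commands and logs each observation/action pair as demonstration data. It is a minimal, numpy-only sketch: get_hand_position and the observation stub are placeholders, not CloudXR or Isaac Lab APIs; in the blueprint itself, Isaac Lab’s teleoperation tooling handles device input and demonstration recording.

```python
import numpy as np

def get_hand_position():
    """Placeholder for the tracked wrist position streamed from the headset.
    In the blueprint, CloudXR delivers full hand-tracking data to Isaac Lab."""
    return np.zeros(3)

def make_action(prev_pos, curr_pos, grip_closed, scale=1.0):
    """Convert the change in hand position to a delta end-effector command.
    Orientation handling is omitted to keep the sketch short."""
    delta_pos = scale * (curr_pos - prev_pos)
    gripper = 1.0 if grip_closed else -1.0
    # [dx, dy, dz, droll, dpitch, dyaw, gripper]
    return np.concatenate([delta_pos, np.zeros(3), [gripper]])

# Record one demonstration as a list of (observation, action) pairs.
demo, prev_pos = [], get_hand_position()
for t in range(200):
    curr_pos = get_hand_position()
    action = make_action(prev_pos, curr_pos, grip_closed=False)
    obs = {"step": t}  # stand-in for the observation returned by the simulator step
    demo.append((obs, action))
    prev_pos = curr_pos
```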

A GIF shows a human using an Apple Vision Pro and a robot in a simulated environment mimicking the human’s actions.
Figure 3. Teleoperation in Isaac Lab

Synthetic manipulation motion trajectory generation using GR00T-Mimic

After the data is collected, the next step is synthetic trajectory generation. Isaac GR00T-Mimic is used to extrapolate from a small set of human demonstrations to create a vast number of synthetic motion trajectories.

This process involves annotating key points in the demonstrations and using interpolation to ensure that the synthetic trajectories are smooth and contextually appropriate. The generated data is then evaluated and refined to meet the criteria required for training.
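
As a simplified illustration of the interpolation step, the sketch below takes a handful of annotated key poses (for example, pre-grasp, grasp, and lift waypoints re-anchored to a randomized cube position) and densely interpolates end-effector positions between them. This is a numpy-only sketch of the idea, not the GR00T-Mimic implementation; orientation interpolation, collision checking, and subtask segmentation are omitted.

```python
import numpy as np

def interpolate_waypoints(keyposes, steps_per_segment=50):
    """Linearly interpolate end-effector positions between annotated key poses.

    keyposes: (N, 3) array of key end-effector positions.
    Returns an (M, 3) array of densely sampled positions forming one trajectory.
    """
    segments = []
    for start, end in zip(keyposes[:-1], keyposes[1:]):
        alphas = np.linspace(0.0, 1.0, steps_per_segment, endpoint=False)
        segments.append(start + alphas[:, None] * (end - start))
    segments.append(keyposes[-1:])  # include the final key pose
    return np.concatenate(segments, axis=0)

# Example: key poses re-anchored to a randomized cube position.
cube_pos = np.array([0.45, 0.10, 0.05])
keyposes = np.stack([
    np.array([0.30, 0.00, 0.30]),            # start
    cube_pos + np.array([0.0, 0.0, 0.10]),   # pre-grasp above the cube
    cube_pos,                                # grasp
    cube_pos + np.array([0.0, 0.0, 0.20]),   # lift
])
trajectory = interpolate_waypoints(keyposes)
```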

In this example, we successfully generated 1K synthetic trajectories.

A GIF shows multiple trajectories of a manipulator arm stacking blocks on top of each other.
Figure 4. A set of synthetic trajectories generated in Isaac Lab

Augmenting and generating a large and diverse dataset

To reduce the simulation-to-real gap, it’s critical to augment the synthetically generated images to achieve the necessary photorealism and to increase diversity by randomizing various parameters, such as lighting, color, and background.

Typically, this process entails building photorealistic 3D scenes and objects, which requires considerable time and expertise. With the Cosmos Transfer world foundation model (WFM), this process can be accelerated considerably, from hours to minutes, with simple text prompts.
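
The sketch below illustrates the general pattern: sample randomization choices, compose a text prompt describing the target appearance, and pass each rendered frame through Cosmos Transfer. The prompt wording, the randomization lists, and the stylize_with_cosmos_transfer stub are all illustrative assumptions; replace the stub with the actual inference call for your Cosmos Transfer deployment.

```python
import random

# Randomization choices for scene diversity; values here are illustrative only.
BACKGROUNDS = ["warehouse shelf", "kitchen counter", "marble tabletop"]
LIGHTING = ["soft overhead light", "harsh industrial lighting", "warm evening light"]

def build_prompt():
    """Compose a text prompt describing the target appearance of a rendered frame."""
    return (
        f"A photorealistic robot arm stacking colored cubes on a "
        f"{random.choice(BACKGROUNDS)} under {random.choice(LIGHTING)}."
    )

def stylize_with_cosmos_transfer(frame_path: str, prompt: str) -> str:
    """Placeholder for invoking Cosmos Transfer on one rendered frame.
    Swap in the inference call for your deployment (local checkpoint or
    hosted endpoint); this stub only reports what would be requested."""
    return f"{frame_path} -> styled with prompt: {prompt}"

for i in range(3):
    print(stylize_with_cosmos_transfer(f"renders/frame_{i:04d}.png", build_prompt()))
```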

Figures 5 and 6 show an example of the photorealism that can be achieved by passing the synthetically generated image through NVIDIA Cosmos Transfer WFM.

A robotic arm manipulates colored cubes on a beige surface, demonstrating movement and coordination.
Figure 5. Synthetically generated image created in Isaac Lab
A robotic arm manipulates colored cubes on a marble surface, demonstrating movement and coordination.
Figure 6. Synthetic image passed through photorealism with NVIDIA Cosmos Transfer WFM

Post-training in Isaac Lab using imitation learning

Finally, the synthetic dataset is used to train the robot using imitation learning techniques. In this stage, a policy, such as a recurrent Gaussian mixture model (GMM) from the Robomimic suite, is trained to mimic the actions demonstrated in the synthetic data. The training is conducted in a simulation environment such as Isaac Lab, and the performance of the trained policy is evaluated through multiple trials.

To show how this data can be used, we trained a Franka robot with a gripper to perform a stacking task in Isaac Lab. We used Behavioral Cloning with a recurrent GMM policy from the Robomimic suite. The policy uses two long short-term memory (LSTM) layers with a hidden dimension of 400.

The input to the network consists of the robot’s end-effector pose, gripper state, and relative object poses while the output is a delta pose action used to step the robot in the Isaac Lab environment.
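
As a rough sketch of how such a policy can be configured with robomimic’s config system, the snippet below enables the recurrent GMM variant of Behavioral Cloning with the dimensions described above. Key names follow robomimic’s BC config and may differ across versions, and the dataset path is hypothetical.

```python
from robomimic.config import config_factory

# Start from robomimic's default Behavioral Cloning config and enable the
# recurrent GMM variant (2 LSTM layers, hidden dimension 400).
config = config_factory(algo_name="bc")
with config.values_unlocked():
    config.train.data = "datasets/stack_franka_synthetic.hdf5"  # hypothetical path
    config.algo.rnn.enabled = True
    config.algo.rnn.rnn_type = "LSTM"
    config.algo.rnn.num_layers = 2
    config.algo.rnn.hidden_dim = 400
    config.algo.gmm.enabled = True  # GMM action head on top of the recurrent network
```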

With a dataset consisting of 1K successful demonstrations and 2K iterations, we achieved a training speed of approximately 50 iterations/sec (equivalent to approximately 0.5 hours of training time on the NVIDIA RTX 4090 GPU). Averaging over 50 trials, the trained policy achieved an 84% success rate for the stacking task.

GIF of a manipulator arm stacking blocks on top of each other.
Figure 7. Gripper trained in Isaac Lab

Workflow benefits

The major advantage of this method lies in the time saved during data collection, while success rates remain strong across various manipulation tasks, from stacking cubes to threading needles.

Traditionally, properly trained human operators take about one minute to record a single high-quality demonstration, which is difficult to scale due to the significant human effort required and the potential for errors. In contrast, this new method achieves similar success rates using a combination of a few human demonstrations and synthetic data, reducing data collection time from hours to minutes.

The inclusion of NVIDIA Cosmos enables you to augment synthetic images to achieve the required photorealism, effectively reducing the simulation-to-real gap using just text prompts. This approach significantly streamlines the data collection process and enables you to generate large, diverse datasets while maintaining or enhancing the quality of the resulting robot policies.

Developers adopting the blueprint

Humanoid developers such as Agibot, Mentee Robotics, UCR, and X-Humanoid have integrated the components of the blueprint into their humanoid robot development pipelines.

Other companies, such as Field AI, Lab0, Miso Robotics, RIVR, and Sanctuary AI, are also leveraging the Isaac simulation framework. They are using it to develop robot brains and software stacks and to test and validate physical robots.

Get started

In this post, we discussed how to collect, generate, and augment the data required to train a single-arm manipulator with the NVIDIA Isaac GR00T blueprint for synthetic manipulation motion generation.

The first release of this blueprint is for single-arm manipulation only. Support for bi-manual humanoid robot manipulation is coming soon.

For more detailed information about NVIDIA Isaac GR00T, watch the GTC Keynote from NVIDIA CEO Jensen Huang and key GTC sessions, including An Introduction to Building Humanoid Robots.

Stay up to date by subscribing to our newsletter and following NVIDIA Robotics on YouTube, Discord, and the developer forums.
