Synthetic Reality for ML Training

Simulation experts discuss safeguards in using synthetic data for autonomous vehicle training.

Simulation experts discuss safeguards in using synthetic datafor autonomous vehicle training.

At the 2024 Computer Vision and Pattern Recognition Conference, NVIDIA revealed NVIDIA Omniverse Cloud Sensor RTX, a set of microservices that allow you to simulate physically accurate sensor interactions. Image courtesy of NVIDIA.


It’s becoming the status quo in machine learning (ML) or artificial intelligence (AI) training, to use synthetic data as a way to augment real-world data. This is especially true of autonomous vehicle (AV) development, where drive-data collection is time-consuming and costly.

“The industry’s long-standing reliance on road test drives … has grown over the past decade into an estimated average annual testing cost of $800M to $2.7B per program (largely depending on the mix of physical world testing and simulations),” reports Foretellix, which specializes in verification and validation of autonomous driving systems (“Navigating the AV Industry’s Cost Crisis,” September 2023).

In this article, we look at the practice of producing synthetic data based on real-world and simulation data to understand how it affects the resulting navigation algorithm’s reliability.

Types of Data Used in ML

In ML, AV makers often employ a mix of real-world, simulation, and synthetic data. The gold standard is real-world data, captured from the sensors on a test vehicle during a test drive. Carmakers may also use drive-simulation software, such as NVIDIA Drive, to generate virtual drive data, which adheres to the rules of real-world physics. This is often referred to as simulation data or synthetic data.

“In some cases, you are bootstrapping deep neural network training in the absence of real data pertaining to a scenario, where you might rely more on synthetic data,” says Gautham Sholingar, senior product manager of AV Simulation and Omniverse at NVIDIA. “In other cases, a small amount of high-quality synthetic data can complement an existing real-world dataset and can be used to fill specific gaps or augment with variations of an existing scenario.”

In simulation, it usually takes less than actual drive time to simulate a drive. For instance, it takes far less than 20 minutes to simulate a 20-minute drive, especially in graphics processing unit (GPU)-accelerated programs such as NVIDIA Drive. For the most part, a real 20-minute drive might yield 90%–95% of normal driving activities, with only a few minutes of training-worthy maneuvers. The advantage of using simulation is, you can skip the normal driving activities and concentrate solely on the uncommon scenarios that are most likely to stump the vehicle.

In Ansys’ company blog post titled “Simulating Reality: The Importance of Synthetic Data in AI/ML Systems for Radar Applications,” author Arien Sligar, senior principal product specialist at Ansys, writes, “Simulation can generate large quantities of synthetic data that cover a wide range of scenarios, including instances that are rare, difficult or dangerous to observe in the real world.”

Ansys AVxcelerate lets you generate synthetic data to augment real-world drive data for autonomous vehicle training. Image courtesy of Ansys.

Collecting data for a normal drive is not difficult, but it is costly. It requires equipping a special vehicle with computer modules and sensors to record the changing acceleration, deceleration, and lane changes, along with the obstacles that prompt these maneuvers. Accidents and crashes, on the other hand, are nearly impossible to anticipate; therefore, the associated data cannot easily be captured. It can only be obtained through simulation, or from published cases. This is what Sligar means by “instances that are rare, difficult or dangerous to observe in the real world.”

As principal application engineer at SmartUQ, Brian Leyde develops software tools that can validate the reliability of simulation results, such as those generated with finite element analysis or computational fluid dynamics software. The discipline is known as verification, validation and uncertainty quantification.

SmartUQ offers software to verify the reliability of simulation data. Shown here is the Bayesian distribution (orange) mapped onto the original Design of Experiments. Image courtesy of SmartUQ.

“If the source of the synthetic data is physics-based simulation, then the process of validating its uncertainty is the same as validating the simulation data itself. We would use Frequentist and Bayesian methods to look at the uncertainty distributions,” Leyde says. “This is where cross-validation is incredibly important. With cross-validation, you get a general idea of the base rate of errors,” which could be used to quantify the uncertainty and incorporated into the prediction algorithms.

Ghost Targets

Entering a tunnel is a routine event for human drivers, but for sensors on an AV, it could be a source of confusion. “As the vehicle approaches the tunnel, the electromagnetic waves can bounce around in that environment and create what’s called ghost targets in the radar perception. So it could think there’s a car where there isn’t, or it may not see a car that’s there,” Sligar says.

These ghost targets are an artifact that results from multiple reflections of the electromagnetic waves, appearing at locations that are not consistent with the real world, says Sligar.

For the purpose of ML training, the dataset needs to cover variations of the same scenario that the vehicle might encounter. “Think of the shape of the tunnel. Sometimes it’s round; sometimes it’s square,” says Sligar. “Those types of variations, along with the vehicle’s speed of entry and the presence of pedestrians, are very easy to produce in simulation as part of your scene generation.”

Suppose you have enough real-world drive data for nighttime driving, but the daytime drive data is a bit thin for ML. “This is exactly where you might use simulation,” says Sligar. “You could change the sun’s position as an input parameter to create variations in the synthetic data.”

So long as the synthetic data is based on real-world physics—in sensor simulation, that means the physics of light—Sligar is confident the data is reliable. However, he cautions, as best practice demands, “the simulated data needs to be validated to ensure it accurately represents real-world events.”

Similarly, Leyde warns, “At the highest level, you don’t want to use synthetic data as the only source when training anything for safety-critical applications.”

Tools to Create Synthetic Data

Ansys offers Ansys AVxcelerate, developed specifically to test and validate sensor perception with physically accurate sensor simulation. It also offers Ansys Perceive EM, which can simulate real-time radar and wideband 5G/6G channels. It uses physical optics-based shooting and bouncing rays (SBR) technology, accessible through a lightweight application programming interface. Both products are GPU-accelerated.

“Our radar-based solvers are written purely as GPU solvers. They’re very fast and approaching real time for physics-based simulation,” says Sligar. The solvers in AVxcelerate are based on the time-tested Ansys product HFSS SBR+, an asymptotic high-frequency electromagnetic (EM) simulator.

In January 2024, Ansys and NVIDIA announced that Ansys AVxcelerate Sensors will be accessible within NVIDIA DRIVE Sim, a scenario-based AV simulator powered by NVIDIA Omniverse, an immersive simulation environment.

Synthetic data generation often involves modeling the physics of sensors, such as cameras, lidar sensors and radar equipment, along with their interactions with the environment. Such compute-intensive tasks can be accelerated with dedicated hardware like NVIDIA RTX GPUs. NVIDIA also offers microservices such as Omniverse Cloud Sensor RTX (announced at the 2024 Computer Vision and Pattern Recognition Conference) for developers to build and enhance their own simulation applications with high-fidelity sensor simulation.

SmartUQ software can also generate the synthetic data you need based on training ML models with existing data from simulations or tests.

“You should stay close to the conditions in which the physical data is collected. You should stay within the same proven and well-understood physics regime as the ML training data. In automotive and aerospace applications for example, the physics regimes could be supersonic vs. subsonic flights, or the behaviors of tires on the road for different conditions,” Leyde advises.

Suppose you have real-world drive data on how an AV’s tires behave on dry, dusty, and gravel roads. Leyde says you could be fairly confident in the reliability of the synthetic data you generate by varying the road’s dryness, dustiness, and gravel density. But if you attempt to predict how the vehicle might behave on wet roads based on that data, “then you are extrapolating beyond the known scenarios,” he warns.

Why Don’t We Share?

Considering the growing number of autonomous vehicle developers, the real-world data that someone needs may already exist, but accessing it is another matter. A Consumer Reports article explains, “Having each developer go it alone could add years to the process, while exchanging information could speed development … However, sharing test data raises significant privacy concerns not only for drivers but car companies anxious to protect their trade secrets” (“Should Developers of Driverless Cars Share Test Data?,” December 2016).

Leyde points out that many simulation software makers, such as Ansys or COMSOL, have collected and published case studies as part of the process to validate their physics engines and simulation solvers. If the data is publicly available, it could save new carmakers lots of expenses in testing and validation. 

More Ansys Coverage

More NVIDIA Coverage

Share This Article

Subscribe to our FREE magazine, FREE email newsletters or both!

Join over 90,000 engineering professionals who get fresh engineering news as soon as it is published.


About the Author

Kenneth Wong's avatar
Kenneth Wong

Kenneth Wong is Digital Engineering’s resident blogger and senior editor. Email him at kennethwong@digitaleng.news or share your thoughts on this article at digitaleng.news/facebook.

      Follow DE
#29408