Robotics · 2026-01-30

The Sim-to-Real Gap: What We've Learned Transferring Intelligence to Hardware

Bridging the divide between simulated training and real-world deployment — lessons from 10,000 hours of physical testing.

Webbeon Robotics Team

Every policy trained in simulation carries an implicit assumption: that the simulator is the world. It never is. Contact dynamics are approximated. Sensor noise is modeled statistically rather than reproduced physically. Actuators respond with idealized torque curves that real motors, under load and heat, do not honor. The gap between simulated and real performance is not a bug in the methodology; it is a tension fundamental to training embodied intelligence in software and deploying it on hardware.

Over the past eighteen months, Webbeon has accumulated more than 10,000 hours of physical testing across locomotion, manipulation, and navigation tasks with the Object platform. This article distills what we have learned about where the gap lives, what closes it, and what remains stubbornly open.

Domain Randomization and Its Limits

Domain randomization, the practice of varying simulation parameters (friction, mass, sensor noise, actuator delay) across training episodes so that the policy learns to be robust to uncertainty, is the standard first tool for sim-to-real transfer. We use it extensively. Object's locomotion policy is trained across a distribution of ground friction coefficients spanning 0.2 to 1.4, mass perturbations of ±15%, and actuator latencies from 5 to 25 milliseconds. This produces policies that are broadly robust: they walk reliably on surfaces from polished concrete to wet grass without retraining.

But domain randomization has a ceiling. It assumes that the real world falls within the randomized distribution, and it assumes that the simulation's physics engine correctly captures the structure of the dynamics even as the parameters vary. Both assumptions break down. We found that randomizing friction coefficients does not help when the real failure mode is anisotropic friction: a surface that is slippery in one direction and grippy in another, like brushed metal grating. The simulator modeled friction as isotropic, so no amount of parameter randomization could produce the directional slip behaviors the robot encountered on physical catwalks. This class of structural sim-to-real gap requires not wider randomization but better simulation.
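The per-episode randomization described above can be sketched as follows. This is an illustrative sketch, not Webbeon's training code; the `SimParams` structure and `sample_episode_params` helper are hypothetical names, but the ranges match those quoted in the text.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Physics parameters resampled at the start of each training episode."""
    ground_friction: float      # isotropic friction coefficient
    mass_scale: float           # multiplier on nominal link masses
    actuator_latency_s: float   # control-to-torque delay, in seconds


def sample_episode_params(rng: random.Random) -> SimParams:
    # Ranges from the article: friction 0.2-1.4, mass ±15%, latency 5-25 ms.
    return SimParams(
        ground_friction=rng.uniform(0.2, 1.4),
        mass_scale=rng.uniform(0.85, 1.15),
        actuator_latency_s=rng.uniform(0.005, 0.025),
    )
```

Note that every parameter here is a scalar drawn independently per episode, which is exactly why this scheme cannot express structural mismatches such as direction-dependent friction.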

Physical Testing Methodology

Our physical testing pipeline is designed to be systematic rather than anecdotal. Every policy variant is evaluated on a standardized set of 48 physical test scenarios spanning flat ground, stairs, slopes, uneven terrain, confined spaces, and dynamic obstacles. Each scenario is run a minimum of 20 times to capture stochastic variation. We instrument the robot with ground-truth motion capture alongside its onboard sensing, allowing us to diagnose exactly where perception diverges from reality and where control diverges from intent.

The most valuable data comes from failures. We maintain a failure taxonomy with five top-level categories:
  • Perception failures: the robot misestimated the geometry or properties of the environment.
  • Planning failures: the chosen strategy was infeasible given the true conditions.
  • Control failures: the low-level controller could not execute the planned motion.
  • Hardware failures: actuator faults, sensor dropouts.
  • Interaction failures: the robot's contact with the environment produced dynamics outside the training distribution.

Over 10,000 hours of testing, the distribution has shifted. Early deployments were dominated by control failures: policies that worked in simulation produced unstable gaits on hardware due to actuator bandwidth limits. After improving our actuator modeling, the dominant failure mode became interaction failures, particularly in manipulation tasks involving deformable or granular materials. This progression is itself informative: it shows which layers of the sim-to-real gap yield to engineering effort and which constitute deeper research problems.
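To show how a taxonomy like this supports tracking the failure distribution over time, here is a minimal sketch that aggregates per-trial outcomes into per-category failure rates. The `summarize_failures` helper and its log format are assumptions for illustration, not part of Webbeon's pipeline.

```python
from collections import Counter

# The five top-level categories from the article's failure taxonomy.
FAILURE_CATEGORIES = {"perception", "planning", "control", "hardware", "interaction"}


def summarize_failures(trial_log):
    """Compute per-category failure rates over all trials.

    trial_log: list of (scenario_id, category) pairs, where category is one
    of FAILURE_CATEGORIES for a failed trial or None for a success.
    """
    counts = Counter()
    for _scenario, category in trial_log:
        if category is None:
            continue  # successful trial
        if category not in FAILURE_CATEGORIES:
            raise ValueError(f"unknown failure category: {category}")
        counts[category] += 1
    total = len(trial_log)
    return {cat: counts[cat] / total for cat in sorted(counts)}
```

Run periodically over the test log, a summary like this is what reveals the shift described above, from control-dominated failures to interaction-dominated ones.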

Failure Modes in Hardware Deployment

Three failure modes deserve particular attention because they are underrepresented in the sim-to-real literature.

First, thermal drift: as motors heat up during extended operation, their torque-speed characteristics shift. A policy trained against a fixed motor model gradually loses calibration over a 30-minute session. We now include thermal models in our simulation and randomize operating temperature, which reduced thermally induced failures by 68%.

Second, cable and harness dynamics: the physical robot has power cables, sensor wiring, and pneumatic lines routed across its body. These have mass, stiffness, and inertia that the simulation historically ignored. When Object performs high-acceleration maneuvers, cable whip can shift the center of mass by several centimeters, enough to destabilize a dynamic gait. We addressed this by adding simplified cable models to the simulator, treating harnesses as series of damped spring segments attached to the kinematic chain.

Third, perceptual aliasing under environmental stress: dust, rain, fog, and direct sunlight degrade camera and lidar performance in ways that Gaussian noise models do not capture. Raindrops on a camera lens produce structured artifacts (streaks, refractions, partial occlusions) that a noise-trained policy interprets as phantom obstacles. We are developing physically based sensor degradation models, but this remains an active research area.
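The cable model described for the second failure mode, a harness treated as a series of damped spring segments, can be sketched in one dimension. This is a deliberately simplified illustration (explicit Euler integration, point masses, 1D motion), not the simulator's actual implementation; `step_cable` and its parameters are hypothetical.

```python
def step_cable(positions, velocities, rest_len, k, c, mass, dt):
    """One explicit-Euler step for a 1D chain of damped spring segments.

    positions[0] / velocities[0] belong to the attachment point on the robot
    body: it is kinematically driven by the robot, so it is not integrated here.
    """
    n = len(positions)
    forces = [0.0] * n
    for i in range(n - 1):
        # Spring-damper force in segment (i, i+1), applied to both endpoints.
        stretch = (positions[i + 1] - positions[i]) - rest_len
        rel_vel = velocities[i + 1] - velocities[i]
        f = k * stretch + c * rel_vel
        forces[i] += f
        forces[i + 1] -= f
    new_pos = list(positions)
    new_vel = list(velocities)
    for i in range(1, n):  # node 0 follows the robot, not the spring forces
        new_vel[i] += (forces[i] / mass) * dt
        new_pos[i] += new_vel[i] * dt
    return new_pos, new_vel
```

The reaction force `forces[0]` accumulated on the attachment node is what would be fed back onto the robot's kinematic chain, which is how cable whip shifts the effective center of mass during high-acceleration maneuvers.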

The Path Forward

The sim-to-real gap is not a single problem with a single solution. It is a collection of mismatches between the simulated and physical worlds, each requiring its own remedy. Some are solved by better simulation: higher-fidelity physics engines, more accurate sensor models, tighter actuator characterization. Some are solved by more robust training: wider domain randomization, adversarial perturbations, multi-fidelity curricula that blend simulated and real data. And some will ultimately be solved by adaptation: policies that recognize when they are operating outside their training distribution and adjust online.

Object's current architecture includes a residual adaptation module, a small network that observes the discrepancy between predicted and actual next states and applies corrective adjustments to the control signal. This module is trained during physical deployment and has reduced the aggregate sim-to-real performance gap from 31% to 14% across our benchmark suite.

Closing the remaining gap is among the hardest open problems in embodied AI, and we expect it to be a multi-year effort. But the trajectory is clear: each round of physical testing reveals the specific deficiencies in our simulation, each simulation improvement narrows the gap, and the policies that emerge from the cycle are measurably more capable in the physical world.
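As a rough illustration of the residual-adaptation idea, the sketch below maintains an exponential moving average of the one-step prediction error and subtracts it from the control signal. Object's actual module is a learned network trained during deployment; this linear stand-in, including the `ResidualAdapter` name and its `alpha` parameter, is an assumption for exposition only.

```python
class ResidualAdapter:
    """Toy residual corrector (illustrative, not Webbeon's implementation).

    Tracks the systematic discrepancy between the dynamics model's predicted
    next state and the observed next state, then compensates the control.
    """

    def __init__(self, dim, alpha=0.1):
        self.bias = [0.0] * dim  # running estimate of per-dimension model error
        self.alpha = alpha       # EMA smoothing factor

    def observe(self, predicted_next, actual_next):
        # Update the error estimate from one step of deployment data.
        for i in range(len(self.bias)):
            err = actual_next[i] - predicted_next[i]
            self.bias[i] = (1 - self.alpha) * self.bias[i] + self.alpha * err

    def correct(self, control):
        # Apply the corrective adjustment to the nominal control signal.
        return [u - b for u, b in zip(control, self.bias)]
```

Even this crude version captures the key property named in the text: the correction is learned online, during physical deployment, from discrepancies the simulator never modeled.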

Related Research
  • Object's First Steps: Learning Dexterous Manipulation from Scratch (2026-03-12)
  • Navigation Without Maps: Embodied Intelligence in Unstructured Environments (2026-02-25)