Object's First Steps: Learning Dexterous Manipulation from Scratch
How Object learns to manipulate physical objects with human-like dexterity — without human demonstrations.
Most approaches to robotic manipulation begin with human demonstrations. A person teleoperates a robot arm through a task — grasping a mug, turning a valve, threading a cable — and the system learns to imitate. This works, up to a point. The robot acquires a narrow behavioral repertoire bounded by what the demonstrator thought to show it. At Webbeon, we took a different path with Object. We wanted manipulation policies that emerge from the physics of interaction itself, not from the biases of human motor habits. The result is a system that discovers grasping strategies no human engineer would design — and that generalizes to object geometries it has never encountered.
Zero-Shot Motor Control Through Reinforcement Learning
Object's manipulation stack is trained entirely through reinforcement learning in high-fidelity physics simulation. The agent controls a 24-degree-of-freedom hand mounted on a 7-DOF arm, receiving reward signals tied to task completion rather than trajectory matching. The critical insight is the reward structure: we define what constitutes success (the bolt is threaded, the object is reoriented to the target pose) without specifying how to achieve it. Over billions of simulated episodes, Object discovers manipulation strategies — fingertip pivots, controlled sliding, multi-contact stabilization — that are physically efficient but often unlike any human grasp taxonomy. We use a curriculum that begins with simplified contact models and progressively introduces friction anisotropy, deformable surfaces, and variable mass distributions. The policy network itself is a transformer architecture operating over proprioceptive state and tactile embeddings at 200 Hz, fast enough for the reactive adjustments that dexterous manipulation demands.
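To make the success-over-trajectory idea concrete, here is a minimal sketch of what a success-based reward for the reorientation task might look like. This is an illustrative reconstruction, not Object's actual reward code: the function name, tolerance, bonus, and shaping coefficient are all assumptions.

```python
import numpy as np

def reorientation_reward(obj_quat, target_quat, angle_tol=0.1):
    """Success-based reward for object reorientation (illustrative sketch).

    Success is defined purely by the outcome -- the object's orientation is
    within angle_tol radians of the target -- with no term that matches a
    reference trajectory. Returns (reward, success).
    """
    # Angle of the relative rotation between current and target orientation,
    # computed from the quaternion inner product.
    dot = min(abs(float(np.dot(obj_quat, target_quat))), 1.0)
    angle = 2.0 * np.arccos(dot)
    success = angle < angle_tol
    # A small dense shaping term keeps early gradients informative, but the
    # dominant signal is the success bonus: *what* to achieve, not *how*.
    return (10.0 if success else 0.0) - 0.1 * angle, success
```

Because the reward never references a demonstration trajectory, any contact sequence that brings the object to the target pose scores equally well, which is what leaves room for the non-human strategies described below.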
The Role of Tactile Sensing
Vision alone is insufficient for manipulation. When fingers close around an object, occlusion is total — the hand blocks the camera's view of exactly the contact geometry that matters. Object addresses this through dense tactile sensing: 192 taxels per fingertip, each reporting normal force, shear, and vibration at 1 kHz. The tactile signal is processed by a dedicated encoder that produces a compact contact-state representation fused with the proprioceptive stream. This is what allows Object to perform tasks that require force modulation — holding a raw egg without cracking it, then moments later torquing a stubborn lid. During training, we found that tactile-enabled policies converge to stable grasps 3.2x faster than vision-only baselines and exhibit far fewer catastrophic drops during transfer to physical hardware. Vibration sensing proved unexpectedly valuable: Object learns to detect incipient slip — the micro-vibrations that precede a grasp failure — and preemptively tightens its grip or shifts contact points before the object moves.
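One common way to detect incipient slip from high-rate tactile data is to threshold the high-frequency energy of the vibration signal against a calibrated baseline. The sketch below illustrates that general approach; the function name, windowing, and threshold are hypothetical and not drawn from Object's actual pipeline.

```python
import numpy as np

def detect_incipient_slip(vibration_window, baseline_energy, threshold=3.0):
    """Flag incipient slip from a window of 1 kHz taxel vibration samples.

    Illustrative sketch: slip precursors appear as broadband micro-vibrations,
    so we high-pass the signal with a first difference and compare its energy
    to a baseline measured during a known-stable grasp.
    """
    hf = np.diff(vibration_window)          # crude high-pass filter
    energy = float(np.mean(hf ** 2))        # high-frequency band energy
    return energy > threshold * baseline_energy
```

A policy consuming this flag can tighten its grip or shift contact points within a few control cycles, before macroscopic sliding begins.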
Emergent Strategies and Generalization
The most striking outcome of demonstration-free training is the manipulation strategies that emerge. For reorienting small objects, Object consistently discovers a "finger-gaiting" behavior where it walks the object across its fingertips in a coordinated sequence, maintaining three-point contact at all times. For heavy cylindrical objects, it develops a palm-bracing strategy that distributes load across the thenar surface rather than relying on fingertip pinch force. None of these strategies were programmed or demonstrated — they fall out of the optimization process as solutions to the underlying physics. Generalization follows naturally: because the policy is trained across randomized object shapes, masses, and friction coefficients, it develops an implicit understanding of manipulation physics rather than memorizing object-specific routines. In our benchmark suite covering 147 household objects, Object achieves an 89% first-attempt grasp success rate on objects entirely outside the training distribution, compared to 61% for the strongest imitation-learning baseline.
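The randomization over shapes, masses, and friction coefficients mentioned above can be sketched as a per-episode sampler of object parameters. The dataclass fields and the ranges here are illustrative assumptions, not the actual training distribution.

```python
import random
from dataclasses import dataclass

@dataclass
class ObjectParams:
    """Physical properties resampled each training episode (illustrative)."""
    mass_kg: float
    friction: float
    scale: float

def sample_object_params(rng: random.Random) -> ObjectParams:
    """Draw randomized object properties for one simulated episode."""
    # Log-uniform mass spans light plastics through dense metals.
    mass = 10.0 ** rng.uniform(-2.0, 0.5)   # ~0.01 kg .. ~3.2 kg
    friction = rng.uniform(0.2, 1.2)        # slippery .. grippy surfaces
    scale = rng.uniform(0.7, 1.3)           # geometric scaling of the mesh
    return ObjectParams(mass_kg=mass, friction=friction, scale=scale)
```

Training across such a distribution forces the policy to condition its behavior on sensed contact dynamics rather than memorized object identities, which is what makes zero-shot transfer to unseen geometries plausible.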
From Simulation to the Physical World
Transferring these policies to real hardware introduces challenges we address in detail in our sim-to-real work, but the manipulation results deserve their own accounting. On the physical Object platform, we evaluated dexterous tasks across three difficulty tiers: basic pick-and-place, tool use (screwdriver insertion, scissors operation), and assembly (multi-part snap-fit connectors). Basic manipulation succeeds at rates within 4% of simulation performance. Tool use shows a wider gap — approximately 12% — primarily attributable to contact dynamics around rigid tool-object interfaces that our simulator models imperfectly. Assembly tasks remain the hardest frontier, with a 23% performance gap driven by the compounding of small pose errors across sequential operations. Each of these gaps is a research signal, pointing precisely to where our simulation fidelity must improve. What is already clear is that the demonstration-free approach produces policies with a qualitatively different character: they are robust to perturbation, adaptive to unexpected object properties, and capable of recovery behaviors that no human thought to demonstrate.