Anonymous Submission · Under Review

Data and Learning Where it Matters for Contact-Rich Manipulation

Anonymous Author(s)

Affiliation and author identities withheld for double-blind review

We use expensive robot data only for the most challenging section of the task, and rely on non-robot data wherever possible. The result is a policy that is robust and generalizable, yet very precise.

Paper (PDF) Watch Video Code & Data Coming soon
All videos are sped up.
Long-horizon task

Using natural-language prompts, the robot solves a very long-horizon sequence of challenging high-precision tasks in a changing and cluttered scene — using robot data only where necessary to achieve the required precision. For the rest of the motion, we rely on simple motion planning including collision avoidance.

We never collected any data in this specific scene!

Abstract

Learned policies trained end-to-end on large datasets often remain brittle in high-precision tasks and struggle with generalization. We find that these limitations largely stem from a lack of structure and focus in data collection. Our key insight is to leverage dense data collection only for the critical, contact-rich segment of a task and to rely on traditional planning during simple free-space motion.

We propose an automated data-collection scheme combined with offline deep reinforcement learning for the critical segment — eliminating reliance on a teleoperator's skill and on online policy updates. Across four challenging real-world tasks, using only 2–2.5 h of autonomous data collection, we achieve an average success rate of 96%, compared to the strongest baseline at 55%. Notably, performance remains high in out-of-distribution scenarios where end-to-end approaches struggle.

1
Human demonstration to bootstrap
96%
Average success rate across four tasks
55%
Strongest baseline (end-to-end), same tasks
2–2.5h
Autonomous data collection per task

Robustness & Generalization

Robot data and learning are only used where absolutely required to achieve the necessary precision for our tasks. This approach allows the policy to achieve the required precision while generalizing to entirely novel scenes — settings where end-to-end policies typically fail.

Our method generalizes across dimensions: scene-level distractors (objects and backgrounds), target object location, pick-up object location, collision avoidance, dynamic grasping from a human hand, and sequential task compositions.

Lego stacking · OOD

Cluttered scene, distractors

The workspace is filled with unseen objects and extra bricks — the policy still finds and stacks the target precisely.

Lego stacking · OOD

Picking from a human hand

The policy works robustly when picking up the part from a moving human hand.

Shelf stocking · OOD

Scene changes

A different table, lighting, and a mix of novel grocery items and distractors — the items are still stowed correctly.

Fan-cover insertion · OOD

Distractor objects & clutter

Surrounded by cables, boxes and stray parts, the policy still aligns and seats the cover with tight clearance.

Fan-cover insertion · OOD

Part handed over

A human hands the part to the robot at a novel pose — pose estimation re-localizes and the insertion succeeds.

Cluttered scene · Planning

Collision avoidance

In a densely cluttered workspace, motion planning routes the arm around obstacles to reach each target collision-free.

Method

Single Demo, Autonomous Data Collection, Deployment

Method overview: Initialization from a single demonstration, autonomous data collection at the critical segment, and sequential deployment combining planning and learning.
Dense data collection & robot learning for the critical segment; the rest is solved without using any robot data.

Rollouts · Our Method

High success rates

The key is to leverage dense data collection at the difficult segment, so that the policy never goes out-of-distribution for that part of the task. Success rates below are measured over 50 trials per task.

In the paper, we show that learning with offline DRL from densely collected data is required to achieve high success rates!

Lego stacking

94% success (100% partial)

Precise brick-on-brick insertion with tight clearance.

Shelf stocking

98% success (100% partial)

Stowing a yellow cardboard salt box into a tightly packed box.

Fan-cover insertion

96% success (98% partial)

Aligning and seating a PBT/PC-molded part under tight tolerances.

(Almost) Autonomous Data Collection

No teleoperator required

We show that dense data collection through explorative offline DRL is required to achieve high success rates for our tasks. Operator intervention is only required for resetting the scene in some tasks.

All the policies above are deployed using data collected from the setups shown below!

Lego stacking

Self-collected interaction data

Shelf stocking

Self-collected interaction data

Fan-cover insertion

Self-collected interaction data

Single kinesthetic teaching demonstration

The scene is calibrated using a single human demonstration from kinesthetic teaching. Operator intervention during data collection is only required for resetting the scene for some tasks.

Baselines

Comparison to end-to-end

The same high-precision and out-of-distribution tasks cause strong end-to-end baselines to fail — they lack precision at the contact segment and break down under distractors. Further, our method naturally provides a success classification through Q-function evaluation.

It is not clear whether simple scaling of end-to-end data collection is practical or solves the performance gap!

Lego stacking · Baseline

Lacks precision

Lego stacking · Baseline

New object positions (OOD)

Lego stacking · Baseline

Added distractor bricks (OOD)

Lego stacking · Baseline

Clutter & novel objects (OOD)

Shelf stocking · Baseline

Novel item in shelf (OOD)

Fan-cover insertion · Baseline

Similar-looking distractor object (OOD)

For instance, end-to-end policies fail to correctly detect target objects and instead grasp a similar-looking object (bottom right). End-to-end policies need much training data to avoid such failure cases, whereas other models trained without any robot data — such as SAM3 for object detection — are far more robust by default.