ICRA Workshop 2026

Dreaming the Sound of Contact

Leveraging video and audio generation for zero-shot force-aware manipulation.

Authors · Affiliation

Concept overview: three stacked panels showing generated video and audio, a robot contact point on a whiteboard with force-direction and audio-magnitude annotations, and three real-world execution tasks — wiping, peeling, and lamp pressing.

Abstract

Audio is the force signal video can't see.

Recent advances in video generation enable learning robot manipulation trajectories from generated videos. However, these approaches produce purely kinematic trajectories that lack force information, leading to failure in contact-rich tasks where appropriate contact forces are essential for success. Generated audio carries a complementary and underexplored signal: contact sounds encode force dynamics that video alone cannot capture.

We present a pipeline that jointly leverages generated video and audio to recover both motion trajectories and contact force profiles from a single task description. We execute these force-aware trajectories on a Franka Panda robot using a closed-loop force regulator that tracks the audio-derived force profile during contact. Real-robot experiments on whiteboard wiping, carrot peeling, and lamp button pressing show that our force-aware pipeline succeeds at contact-rich manipulation from generated video and audio where a kinematic-only baseline fails.

Pipeline

Video for motion, audio for force.

Our vision pipeline segments objects (SAM 2), estimates depth (Depth Pro), and tracks 3D points (SpaTracker) to detect contacts and infer force directions. The audio pipeline extracts loudness as a force-magnitude proxy. A 1 kHz impedance controller closes the loop on the audio-derived force profile.
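
As a rough illustration of the audio branch, the sketch below turns a mono waveform into a per-frame loudness curve and rescales it into a target force profile. The choice of RMS as the loudness measure, the 20 ms frame length, the linear mapping, and the 1 to 10 N force range are all assumptions made for this example; the page does not specify them, and the function names are hypothetical.

```python
import math


def rms_loudness(samples, sample_rate, frame_s=0.02):
    """Per-frame RMS of a mono waveform (floats in [-1, 1]).

    RMS is an assumed stand-in for "loudness"; frame_s is an assumed frame length.
    """
    frame_len = max(1, round(sample_rate * frame_s))
    loudness = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        loudness.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return loudness


def loudness_to_force(loudness, f_min=1.0, f_max=10.0):
    """Linearly rescale loudness onto [f_min, f_max] newtons (assumed range)."""
    lo, hi = min(loudness), max(loudness)
    if hi - lo < 1e-12:
        # Flat audio carries no relative information; fall back to the minimum force.
        return [f_min for _ in loudness]
    return [f_min + (f_max - f_min) * (v - lo) / (hi - lo) for v in loudness]


def synthetic_contact_audio(sample_rate=1000):
    """Hypothetical test signal: 0.1 s of quiet tone followed by 0.1 s of loud tone."""
    quiet = [0.1 * math.sin(2 * math.pi * 50 * n / sample_rate) for n in range(100)]
    loud = [0.8 * math.sin(2 * math.pi * 50 * n / sample_rate) for n in range(100)]
    return quiet + loud


if __name__ == "__main__":
    audio = synthetic_contact_audio()
    profile = loudness_to_force(rms_loudness(audio, sample_rate=1000))
    print([round(f, 2) for f in profile])
```

Because the mapping is min-max normalized per clip, it only recovers relative force within one generated clip; an absolute scale would have to come from a per-task force range like the assumed one above.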

Pipeline diagram: vision branch segments, tracks, and estimates force direction; audio branch extracts loudness; the combined force-aware trajectory runs on a Franka Panda with closed-loop force regulation.
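
To make the idea of closing the loop on a force profile concrete, here is a minimal simulation sketch. It is not the controller used on the robot: it swaps in a simple integral force loop acting on a linear-spring contact model, and the environment stiffness, loop gain, and number of steps per target are illustrative assumptions. The function name is hypothetical.

```python
def regulate_force(f_desired, k_env=2000.0, gain=1e-4, steps_per_target=200):
    """Track a sequence of target normal forces against a simulated spring contact.

    f_desired        : target forces in newtons, e.g. an audio-derived profile
    k_env            : assumed contact stiffness in N/m
    gain             : assumed integral gain in m/N per control step
    steps_per_target : control steps spent on each target (200 steps = 0.2 s at 1 kHz)

    Returns the simulated contact force at the end of each target interval.
    """
    depth = 0.0  # commanded penetration along the contact normal, in metres
    reached = []
    for f_des in f_desired:
        for _ in range(steps_per_target):
            f_meas = k_env * max(depth, 0.0)  # spring model; no force when out of contact
            depth += gain * (f_des - f_meas)  # push deeper if force is low, retract if high
        reached.append(k_env * max(depth, 0.0))
    return reached


if __name__ == "__main__":
    targets = [2.0, 8.0, 5.0]
    print([round(f, 3) for f in regulate_force(targets)])
```

With these values the force error shrinks by a factor of 1 - gain * k_env = 0.8 per step, so the loop settles well within each interval. A kinematic-only rollout corresponds to never updating `depth` from the measured force, which is why it cannot compensate for errors in the estimated surface position.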

Experiments

Three contact-rich tasks.

Task 1

Whiteboard wiping

Sustained downward pressure while translating laterally.

6 / 6 ours · 0 / 6 baseline

Audio loudness (orange bars) and measured contact force (red line) during whiteboard wiping.

Ours · force-aware

Coming soon

Baseline · kinematic only

Coming soon

Task 2

Carrot peeling

Sustained normal force against a curved surface during drag.

5 / 6 ours · 1 / 6 baseline

Audio loudness (orange) ramps up as the peeler engages; measured force (red) tracks it.

Ours · force-aware

Coming soon

Baseline · kinematic only

Coming soon

Task 3

Lamp button pressing

Impulsive force to overcome a spring-loaded button.

4 / 6 ours · 0 / 6 baseline

A sharp audio peak at contact; measured force shows the impulsive push.

Ours · force-aware

Coming soon

Baseline · kinematic only

Coming soon

Cite

@inproceedings{dreaming2026sound,
  title     = {Dreaming the Sound of Contact: Leveraging Video and Audio Generation
               for Zero-Shot Force-Aware Manipulation},
  author    = {TODO},
  booktitle = {ICRA Workshop},
  year      = {2026}
}