ICRA Workshop 2026
Dreaming the Sound of Contact
Leveraging video and audio generation for zero-shot force-aware manipulation.
Abstract
Audio is the force signal video can't see.
Recent advances in video generation enable learning robot manipulation trajectories from generated videos. However, these approaches produce purely kinematic trajectories that lack force information, so they fail in contact-rich tasks where success depends on applying appropriate contact forces. Generated audio carries a complementary and underexplored signal: contact sounds encode force dynamics that video alone cannot capture.
We present a pipeline that jointly leverages generated video and audio to recover both motion trajectories and contact force profiles from a single task description. We execute these force-aware trajectories on a Franka Panda robot using a closed-loop force regulator that tracks the audio-derived force profile during contact. Real-robot experiments on whiteboard wiping, carrot peeling, and lamp button pressing demonstrate that our force-aware pipeline enables successful contact-rich manipulation from video generation where a kinematic-only baseline fails.
Pipeline
Video for motion, audio for force.
Our vision pipeline segments objects (SAM 2), estimates depth (Depth Pro), and tracks 3D points (SpaTracker) to detect contacts and infer force directions. The audio pipeline extracts loudness from the generated audio as a proxy for contact-force magnitude. A 1 kHz impedance controller closes the loop on the audio-derived force profile during execution.
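To make the audio branch concrete, here is a minimal sketch of how loudness might be turned into a force reference. It is an illustration under stated assumptions, not the authors' implementation: the short-time RMS loudness measure, the linear loudness-to-force mapping, the force range `f_min`/`f_max`, the PI gains, and all function names are hypothetical. Only the "loudness as force-magnitude proxy" idea and the 1 kHz control rate come from the text above.

```python
import numpy as np


def rms_envelope(audio, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Short-time RMS loudness of a mono waveform (assumed loudness measure)."""
    frame = max(1, int(sample_rate * frame_ms / 1000.0))
    hop = max(1, int(sample_rate * hop_ms / 1000.0))
    if len(audio) < frame:
        audio = np.pad(audio, (0, frame - len(audio)))
    n_frames = 1 + (len(audio) - frame) // hop
    env = np.empty(n_frames)
    for i in range(n_frames):
        segment = audio[i * hop : i * hop + frame]
        env[i] = np.sqrt(np.mean(segment ** 2))
    # Timestamp each frame at its center, in seconds.
    times = (np.arange(n_frames) * hop + frame / 2) / sample_rate
    return times, env


def loudness_to_force(env, f_min=1.0, f_max=8.0):
    """Linearly map peak-normalized loudness to a force range in newtons.

    f_min and f_max are hypothetical, task-specific bounds.
    """
    peak = env.max()
    if peak <= 0.0:
        return np.full_like(env, f_min)
    return f_min + (f_max - f_min) * env / peak


def resample_profile(times, force, control_hz=1000.0):
    """Interpolate the force profile onto the controller's time grid."""
    t_ctrl = np.arange(times[0], times[-1], 1.0 / control_hz)
    return t_ctrl, np.interp(t_ctrl, times, force)


def force_regulator_step(f_desired, f_measured, integral, dt, kp=0.002, ki=0.01):
    """One PI step on force error (a stand-in for the real regulator).

    Returns a position offset along the contact normal (meters) and the
    updated integral term. Gains are illustrative only.
    """
    error = f_desired - f_measured
    integral = integral + error * dt
    return kp * error + ki * integral, integral


# Synthetic example: half a second of silence, then a 440 Hz contact "sound".
sample_rate = 16000
t = np.arange(8000) / sample_rate
audio = np.concatenate([np.zeros(8000), 0.5 * np.sin(2 * np.pi * 440 * t)])

times, env = rms_envelope(audio, sample_rate)
force = loudness_to_force(env)
t_ctrl, force_ctrl = resample_profile(times, force)
offset, integral = force_regulator_step(5.0, 3.0, 0.0, dt=0.001)
```

In this sketch the silent segment maps to the lower force bound and the loudest frame maps to the upper bound. A real system would also need to gate the profile by the contact intervals detected in the vision branch, and to calibrate the mapping per task.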
Experiments
Three contact-rich tasks.
Cite
@inproceedings{dreaming2026sound,
  title     = {Dreaming the Sound of Contact: Leveraging Video and Audio Generation
               for Zero-Shot Force-Aware Manipulation},
  author    = {TODO},
  booktitle = {ICRA Workshop},
  year      = {2026}
}