SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

Qualitative audio examples.

Enhancement

DCASE noise with one active speaker

Noisy
System Video Audio ASR Transcription
Noisy loading…
Enhanced
System Video Audio ASR Transcription
Reference loading…
FlowAvse loading…
RAVSS loading…
Proposed loading…

DNS noise with one active speaker

Noisy
System Video Audio ASR Transcription
Noisy loading…
Enhanced
System Video Audio ASR Transcription
Reference loading…
FlowAvse loading…
FlowAvse2 (trained on 2 speakers) loading…
FlowAvse3 (trained on 3 speakers) loading…
RAVSS loading…
Proposed loading…