SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

Qualitative audio examples.

3-Speaker Separation

DCASE noise with three active speakers

Mixture
System Video Audio ASR Transcription
Mixture
Speaker 1
System Video Audio ASR Transcription
Reference
FlowAvse
RAVSS
Proposed
Speaker 2
System Video Audio ASR Transcription
Reference
FlowAvse
RAVSS
Proposed
Speaker 3
System Video Audio ASR Transcription
Reference
FlowAvse
RAVSS
Proposed

DNS noise with three active speakers

Mixture
System Video Audio ASR Transcription
Mixture
Speaker 1
System Audio ASR Transcription
Reference
FlowAvse
FlowAvse1 (trained on 1 speaker)
RAVSS
Proposed
Speaker 2
System Audio ASR Transcription
Reference
FlowAvse
FlowAvse1 (trained on 1 speaker)
FlowAvse2 (trained on 2 speakers)
RAVSS
Proposed
Speaker 3
System Audio ASR Transcription
Reference
FlowAvse
FlowAvse1 (trained on 1 speaker)
FlowAvse2 (trained on 2 speakers)
RAVSS
Proposed