SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

Qualitative audio examples.

2-Speaker Separation

DCASE noise with two active speakers

Mixture
System Video Audio ASR Transcription
Mixture
Speaker 1
System Video Audio ASR Transcription
Reference
FlowAvse
RAVSS
Proposed
Speaker 2
System Video Audio ASR Transcription
Reference
FlowAvse
RAVSS
Proposed

DNS noise with two active speakers, SNR=0dB

Mixture
System Video Audio ASR Transcription
Mixture
Speaker 1
System Audio ASR Transcription
Reference
FlowAvse
FlowAvse1 (trained on 1 speaker)
FlowAvse3 (trained on 3 speakers)
AVuDiffSE
DPS
RAVSS
Proposed
Speaker 2
System Audio ASR Transcription
Reference
FlowAvse
FlowAvse1 (trained on 1 speaker)
FlowAvse3 (trained on 3 speakers)
AVuDiffSE
DPS
RAVSS
Proposed

DNS noise with two active speakers, SNR=5dB

Mixture
System Video Audio ASR Transcription
Mixture
Speaker 1
System Audio ASR Transcription
Reference
FlowAvse
FlowAvse1 (trained on 1 speaker)
FlowAvse3 (trained on 3 speakers)
AVuDiffSE
DPS
RAVSS
Proposed
Speaker 2
System Audio ASR Transcription
Reference
FlowAvse
FlowAvse1 (trained on 1 speaker)
FlowAvse3 (trained on 3 speakers)
AVuDiffSE
DPS
RAVSS
Proposed