SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

Qualitative audio examples.

Off-Screen Speaker Separation

2 Speakers (Speaker 2 off-screen)

Mixture
System Video Audio ASR Transcription
Mixture
Speaker 1 (On-Screen)
System Audio ASR Transcription
Reference
RAVSS
Proposed
Speaker 2 (Off-Screen)
System Audio ASR Transcription
Reference
RAVSS
Proposed

3 Speakers (Speaker 3 off-screen)

Mixture
System Video Audio ASR Transcription
Mixture
Speaker 1 (On-Screen)
System Audio ASR Transcription
Reference
RAVSS
Proposed
Speaker 2 (On-Screen)
System Audio ASR Transcription
Reference
RAVSS
Proposed
Speaker 3 (Off-Screen)
System Audio ASR Transcription
Reference
RAVSS
Proposed