Qualitative audio examples.
This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling: we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of one, two, and three speakers plus noise and show that, despite being entirely unsupervised, our method consistently surpasses leading supervised baselines in Word Error Rate (WER) across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the separated noise component is of high enough fidelity to support downstream acoustic scene detection. Code and pretrained models will be made available upon acceptance.
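To make the sampling idea above concrete, here is a minimal sketch of source separation with per-source diffusion priors, under strong simplifying assumptions: `priors` is assumed to hold one pretrained noise-prediction model per source (e.g., a speech prior and an ambient-noise prior), and the linear sigma schedule and residual-splitting mixture-consistency step are illustrative placeholders, not the paper's actual sampler or API.

```python
import torch

def separate_sources(mixture, priors, num_steps=50, sigma_max=1.0, guide=1.0):
    """Illustrative sketch (not the paper's code): jointly sample K sources
    whose sum should match an observed single-microphone mixture, using one
    pretrained diffusion prior per source. Each `prior(x, sigma)` is assumed
    to predict the noise component of x at level sigma.
    """
    K = len(priors)
    sigmas = torch.linspace(sigma_max, 0.0, num_steps + 1)
    # Start every source estimate from pure Gaussian noise.
    xs = [sigma_max * torch.randn_like(mixture) for _ in range(K)]

    for i in range(num_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # Each prior denoises its own source estimate (Tweedie-style x0 guess).
        eps = [priors[k](xs[k], sigma) for k in range(K)]
        x0 = [xs[k] - sigma * eps[k] for k in range(K)]
        # Mixture consistency: spread the reconstruction residual across the
        # sources so their sum moves toward the observed mixture.
        residual = mixture - sum(x0)
        x0 = [x0_k + guide * residual / K for x0_k in x0]
        # Deterministic (DDIM-like) step back to the next noise level.
        xs = [x0[k] + sigma_next * eps[k] for k in range(K)]
    return xs
```

In the paper's setting, K would correspond to the speakers plus one noise source; the inverse sampler the authors reformulate also conditions on visual cues, which this sketch omits.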
Please use the tabs above to navigate between enhancement and multi-speaker separation scenarios.