Deep learning has transformed speech processing across enhancement, separation, synthesis, and recognition. This topic tracks the evolution of speech ML across several interrelated threads: speech enhancement methods from masking-based to generative and language-model-based approaches, packet loss concealment, speech quality evaluation metrics, neural audio codecs, and the emerging application of reinforcement learning and preference optimization to speech models.
Speech enhancement has progressed from early GAN-based approaches (SEGAN) through complex-domain models (DCCRN), metric-guided training (MetricGAN+), and diffusion-based generation (CDiffuSE, SGMSE+). More recently, language-model-based methods that leverage discrete speech tokens (SELM, LLaSE-G1) and unified generative architectures (AnyEnhance, UniFlow) are opening new frontiers in general speech restoration and enhancement.
The broader speech ML ecosystem relies on shared infrastructure: neural audio codecs (LPCNet, DAC) for efficient speech representation, standardized quality metrics (PESQ, DNSMOS, PLCMOS) for evaluation, and large-scale datasets and challenges (DNS Challenge, MUSAN, WHAM!) for benchmarking. Speech recognition (Whisper) and synthesis (Llasa) continue to advance as foundation models scale.
Preference optimization for speech is a nascent but fast-growing direction, catalyzed by the success of RLHF and DPO in large language models. SpeechAlign (2024) was among the first to apply preference alignment to codec language models for TTS, and the field has since expanded to include diffusion-guided RL (DLPO), direct metric optimization (DMOSpeech), fine-grained token-level preference optimization (FPO), and industry-scale applications (Koel-TTS). The two threads converge as generative and LM-based speech architectures naturally enable preference-based training paradigms.
@inproceedings{pascual2017segan,
title = {{SEGAN: Speech Enhancement Generative Adversarial Network}},
author = {Pascual, S. and Bonafonte, A. and Serra, J.},
booktitle = {{Interspeech}},
year = {2017}
}
@article{luo2019convtasnet,
title = {{Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation}},
author = {Luo, Y. and Mesgarani, N.},
journal = {{IEEE/ACM Transactions on Audio, Speech, and Language Processing}},
volume = {27},
number = {8},
pages = {1256--1266},
year = {2019}
}
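The masking-in-a-learned-basis idea behind Conv-TasNet can be sketched in a few lines of NumPy. This is a toy: the random basis and fixed scalar masks below stand in for the trained 1-D conv encoder and the separator network, and frames do not overlap.

```python
import numpy as np

def encode(mixture, basis):
    # Project fixed-length frames of the waveform onto a learned basis
    # (a random matrix here stands in for the trained conv encoder).
    frame = basis.shape[1]
    n = len(mixture) // frame
    return mixture[: n * frame].reshape(n, frame) @ basis.T

def decode(latent, basis):
    # Toy non-overlapping decoder: project back and concatenate frames.
    return (latent @ basis).reshape(-1)

rng = np.random.default_rng(0)
basis = rng.standard_normal((64, 16))       # 64 filters, 16-sample frames
mixture = rng.standard_normal(160)
latent = encode(mixture, basis)             # (10 frames, 64 filters)
masks = [np.full(latent.shape, 0.6), np.full(latent.shape, 0.4)]
sources = [decode(latent * m, basis) for m in masks]
```

Because the decoder is linear and the masks sum to one, the separated sources add back up to the decoded mixture, mirroring how time-frequency masks partition a spectrogram.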
Demucs-SE
— Real Time Speech Enhancement in the Waveform Domain
06/2020 · Facebook AI Research · defossez2020real
Causal encoder-decoder model based on the Demucs architecture that performs real-time speech enhancement directly on raw waveforms, running on a single laptop CPU core.
@inproceedings{defossez2020real,
title = {{Real Time Speech Enhancement in the Waveform Domain}},
author = {Defossez, A. and Synnaeve, G. and Adi, Y.},
booktitle = {{Interspeech}},
year = {2020}
}
DCCRN
— DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
08/2020 · Northwestern Polytechnical University · hu2020dccrn
Complex-valued convolution recurrent network that jointly enhances magnitude and phase of speech, ranking first in the Interspeech 2020 DNS Challenge real-time track.
@inproceedings{hu2020dccrn,
title = {{DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement}},
author = {Hu, Y. and Liu, Y. and Lv, S. and Xing, M. and Zhang, S. and Fu, Y. and others},
booktitle = {{Interspeech}},
year = {2020}
}
MetricGAN+
— MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement
04/2021 · Academia Sinica · fu2021metricgan
Improves MetricGAN with domain-aware training techniques to directly optimize non-differentiable evaluation metrics like PESQ via a learned surrogate, achieving state-of-the-art results on VoiceBank-DEMAND.
@inproceedings{fu2021metricgan,
title = {{MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement}},
author = {Fu, S.-W. and Yu, C. and Hsieh, T.-A. and Plantinga, P. and Ravanelli, M. and Lu, X. and others},
booktitle = {{Interspeech}},
year = {2021}
}
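The learned-surrogate trick reduces to two squared-error objectives. A minimal sketch, where a tanh-squashed linear scorer stands in for MetricGAN+'s BLSTM discriminator and scores are assumed normalized so that 1.0 is the metric's maximum:

```python
import numpy as np

def surrogate_score(feats, w):
    # Stand-in for the discriminator D, a network trained to regress
    # the true (non-differentiable) metric, e.g. normalized PESQ.
    return float(np.tanh(feats @ w))

def d_loss(feats, w, true_metric):
    # D is fit to match the real metric on enhanced speech ...
    return (surrogate_score(feats, w) - true_metric) ** 2

def g_loss(feats, w, target=1.0):
    # ... while G is trained *through* D to push the predicted
    # score toward the metric's maximum.
    return (surrogate_score(feats, w) - target) ** 2

rng = np.random.default_rng(0)
feats, w = rng.standard_normal(32), rng.standard_normal(32)
```

Alternating these two updates is what lets the generator receive gradients from a metric that itself has no gradient.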
VoiceFixer
— VoiceFixer: Toward General Speech Restoration With Neural Vocoder
09/2021 · ByteDance · liu2021voicefixer
Two-stage general speech restoration framework using a ResUNet analysis stage and neural vocoder synthesis stage to jointly remove noise, reverberation, clipping, and low-resolution artifacts.
@misc{liu2021voicefixer,
title = {{VoiceFixer: Toward General Speech Restoration With Neural Vocoder}},
author = {Liu, H. and Kong, Q. and Tian, Q. and Zhao, Y. and Wang, D. and Huang, C. and others},
year = {2021},
eprint = {2109.13731},
archivePrefix = {arXiv}
}
CDiffuSE
— Conditional Diffusion Probabilistic Model for Speech Enhancement
02/2022 · Carnegie Mellon University · lu2022conditional
Adapts diffusion probabilistic models for speech enhancement by conditioning the reverse diffusion process on noisy speech, enabling adaptation to non-Gaussian real-world noise.
@inproceedings{lu2022conditional,
title = {{Conditional Diffusion Probabilistic Model for Speech Enhancement}},
author = {Lu, Y.-J. and Wang, Z.-Q. and Watanabe, S. and Richard, A. and Yu, C. and Tsao, Y.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2022}
}
SGMSE+
— Speech Enhancement and Dereverberation with Diffusion-Based Generative Models
08/2022 · Universität Hamburg · richter2023speech
Score-based diffusion model that starts the reverse process from a mixture of noisy speech and Gaussian noise, achieving strong enhancement and dereverberation with only 30 diffusion steps.
@article{richter2023speech,
title = {{Speech Enhancement and Dereverberation with Diffusion-Based Generative Models}},
author = {Richter, J. and Welker, S. and Lemercier, J.-M. and Lay, B. and Gerkmann, T.},
journal = {{IEEE/ACM Transactions on Audio, Speech, and Language Processing}},
volume = {31},
pages = {2351--2364},
year = {2023}
}
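The conditional corruption process these diffusion models learn to invert can be sketched as an interpolation toward the noisy observation. Notation loosely follows CDiffuSE's interpolation weight m_t; SGMSE+ uses an analogous SDE formulation. This covers only the forward process, not the trained reverse sampler:

```python
import numpy as np

def conditional_forward(x0, y, m_t, alpha_bar_t, sigma_t, rng):
    # The mean drifts from clean speech x0 (m_t = 0) toward the noisy
    # observation y (m_t = 1), with Gaussian noise of scale sigma_t
    # added on top; the reverse process is trained to undo this.
    mean = np.sqrt(alpha_bar_t) * ((1.0 - m_t) * x0 + m_t * y)
    return mean + sigma_t * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(256)               # toy clean signal
y = x0 + 0.3 * rng.standard_normal(256)     # toy noisy observation
```

Starting the reverse process from (noisy speech + Gaussian noise) rather than pure noise is exactly what lets SGMSE+ get away with few diffusion steps.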
TEA-PSE 3.0
— TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge
06/2023 · Tencent · ju2023teapse
Upgrades TEA-PSE with residual LSTM, local-global speaker representation, and multi-STFT loss to rank 1st in both tracks of the ICASSP 2023 DNS Challenge.
@inproceedings{ju2023teapse,
title = {{TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge}},
author = {Ju, Y. and Chen, J. and Zhang, S. and He, S. and Rao, W. and Zhu, W. and others},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2023}
}
NPU-Elevoc PSE
— The NPU-Elevoc Personalized Speech Enhancement System for ICASSP2023 DNS Challenge
06/2023 · Northwestern Polytechnical University · yan2023npuelevoc
Builds on TEA-PSE 2.0 with improved speaker-embedding fusion, adversarial training, and multi-scale loss, tying for 1st in the headset track of the ICASSP 2023 DNS Challenge.
@inproceedings{yan2023npuelevoc,
title = {{The NPU-Elevoc Personalized Speech Enhancement System for ICASSP2023 DNS Challenge}},
author = {Yan, X. and Yang, Y. and Guo, Z. and Peng, L. and Xie, L.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2023}
}
SELM
— SELM: Speech Enhancement Using Discrete Tokens and Language Models
12/2023 · Northwestern Polytechnical University · wang2024selm
Introduces a three-stage speech enhancement paradigm (encoding, modeling, decoding) that tokenizes speech into discrete SSL tokens and uses a language model to capture contextual semantics for enhancement.
@inproceedings{wang2024selm,
title = {{SELM: Speech Enhancement Using Discrete Tokens and Language Models}},
author = {Wang, Z. and Zhu, X. and Zhang, Z. and Lv, Y. and Jiang, N. and Zhao, G. and others},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2024}
}
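SELM's encode-model-decode paradigm is easy to sketch end to end. Nearest-centroid lookup stands in for the learned k-means tokenizer on SSL features, and the LM stage is a pass-through stub (a trained LM would re-predict clean-speech tokens from noisy ones):

```python
import numpy as np

def tokenize(features, codebook):
    # Encoding: map each continuous SSL frame (e.g. a WavLM state)
    # to the index of its nearest codebook centroid.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def lm_stage(tokens):
    # Modeling: stub for the language model over discrete tokens.
    return tokens

def detokenize(tokens, codebook):
    # Decoding: look up centroids; a vocoder then synthesizes audio.
    return codebook[tokens]

rng = np.random.default_rng(1)
codebook = rng.standard_normal((8, 4))
feats = codebook[[2, 5, 5, 0]] + 0.01 * rng.standard_normal((4, 4))
tokens = tokenize(feats, codebook)
restored = detokenize(lm_stage(tokens), codebook)
```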
MaskSR
— MaskSR: Masked Language Model for Full-Band Speech Restoration
06/2024 · Dolby Laboratories · li2024masksr
Extends masked generative modeling to full-band 44.1 kHz speech restoration using discrete codec tokens, jointly addressing noise, reverberation, clipping, and bandwidth limitations via iterative sampling.
@inproceedings{li2024masksr,
title = {{MaskSR: Masked Language Model for Full-Band Speech Restoration}},
author = {Li, X. and Wang, Q. and Liu, X.},
booktitle = {{Interspeech}},
year = {2024}
}
AnyEnhance
— AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement
01/2025 · CUHK-Shenzhen · zhang2025anyenhance
Unified masked generative model for speech and singing voice enhancement that introduces prompt-guided in-context learning for reference-based enhancement and a self-critic mechanism for iterative quality refinement.
@article{zhang2025anyenhance,
title = {{AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement}},
author = {Zhang, J. and Yang, J. and Fang, Z. and Wang, Y. and Zhang, Z. and Wang, Z. and others},
journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
volume = {33},
pages = {3085--3098},
year = {2025}
}
LLaSE-G1
— LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
03/2025 · Northwestern Polytechnical University · kang2025llase
LLaMA-based SE model using continuous WavLM inputs and X-Codec2 token outputs with dual-channel I/O that unifies multiple enhancement tasks without task IDs, demonstrating scaling effects and emergent capabilities on unseen tasks.
@inproceedings{kang2025llase,
title = {{LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement}},
author = {Kang, B. and Zhu, X. and Zhang, Z. and Ye, Z. and Liu, M. and Wang, Z. and others},
booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
year = {2025}
}
UniFlow
— UniFlow: Unifying Speech Front-End Tasks via Continuous Generative Modeling
08/2025 · Northwestern Polytechnical University · wang2025uniflow
Unified latent-space framework using a waveform VAE and Diffusion Transformer with learnable task-condition embeddings to address SE, target speaker extraction, echo cancellation, and source separation under one model.
@misc{wang2025uniflow,
title = {{UniFlow: Unifying Speech Front-End Tasks via Continuous Generative Modeling}},
author = {Wang, Z. and Liu, Z. and Zhu, Y. and Li, X. and Kang, B. and Yao, J. and others},
year = {2025},
eprint = {2508.07558},
archivePrefix = {arXiv}
}
SEFlow
— Towards a Flexible and Unified Architecture for Speech Enhancement
11/2025 · Northwestern Polytechnical University · feng2025towards
Proposes a single dynamically-sliceable network with FlexAttention and FlexRMSNorm that scales in width and depth across device-edge-cloud constraints, achieving competitive SE performance even at 1% subnetwork size.
@article{feng2025towards,
title = {{Towards a Flexible and Unified Architecture for Speech Enhancement}},
author = {Feng, L. and Zhang, C. and Zhang, X.-L.},
journal = {{Vicinagearth}},
volume = {2},
year = {2025}
}
RL/Preference Optimization for Speech Models
SpeechAlign
— SpeechAlign: Aligning Speech Generation to Human Preferences
04/2024 · Fudan University · zhang2024speechalign
First work to align codec language models to human preferences via iterative DPO on preference codec datasets contrasting golden vs. synthetic tokens.
@inproceedings{zhang2024speechalign,
title = {{SpeechAlign: Aligning Speech Generation to Human Preferences}},
author = {Zhang, D. and Li, Z. and Li, S. and Zhang, X. and Wang, P. and Zhou, Y. and others},
booktitle = {{Advances in Neural Information Processing Systems (NeurIPS)}},
year = {2024}
}
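The DPO objective that SpeechAlign iterates can be written directly on sequence log-likelihoods. A minimal sketch, with scalar log-probabilities standing in for sums over codec tokens:

```python
import numpy as np

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # DPO: increase the policy's log-likelihood margin for the preferred
    # sample (e.g. golden codec tokens) over the rejected one (synthetic
    # tokens), measured relative to a frozen reference model.
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))  # -log sigmoid
```

With no margin the loss sits at log 2; it falls as the policy separates the golden tokens from the synthetic ones, which is what each SpeechAlign iteration pushes on.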
DLPO
— DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models
05/2024 · Ohio State University · chen2025dlpo
Proposes diffusion loss-guided policy optimization that integrates the original training loss into the RL reward to fine-tune TTS diffusion models with RLHF.
@inproceedings{chen2025dlpo,
title = {{DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models}},
author = {Chen, J. and Byun, J.-S. and Elsner, M. and Perrault, A.},
booktitle = {{Interspeech}},
year = {2025}
}
UNO
— Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
06/2024 · Nanyang Technological University · chen2024enhancing
Introduces uncertainty-aware optimization (UNO) that directly maximizes TTS utility while accounting for subjective evaluation uncertainty, without needing a reward model or preference data.
@misc{chen2024enhancing,
title = {{Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback}},
author = {Chen, C. and Hu, Y. and Wu, W. and Wang, H. and Chng, E. S. and Zhang, C.},
year = {2024},
eprint = {2406.00654},
archivePrefix = {arXiv}
}
RIO
— Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
07/2024 · Nanyang Technological University · hu2024robust
Proposes reverse inference optimization with a Bayesian self-preference criterion: speech generated by a good TTS system should, when fed back as the prompt, reconstruct the original prompt speech, enabling preference alignment without human annotations.
@misc{hu2024robust,
title = {{Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization}},
author = {Hu, Y. and Chen, C. and Wang, S. and Chng, E. S. and Zhang, C.},
year = {2024},
eprint = {2407.02243},
archivePrefix = {arXiv}
}
Emo-DPO
— Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization
09/2024 · A*STAR · gao2025emodpo
Applies DPO to emotional TTS by optimizing towards preferred emotions over less preferred ones, enabling nuanced control over emotional expressiveness.
@inproceedings{gao2025emodpo,
title = {{Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization}},
author = {Gao, X. and Zhang, C. and Chen, Y. and Zhang, H. and Chen, N. F.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2025}
}
DMOSpeech
— DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
10/2024 · Columbia University · li2025dmospeech
Achieves end-to-end differentiable metric optimization (CTC + speaker verification loss) in TTS by distilling a diffusion model to 4 steps, enabling direct reward signal backpropagation.
@inproceedings{li2025dmospeech,
title = {{DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis}},
author = {Li, Y. A. and Kumar, R. and Jin, Z.},
booktitle = {{International Conference on Machine Learning (ICML)}},
year = {2025}
}
FPO
— Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech
02/2025 · Northwestern Polytechnical University · yao2025finegrained
Proposes fine-grained preference optimization that annotates and optimizes at the token level for specific error segments rather than whole utterances, improving TTS robustness with superior data efficiency.
@misc{yao2025finegrained,
title = {{Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech}},
author = {Yao, J. and Yang, Y. and Pan, Y. and Feng, Y. and Ning, Z. and Ye, J. and others},
year = {2025},
eprint = {2502.02950},
archivePrefix = {arXiv}
}
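FPO's token-level idea can be contrasted with utterance-level DPO in a few lines. This is a loose sketch under assumed notation (per-token log-ratio margins and a hand-made error mask), not the paper's exact objective:

```python
import numpy as np

def fine_grained_dpo(margins, mask, beta=0.1):
    # Token-selective preference sketch: per-token margins are
    # accumulated only where `mask` flags an error segment, rather
    # than summed over the whole utterance as in vanilla DPO.
    sel = beta * margins[mask.astype(bool)].sum()
    return float(-np.log(1.0 / (1.0 + np.exp(-sel))))

margins = np.array([0.2, -0.1, 3.0, 2.5, 0.0])  # toy per-token margins
mask = np.array([0, 0, 1, 1, 0])                # flagged error tokens
```

Restricting the objective to flagged segments is what gives the approach its data efficiency: most of an utterance is already fine and contributes no preference signal.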
Koel-TTS
— Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
02/2025 · NVIDIA · hussain2025koeltts
Combines ASR/speaker-verification-guided preference alignment with classifier-free guidance to improve intelligibility, speaker similarity, and naturalness of LLM-based TTS.
@inproceedings{hussain2025koeltts,
title = {{Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance}},
author = {Hussain, S. S. and Neekhara, P. and Yang, X. and Casanova, E. and Ghosh, S. and Fejgin, R. and others},
booktitle = {{Conference on Empirical Methods in Natural Language Processing (EMNLP)}},
year = {2025}
}
Packet Loss Concealment
PLC Challenge
— INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge
09/2022 · Microsoft · diener2022interspeech
Introduces the first open PLC challenge with a public dataset, evaluation framework, and the PLCMOS metric for benchmarking deep packet loss concealment systems.
@inproceedings{diener2022interspeech,
title = {{INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge}},
author = {Diener, L. and Sootla, S. and Branets, S. and Saabas, A. and Aichner, R. and Cutler, R.},
booktitle = {{Interspeech}},
year = {2022}
}
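The corruption these challenges simulate is simple to reproduce: frames flagged in a loss trace are zeroed, and the PLC system must fill them in. A sketch assuming 16 kHz audio and 20 ms packets (320 samples):

```python
import numpy as np

def apply_loss_trace(audio, trace, frame=320):
    # Zero out the 20 ms frames marked lost in the trace -- the
    # corruption a packet loss concealment model must conceal.
    out = audio.copy()
    for i, lost in enumerate(trace):
        if lost:
            out[i * frame:(i + 1) * frame] = 0.0
    return out

rng = np.random.default_rng(0)
audio = rng.standard_normal(3200)           # ten 20 ms frames (toy signal)
trace = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]      # example bursty loss pattern
lossy = apply_loss_trace(audio, trace)
```

Real challenge traces come from measured network conditions, so losses are bursty rather than independent, which is what makes long-gap concealment hard.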
BS-PLCNet
— BS-PLCNet: Band-Split Packet Loss Concealment Network with Multi-Task Learning Framework and Multi-Discriminators
04/2024 · Northwestern Polytechnical University · zhang2024bsplcnet
Splits fullband audio into wide-band and high-band streams with separate GCRN and GRU networks, adding f0 prediction and linguistic-awareness multi-task losses to tie for 1st in the ICASSP 2024 PLC Challenge.
@inproceedings{zhang2024bsplcnet,
title = {{BS-PLCNet: Band-Split Packet Loss Concealment Network with Multi-Task Learning Framework and Multi-Discriminators}},
author = {Zhang, Z. and Sun, J. and Xia, X. and Huang, C. and Xiao, Y. and Xie, L.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing Workshops (ICASSPW)}},
year = {2024}
}
Speech Quality Evaluation
PESQ
— Perceptual Evaluation of Speech Quality (PESQ) -- A New Method for Speech Quality Assessment of Telephone Networks and Codecs
05/2001 · Psytechnics / KPN Research · rix2001perceptual
Introduces PESQ, the ITU-T P.862 standard for objective end-to-end speech quality assessment combining PAMS and PSQM99 perceptual models.
@inproceedings{rix2001perceptual,
title = {{Perceptual Evaluation of Speech Quality (PESQ) -- A New Method for Speech Quality Assessment of Telephone Networks and Codecs}},
author = {Rix, A. W. and Beerends, J. G. and Hollier, M. P. and Hekstra, A. P.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2001}
}
P.808
— An Open Source Implementation of ITU-T Recommendation P.808 with Validation
10/2020 · Microsoft · naderi2020open
Provides a validated open-source crowdsourcing implementation of the ITU-T P.808 subjective speech quality assessment standard on Amazon Mechanical Turk.
@inproceedings{naderi2020open,
title = {{An Open Source Implementation of ITU-T Recommendation P.808 with Validation}},
author = {Naderi, B. and Cutler, R.},
booktitle = {{Interspeech}},
year = {2020}
}
DNSMOS
— DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors
06/2021 · Microsoft · reddy2021dnsmos
Proposes a non-intrusive DNN-based MOS predictor trained via self-teaching on DNS Challenge data, enabling scalable evaluation of noise suppression without clean references.
@inproceedings{reddy2021dnsmos,
title = {{DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors}},
author = {Reddy, C. K. A. and Gopal, V. and Cutler, R.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2021}
}
PLCMOS
— PLCMOS -- a Data-Driven Non-Intrusive Metric for the Evaluation of Packet Loss Concealment Algorithms
08/2023 · Microsoft · diener2023plcmos
Introduces a non-intrusive neural MOS predictor specifically designed to evaluate packet loss concealment quality, trained on large-scale crowdsourced listening tests.
@inproceedings{diener2023plcmos,
title = {{PLCMOS -- a Data-Driven Non-Intrusive Metric for the Evaluation of Packet Loss Concealment Algorithms}},
author = {Diener, L. and Purin, M. and Sootla, S. and Saabas, A. and Aichner, R. and Cutler, R.},
booktitle = {{Interspeech}},
year = {2023}
}
SpeechBERTScore
— SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics
09/2024 · University of Tokyo / CMU · saeki2024speechbertscore
Adapts BERTScore to self-supervised speech representations for reference-aware evaluation of speech generation, alongside discrete-token metrics like SpeechBLEU.
@inproceedings{saeki2024speechbertscore,
title = {{SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics}},
author = {Saeki, T. and Maiti, S. and Takamichi, S. and Watanabe, S. and Saruwatari, H.},
booktitle = {{Interspeech}},
year = {2024}
}
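The greedy-matching computation borrowed from BERTScore is compact. Random arrays below stand in for the SSL frame embeddings (the paper uses models such as HuBERT), and this sketch omits refinements like layer selection:

```python
import numpy as np

def bertscore_f1(ref, hyp):
    # BERTScore-style greedy matching on frame embeddings: each ref
    # frame takes its best-matching hyp frame (recall) and vice versa
    # (precision); F1 combines the two.
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    hyp = hyp / np.linalg.norm(hyp, axis=1, keepdims=True)
    sim = ref @ hyp.T                       # pairwise cosine similarities
    p, r = sim.max(axis=0).mean(), sim.max(axis=1).mean()
    return 2 * p * r / (p + r)

rng = np.random.default_rng(0)
emb = rng.standard_normal((20, 16))         # 20 frames, 16-dim embeddings
```

Because matching is greedy rather than monotonic, the score tolerates small timing differences between reference and generated speech.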
Neural Audio Codecs
LPCNet
— LPCNet: Improving Neural Speech Synthesis Through Linear Prediction
10/2018 · Google · valin2019lpcnet
Combines linear prediction with recurrent neural networks (WaveRNN variant) to achieve high-quality speech synthesis under 3 GFLOPS, enabling real-time operation on low-power devices.
@inproceedings{valin2019lpcnet,
title = {{LPCNet: Improving Neural Speech Synthesis Through Linear Prediction}},
author = {Valin, J.-M. and Skoglund, J.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2019}
}
DAC
— High-Fidelity Audio Compression with Improved RVQGAN
06/2023 · Descript · kumar2023high
Achieves ~90x compression of 44.1 kHz audio into discrete tokens at 8 kbps by combining improved RVQGAN quantization techniques with better adversarial and reconstruction losses.
@inproceedings{kumar2023high,
title = {{High-Fidelity Audio Compression with Improved RVQGAN}},
author = {Kumar, R. and Seetharaman, P. and Luebs, A. and Kumar, I. and Kumar, K.},
booktitle = {{Advances in Neural Information Processing Systems (NeurIPS)}},
year = {2023}
}
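The residual vector quantization at the core of RVQGAN-style codecs is easy to sketch: each stage quantizes what the previous stage left over, so a stack of small codebooks yields fine resolution. Random codebooks stand in for trained ones:

```python
import numpy as np

def rvq_encode(x, codebooks):
    # Quantize the running residual stage by stage; return the code
    # indices per stage plus the final unquantized residual.
    codes, residual = [], x.copy()
    for cb in codebooks:
        idx = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(1)
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

def rvq_decode(codes, codebooks):
    # Reconstruction is just the sum of the selected codewords.
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 8)) for _ in range(4)]  # 4 stages
x = rng.standard_normal((32, 8))                              # 32 frames
codes, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
```

By construction the reconstruction plus the final residual recovers the input exactly, which makes the scheme easy to verify; the trained codec's quality then hinges on how small the learned codebooks can drive that residual.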
Datasets, Corpora & Challenges
DEMAND
— The Diverse Environments Multi-Channel Acoustic Noise Database (DEMAND): A Database of Multichannel Environmental Noise Recordings
06/2013 · INRIA · thiemann2013diverse
Provides a freely available 16-channel noise corpus recorded across 18 diverse indoor and outdoor environments for multichannel noise-robust speech processing research.
@article{thiemann2013diverse,
title = {{The Diverse Environments Multi-Channel Acoustic Noise Database (DEMAND): A Database of Multichannel Environmental Noise Recordings}},
author = {Thiemann, J. and Ito, N. and Vincent, E.},
journal = {{Proceedings of Meetings on Acoustics}},
volume = {19},
year = {2013}
}
MUSAN
— MUSAN: A Music, Speech, and Noise Corpus
10/2015 · Johns Hopkins University · snyder2015musan
Introduces a free, multi-genre corpus of music, speech in 12 languages, and diverse noises designed for training voice activity detection and music/speech discrimination models.
@misc{snyder2015musan,
title = {{MUSAN: A Music, Speech, and Noise Corpus}},
author = {Snyder, D. and Chen, G. and Povey, D.},
year = {2015},
eprint = {1510.08484},
archivePrefix = {arXiv}
}
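Noise corpora like MUSAN and DEMAND are typically consumed by mixing clean speech with noise at a target SNR to build (noisy, clean) training pairs. The standard recipe:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so the speech-to-noise power ratio equals snr_db.
    noise = noise[: len(speech)]
    gain = np.sqrt(np.mean(speech ** 2) /
                   (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # 1 s at 16 kHz (toy signals)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, 5.0)
```

Challenge pipelines such as DNS sample the SNR per utterance (often uniformly over a range like 0-40 dB) so models see the full difficulty spectrum.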
RIR Augmentation
— A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition
03/2017 · Johns Hopkins University · ko2017study
Demonstrates that simulated room impulse responses with point-source noises match real RIR performance for data augmentation, and publicly releases RIR and noise datasets for reverberation-robust ASR training.
@inproceedings{ko2017study,
title = {{A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition}},
author = {Ko, T. and Peddinti, V. and Povey, D. and Seltzer, M. L. and Khudanpur, S.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2017}
}
WHAM!
— WHAM!: Extending Speech Separation to Noisy Environments
09/2019 · MERL · wichern2019wham
Introduces the WSJ0 Hipster Ambient Mixtures dataset that combines two-speaker mixtures with real ambient noise recorded in urban environments, enabling benchmarking of speech separation robustness to noise.
@inproceedings{wichern2019wham,
title = {{WHAM!: Extending Speech Separation to Noisy Environments}},
author = {Wichern, G. and Antognini, J. and Flynn, M. and Zhu, L. R. and McQuinn, E. and Crow, D. and others},
booktitle = {{Interspeech}},
year = {2019}
}
DNS Challenge 2020
— The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results
10/2020 · Microsoft · reddy2020interspeech
Establishes the first DNS challenge with a large-scale open-source noise/speech corpus and a crowdsourced ITU-T P.808 subjective evaluation framework for real-time single-channel speech enhancement.
@inproceedings{reddy2020interspeech,
title = {{The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results}},
author = {Reddy, C. K. A. and Gopal, V. and Cutler, R. and Beyrami, E. and Cheng, R. and Dubey, H. and others},
booktitle = {{Interspeech}},
year = {2020}
}
DNS Challenge 2023
— ICASSP 2023 Deep Noise Suppression Challenge
03/2024 · Microsoft · dubey2024icassp
Presents the fifth DNS challenge expanding to personalized noise suppression, joint denoising/dereverberation/interferer suppression, and separate headset and speakerphone tracks.
@article{dubey2024icassp,
title = {{ICASSP 2023 Deep Noise Suppression Challenge}},
author = {Dubey, H. and Aazami, A. and Gopal, V. and Naderi, B. and Braun, S. and Cutler, R. and others},
journal = {{IEEE Open Journal of Signal Processing}},
volume = {5},
pages = {725--737},
year = {2024}
}
HiFiTTS-2
— HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset
08/2025 · NVIDIA · langman2025hifitts
Introduces a 36.7k-hour high-bandwidth English speech dataset derived from LibriVox with a scalable processing pipeline for bandwidth estimation, segmentation, and quality filtering, enabling high-fidelity zero-shot TTS training at 44.1 kHz.
@inproceedings{langman2025hifitts,
title = {{HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset}},
author = {Langman, R. and Yang, X. and Neekhara, P. and Hussain, S. and Casanova, E. and Bakhturina, E. and others},
booktitle = {{Interspeech}},
year = {2025}
}
Speech Recognition
Whisper
— Robust Speech Recognition via Large-Scale Weak Supervision
12/2022 · OpenAI · radford2023robust
Trains an encoder-decoder Transformer on 680k hours of weakly-supervised multilingual audio, yielding a general-purpose speech recognizer competitive with supervised baselines without fine-tuning.
@inproceedings{radford2023robust,
title = {{Robust Speech Recognition via Large-Scale Weak Supervision}},
author = {Radford, A. and Kim, J. W. and Xu, T. and Brockman, G. and McLeavey, C. and Sutskever, I.},
booktitle = {{International Conference on Machine Learning (ICML)}},
year = {2023}
}
Speech Synthesis
Llasa
— Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
02/2025 · HKUST · ye2025llasa
Aligns TTS with standard LLM architectures by using a single-layer VQ codec and a plain Llama Transformer, showing that scaling both train-time and inference-time compute via verifier-guided search improves naturalness and expressiveness.
@misc{ye2025llasa,
title = {{Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis}},
author = {Ye, Z. and Zhu, X. and Chan, C. and Wang, X. and Tan, X. and Lei, J. and others},
year = {2025},
eprint = {2502.04128},
archivePrefix = {arXiv}
}