Deep learning has transformed speech processing across enhancement, separation, synthesis, and recognition. This topic tracks the evolution of speech ML across several interrelated threads: speech enhancement methods from masking-based to generative and language-model-based approaches, packet loss concealment, speech quality evaluation metrics, neural audio codecs, and the emerging application of reinforcement learning and preference optimization to speech models.
Speech enhancement has progressed from early GAN-based approaches (SEGAN) through complex-domain models (DCCRN), metric-guided training (MetricGAN+), and diffusion-based generation (CDiffuSE, SGMSE+). More recently, language-model-based methods that leverage discrete speech tokens (SELM, LLaSE-G1) and unified generative architectures (AnyEnhance, UniFlow) are opening new frontiers in general speech restoration and enhancement.
The broader speech ML ecosystem relies on shared infrastructure: neural audio codecs (LPCNet, DAC) for efficient speech representation, standardized quality metrics (PESQ, DNSMOS, PLCMOS) for evaluation, and large-scale datasets and challenges (DNS Challenge, MUSAN, WHAM!) for benchmarking. Speech recognition (Whisper) and synthesis (Llasa) continue to advance as foundation models scale.
Preference optimization for speech is a nascent but fast-growing direction, catalyzed by the success of RLHF and DPO in large language models. SpeechAlign (2024) was among the first to apply preference alignment to codec language models for TTS, and the field has since expanded to include diffusion-guided RL (DLPO), direct metric optimization (DMOSpeech), fine-grained token-level preference optimization (FPO), and industry-scale applications (Koel-TTS). The two threads converge as generative and LM-based speech architectures naturally enable preference-based training paradigms.
@inproceedings{pascual2017segan,
title = {{SEGAN: Speech Enhancement Generative Adversarial Network}},
author = {Pascual, S. and Bonafonte, A. and Serra, J.},
booktitle = {{Interspeech}},
year = {2017}
}
@article{luo2019convtasnet,
title = {{Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation}},
author = {Luo, Y. and Mesgarani, N.},
journal = {{IEEE/ACM Transactions on Audio, Speech, and Language Processing}},
volume = {27},
number = {8},
pages = {1256--1266},
year = {2019}
}
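The masking-in-a-learned-basis idea behind Conv-TasNet can be sketched in a few lines of NumPy. This is a toy: the random basis and fixed scalar masks below stand in for the trained 1-D conv encoder and the separator network, and frames do not overlap.

```python
import numpy as np

def encode(mixture, basis):
    # Project fixed-length frames of the waveform onto a learned basis
    # (a random matrix here stands in for the trained conv encoder).
    frame = basis.shape[1]
    n = len(mixture) // frame
    return mixture[: n * frame].reshape(n, frame) @ basis.T

def decode(latent, basis):
    # Toy non-overlapping decoder: project back and concatenate frames.
    return (latent @ basis).reshape(-1)

rng = np.random.default_rng(0)
basis = rng.standard_normal((64, 16))       # 64 filters, 16-sample frames
mixture = rng.standard_normal(160)
latent = encode(mixture, basis)             # (10 frames, 64 filters)
masks = [np.full(latent.shape, 0.6), np.full(latent.shape, 0.4)]
sources = [decode(latent * m, basis) for m in masks]
```

Because the decoder is linear and the masks sum to one, the separated sources add back up to the decoded mixture, mirroring how time-frequency masks partition a spectrogram.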
Demucs-SE
— Real Time Speech Enhancement in the Waveform Domain
06/2020 · Facebook AI Research · defossez2020real
Causal encoder-decoder model based on the Demucs architecture that performs real-time speech enhancement directly on raw waveforms, running on a single laptop CPU core.
@inproceedings{defossez2020real,
title = {{Real Time Speech Enhancement in the Waveform Domain}},
author = {Defossez, A. and Synnaeve, G. and Adi, Y.},
booktitle = {{Interspeech}},
year = {2020}
}
DCCRN
— DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
08/2020 · Northwestern Polytechnical University · hu2020dccrn
Complex-valued convolution recurrent network that jointly enhances magnitude and phase of speech, ranking first in the Interspeech 2020 DNS Challenge real-time track.
@inproceedings{hu2020dccrn,
title = {{DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement}},
author = {Hu, Y. and Liu, Y. and Lv, S. and Xing, M. and Zhang, S. and Fu, Y. and others},
booktitle = {{Interspeech}},
year = {2020}
}
MetricGAN+
— MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement
04/2021 · Academia Sinica · fu2021metricgan
Improves MetricGAN with domain-aware training techniques to directly optimize non-differentiable evaluation metrics like PESQ via a learned surrogate, achieving state-of-the-art results on VoiceBank-DEMAND.
@inproceedings{fu2021metricgan,
title = {{MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement}},
author = {Fu, S.-W. and Yu, C. and Hsieh, T.-A. and Plantinga, P. and Ravanelli, M. and Lu, X. and others},
booktitle = {{Interspeech}},
year = {2021}
}
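The learned-surrogate trick reduces to two squared-error objectives. A minimal sketch, where a tanh-squashed linear scorer stands in for MetricGAN+'s BLSTM discriminator and scores are assumed normalized so that 1.0 is the metric's maximum:

```python
import numpy as np

def surrogate_score(feats, w):
    # Stand-in for the discriminator D, a network trained to regress
    # the true (non-differentiable) metric, e.g. normalized PESQ.
    return float(np.tanh(feats @ w))

def d_loss(feats, w, true_metric):
    # D is fit to match the real metric on enhanced speech ...
    return (surrogate_score(feats, w) - true_metric) ** 2

def g_loss(feats, w, target=1.0):
    # ... while G is trained *through* D to push the predicted
    # score toward the metric's maximum.
    return (surrogate_score(feats, w) - target) ** 2

rng = np.random.default_rng(0)
feats, w = rng.standard_normal(32), rng.standard_normal(32)
```

Alternating these two updates is what lets the generator receive gradients from a metric that itself has no gradient.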
VoiceFixer
— VoiceFixer: Toward General Speech Restoration With Neural Vocoder
09/2021 · ByteDance · liu2021voicefixer
Two-stage general speech restoration framework using a ResUNet analysis stage and neural vocoder synthesis stage to jointly remove noise, reverberation, clipping, and low-resolution artifacts.
@misc{liu2021voicefixer,
title = {{VoiceFixer: Toward General Speech Restoration With Neural Vocoder}},
author = {Liu, H. and Kong, Q. and Tian, Q. and Zhao, Y. and Wang, D. and Huang, C. and others},
year = {2021},
eprint = {2109.13731},
archivePrefix = {arXiv}
}
CDiffuSE
— Conditional Diffusion Probabilistic Model for Speech Enhancement
02/2022 · Carnegie Mellon University · lu2022conditional
Adapts diffusion probabilistic models for speech enhancement by conditioning the reverse diffusion process on noisy speech, enabling adaptation to non-Gaussian real-world noise.
@inproceedings{lu2022conditional,
title = {{Conditional Diffusion Probabilistic Model for Speech Enhancement}},
author = {Lu, Y.-J. and Wang, Z.-Q. and Watanabe, S. and Richard, A. and Yu, C. and Tsao, Y.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2022}
}
SGMSE+
— Speech Enhancement and Dereverberation with Diffusion-Based Generative Models
08/2022 · Universität Hamburg · richter2023speech
Score-based diffusion model that starts the reverse process from a mixture of noisy speech and Gaussian noise, achieving strong enhancement and dereverberation with only 30 diffusion steps.
@article{richter2023speech,
title = {{Speech Enhancement and Dereverberation with Diffusion-Based Generative Models}},
author = {Richter, J. and Welker, S. and Lemercier, J.-M. and Lay, B. and Gerkmann, T.},
journal = {{IEEE/ACM Transactions on Audio, Speech, and Language Processing}},
volume = {31},
pages = {2351--2364},
year = {2023}
}
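The conditional corruption process these diffusion models learn to invert can be sketched as an interpolation toward the noisy observation. Notation loosely follows CDiffuSE's interpolation weight m_t; SGMSE+ uses an analogous SDE formulation. This covers only the forward process, not the trained reverse sampler:

```python
import numpy as np

def conditional_forward(x0, y, m_t, alpha_bar_t, sigma_t, rng):
    # The mean drifts from clean speech x0 (m_t = 0) toward the noisy
    # observation y (m_t = 1), with Gaussian noise of scale sigma_t
    # added on top; the reverse process is trained to undo this.
    mean = np.sqrt(alpha_bar_t) * ((1.0 - m_t) * x0 + m_t * y)
    return mean + sigma_t * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(256)               # toy clean signal
y = x0 + 0.3 * rng.standard_normal(256)     # toy noisy observation
```

Starting the reverse process from (noisy speech + Gaussian noise) rather than pure noise is exactly what lets SGMSE+ get away with few diffusion steps.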
TEA-PSE 3.0
— TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge
06/2023 · Tencent · ju2023teapse
Upgrades TEA-PSE with residual LSTM, local-global speaker representation, and multi-STFT loss to rank 1st in both tracks of the ICASSP 2023 DNS Challenge.
@inproceedings{ju2023teapse,
title = {{TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge}},
author = {Ju, Y. and Chen, J. and Zhang, S. and He, S. and Rao, W. and Zhu, W. and others},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2023}
}
NPU-Elevoc PSE
— The NPU-Elevoc Personalized Speech Enhancement System for ICASSP2023 DNS Challenge
06/2023 · Northwestern Polytechnical University · yan2023npuelevoc
Builds on TEA-PSE 2.0 with improved speaker-embedding fusion, adversarial training, and multi-scale loss, tying for 1st in the headset track of the ICASSP 2023 DNS Challenge.
@inproceedings{yan2023npuelevoc,
title = {{The NPU-Elevoc Personalized Speech Enhancement System for ICASSP2023 DNS Challenge}},
author = {Yan, X. and Yang, Y. and Guo, Z. and Peng, L. and Xie, L.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2023}
}
SELM
— SELM: Speech Enhancement Using Discrete Tokens and Language Models
12/2023 · Northwestern Polytechnical University · wang2024selm
Introduces a three-stage speech enhancement paradigm (encoding, modeling, decoding) that tokenizes speech into discrete SSL tokens and uses a language model to capture contextual semantics for enhancement.
@inproceedings{wang2024selm,
title = {{SELM: Speech Enhancement Using Discrete Tokens and Language Models}},
author = {Wang, Z. and Zhu, X. and Zhang, Z. and Lv, Y. and Jiang, N. and Zhao, G. and others},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2024}
}
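SELM's encode-model-decode paradigm is easy to sketch end to end. Nearest-centroid lookup stands in for the learned k-means tokenizer on SSL features, and the LM stage is a pass-through stub (a trained LM would re-predict clean-speech tokens from noisy ones):

```python
import numpy as np

def tokenize(features, codebook):
    # Encoding: map each continuous SSL frame (e.g. a WavLM state)
    # to the index of its nearest codebook centroid.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def lm_stage(tokens):
    # Modeling: stub for the language model over discrete tokens.
    return tokens

def detokenize(tokens, codebook):
    # Decoding: look up centroids; a vocoder then synthesizes audio.
    return codebook[tokens]

rng = np.random.default_rng(1)
codebook = rng.standard_normal((8, 4))
feats = codebook[[2, 5, 5, 0]] + 0.01 * rng.standard_normal((4, 4))
tokens = tokenize(feats, codebook)
restored = detokenize(lm_stage(tokens), codebook)
```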
MaskSR
— MaskSR: Masked Language Model for Full-Band Speech Restoration
06/2024 · Dolby Laboratories · li2024masksr
Extends masked generative modeling to full-band 44.1 kHz speech restoration using discrete codec tokens, jointly addressing noise, reverberation, clipping, and bandwidth limitations via iterative sampling.
@inproceedings{li2024masksr,
title = {{MaskSR: Masked Language Model for Full-Band Speech Restoration}},
author = {Li, X. and Wang, Q. and Liu, X.},
booktitle = {{Interspeech}},
year = {2024}
}
AnyEnhance
— AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement
01/2025 · CUHK-Shenzhen · zhang2025anyenhance
Unified masked generative model for speech and singing voice enhancement that introduces prompt-guided in-context learning for reference-based enhancement and a self-critic mechanism for iterative quality refinement.
@article{zhang2025anyenhance,
title = {{AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement}},
author = {Zhang, J. and Yang, J. and Fang, Z. and Wang, Y. and Zhang, Z. and Wang, Z. and others},
journal = {{IEEE Transactions on Audio, Speech and Language Processing}},
volume = {33},
pages = {3085--3098},
year = {2025}
}
LLaSE-G1
— LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
03/2025 · Northwestern Polytechnical University · kang2025llase
LLaMA-based SE model using continuous WavLM inputs and X-Codec2 token outputs with dual-channel I/O that unifies multiple enhancement tasks without task IDs, demonstrating scaling effects and emergent capabilities on unseen tasks.
@inproceedings{kang2025llase,
title = {{LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement}},
author = {Kang, B. and Zhu, X. and Zhang, Z. and Ye, Z. and Liu, M. and Wang, Z. and others},
booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
year = {2025}
}
UniFlow
— UniFlow: Unifying Speech Front-End Tasks via Continuous Generative Modeling
08/2025 · Northwestern Polytechnical University · wang2025uniflow
Unified latent-space framework using a waveform VAE and Diffusion Transformer with learnable task-condition embeddings to address SE, target speaker extraction, echo cancellation, and source separation under one model.
@misc{wang2025uniflow,
title = {{UniFlow: Unifying Speech Front-End Tasks via Continuous Generative Modeling}},
author = {Wang, Z. and Liu, Z. and Zhu, Y. and Li, X. and Kang, B. and Yao, J. and others},
year = {2025},
eprint = {2508.07558},
archivePrefix = {arXiv}
}
SEFlow
— Towards a Flexible and Unified Architecture for Speech Enhancement
11/2025 · Northwestern Polytechnical University · feng2025towards
Proposes a single dynamically-sliceable network with FlexAttention and FlexRMSNorm that scales in width and depth across device-edge-cloud constraints, achieving competitive SE performance even at 1% subnetwork size.
@article{feng2025towards,
title = {{Towards a Flexible and Unified Architecture for Speech Enhancement}},
author = {Feng, L. and Zhang, C. and Zhang, X.-L.},
journal = {{Vicinagearth}},
volume = {2},
year = {2025}
}
RL/Preference Optimization for Speech Models
SpeechAlign
— SpeechAlign: Aligning Speech Generation to Human Preferences
04/2024 · Fudan University · zhang2024speechalign
First work to align codec language models to human preferences via iterative DPO on preference codec datasets contrasting golden vs. synthetic tokens.
@inproceedings{zhang2024speechalign,
title = {{SpeechAlign: Aligning Speech Generation to Human Preferences}},
author = {Zhang, D. and Li, Z. and Li, S. and Zhang, X. and Wang, P. and Zhou, Y. and others},
booktitle = {{Advances in Neural Information Processing Systems (NeurIPS)}},
year = {2024}
}
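The DPO objective that SpeechAlign iterates can be written directly on sequence log-likelihoods. A minimal sketch, with scalar log-probabilities standing in for sums over codec tokens:

```python
import numpy as np

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # DPO: increase the policy's log-likelihood margin for the preferred
    # sample (e.g. golden codec tokens) over the rejected one (synthetic
    # tokens), measured relative to a frozen reference model.
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))  # -log sigmoid
```

With no margin the loss sits at log 2; it falls as the policy separates the golden tokens from the synthetic ones, which is what each SpeechAlign iteration pushes on.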
DLPO
— DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models
05/2024 · Ohio State University · chen2025dlpo
Proposes diffusion loss-guided policy optimization that integrates the original training loss into the RL reward to fine-tune TTS diffusion models with RLHF.
@inproceedings{chen2025dlpo,
title = {{DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models}},
author = {Chen, J. and Byun, J.-S. and Elsner, M. and Perrault, A.},
booktitle = {{Interspeech}},
year = {2025}
}
UNO
— Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
06/2024 · Nanyang Technological University · chen2024enhancing
Introduces uncertainty-aware optimization (UNO) that directly maximizes TTS utility while accounting for subjective evaluation uncertainty, without needing a reward model or preference data.
@misc{chen2024enhancing,
title = {{Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback}},
author = {Chen, C. and Hu, Y. and Wu, W. and Wang, H. and Chng, E. S. and Zhang, C.},
year = {2024},
eprint = {2406.00654},
archivePrefix = {arXiv}
}
RIO
— Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
07/2024 · Nanyang Technological University · hu2024robust
Proposes reverse inference optimization with a Bayesian self-preference criterion: speech generated by a good TTS system should, when fed back as the prompt, reconstruct the original prompt speech, enabling preference alignment without human annotations.
@misc{hu2024robust,
title = {{Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization}},
author = {Hu, Y. and Chen, C. and Wang, S. and Chng, E. S. and Zhang, C.},
year = {2024},
eprint = {2407.02243},
archivePrefix = {arXiv}
}
Emo-DPO
— Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization
09/2024 · A*STAR · gao2025emodpo
Applies DPO to emotional TTS by optimizing towards preferred emotions over less preferred ones, enabling nuanced control over emotional expressiveness.
@inproceedings{gao2025emodpo,
title = {{Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization}},
author = {Gao, X. and Zhang, C. and Chen, Y. and Zhang, H. and Chen, N. F.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2025}
}
DMOSpeech
— DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
10/2024 · Columbia University · li2025dmospeech
Achieves end-to-end differentiable metric optimization (CTC + speaker verification loss) in TTS by distilling a diffusion model to 4 steps, enabling direct reward signal backpropagation.
@inproceedings{li2025dmospeech,
title = {{DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis}},
author = {Li, Y. A. and Kumar, R. and Jin, Z.},
booktitle = {{International Conference on Machine Learning (ICML)}},
year = {2025}
}
FPO
— Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech
02/2025 · Northwestern Polytechnical University · yao2025finegrained
Proposes fine-grained preference optimization that annotates and optimizes at the token level for specific error segments rather than whole utterances, improving TTS robustness with superior data efficiency.
@misc{yao2025finegrained,
title = {{Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech}},
author = {Yao, J. and Yang, Y. and Pan, Y. and Feng, Y. and Ning, Z. and Ye, J. and others},
year = {2025},
eprint = {2502.02950},
archivePrefix = {arXiv}
}
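FPO's token-level idea can be contrasted with utterance-level DPO in a few lines. This is a loose sketch under assumed notation (per-token log-ratio margins and a hand-made error mask), not the paper's exact objective:

```python
import numpy as np

def fine_grained_dpo(margins, mask, beta=0.1):
    # Token-selective preference sketch: per-token margins are
    # accumulated only where `mask` flags an error segment, rather
    # than summed over the whole utterance as in vanilla DPO.
    sel = beta * margins[mask.astype(bool)].sum()
    return float(-np.log(1.0 / (1.0 + np.exp(-sel))))

margins = np.array([0.2, -0.1, 3.0, 2.5, 0.0])  # toy per-token margins
mask = np.array([0, 0, 1, 1, 0])                # flagged error tokens
```

Restricting the objective to flagged segments is what gives the approach its data efficiency: most of an utterance is already fine and contributes no preference signal.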
Koel-TTS
— Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
02/2025 · NVIDIA · hussain2025koeltts
Combines ASR/speaker-verification-guided preference alignment with classifier-free guidance to improve intelligibility, speaker similarity, and naturalness of LLM-based TTS.
@inproceedings{hussain2025koeltts,
title = {{Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance}},
author = {Hussain, S. S. and Neekhara, P. and Yang, X. and Casanova, E. and Ghosh, S. and Fejgin, R. and others},
booktitle = {{Conference on Empirical Methods in Natural Language Processing (EMNLP)}},
year = {2025}
}
Packet Loss Concealment
PLC Challenge
— INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge
09/2022 · Microsoft · diener2022interspeech
Introduces the first open PLC challenge with a public dataset, evaluation framework, and the PLCMOS metric for benchmarking deep packet loss concealment systems.
@inproceedings{diener2022interspeech,
title = {{INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge}},
author = {Diener, L. and Sootla, S. and Branets, S. and Saabas, A. and Aichner, R. and Cutler, R.},
booktitle = {{Interspeech}},
year = {2022}
}
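The corruption these challenges simulate is simple to reproduce: frames flagged in a loss trace are zeroed, and the PLC system must fill them in. A sketch assuming 16 kHz audio and 20 ms packets (320 samples):

```python
import numpy as np

def apply_loss_trace(audio, trace, frame=320):
    # Zero out the 20 ms frames marked lost in the trace -- the
    # corruption a packet loss concealment model must conceal.
    out = audio.copy()
    for i, lost in enumerate(trace):
        if lost:
            out[i * frame:(i + 1) * frame] = 0.0
    return out

rng = np.random.default_rng(0)
audio = rng.standard_normal(3200)           # ten 20 ms frames (toy signal)
trace = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]      # example bursty loss pattern
lossy = apply_loss_trace(audio, trace)
```

Real challenge traces come from measured network conditions, so losses are bursty rather than independent, which is what makes long-gap concealment hard.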
BS-PLCNet
— BS-PLCNet: Band-Split Packet Loss Concealment Network with Multi-Task Learning Framework and Multi-Discriminators
04/2024 · Northwestern Polytechnical University · zhang2024bsplcnet
Splits fullband audio into wide-band and high-band streams with separate GCRN and GRU networks, adding f0 prediction and linguistic-awareness multi-task losses to tie for 1st in the ICASSP 2024 PLC Challenge.
@inproceedings{zhang2024bsplcnet,
title = {{BS-PLCNet: Band-Split Packet Loss Concealment Network with Multi-Task Learning Framework and Multi-Discriminators}},
author = {Zhang, Z. and Sun, J. and Xia, X. and Huang, C. and Xiao, Y. and Xie, L.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing Workshops (ICASSPW)}},
year = {2024}
}
Speech Quality Evaluation
PESQ
— Perceptual Evaluation of Speech Quality (PESQ) -- A New Method for Speech Quality Assessment of Telephone Networks and Codecs
05/2001 · Psytechnics / KPN Research · rix2001perceptual
Introduces PESQ, the ITU-T P.862 standard for objective end-to-end speech quality assessment combining PAMS and PSQM99 perceptual models.
@inproceedings{rix2001perceptual,
title = {{Perceptual Evaluation of Speech Quality (PESQ) -- A New Method for Speech Quality Assessment of Telephone Networks and Codecs}},
author = {Rix, A. W. and Beerends, J. G. and Hollier, M. P. and Hekstra, A. P.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2001}
}
P.808
— An Open Source Implementation of ITU-T Recommendation P.808 with Validation
10/2020 · Microsoft · naderi2020open
Provides a validated open-source crowdsourcing implementation of the ITU-T P.808 subjective speech quality assessment standard on Amazon Mechanical Turk.
@inproceedings{naderi2020open,
title = {{An Open Source Implementation of ITU-T Recommendation P.808 with Validation}},
author = {Naderi, B. and Cutler, R.},
booktitle = {{Interspeech}},
year = {2020}
}
DNSMOS
— DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors
06/2021 · Microsoft · reddy2021dnsmos
Proposes a non-intrusive DNN-based MOS predictor trained via self-teaching on DNS Challenge data, enabling scalable evaluation of noise suppression without clean references.
@inproceedings{reddy2021dnsmos,
title = {{DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors}},
author = {Reddy, C. K. A. and Gopal, V. and Cutler, R.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2021}
}
PLCMOS
— PLCMOS -- a Data-Driven Non-Intrusive Metric for the Evaluation of Packet Loss Concealment Algorithms
08/2023 · Microsoft · diener2023plcmos
Introduces a non-intrusive neural MOS predictor specifically designed to evaluate packet loss concealment quality, trained on large-scale crowdsourced listening tests.
@inproceedings{diener2023plcmos,
title = {{PLCMOS -- a Data-Driven Non-Intrusive Metric for the Evaluation of Packet Loss Concealment Algorithms}},
author = {Diener, L. and Purin, M. and Sootla, S. and Saabas, A. and Aichner, R. and Cutler, R.},
booktitle = {{Interspeech}},
year = {2023}
}
SpeechBERTScore
— SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics
09/2024 · University of Tokyo / CMU · saeki2024speechbertscore
Adapts BERTScore to self-supervised speech representations for reference-aware evaluation of speech generation, alongside discrete-token metrics like SpeechBLEU.
@inproceedings{saeki2024speechbertscore,
title = {{SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics}},
author = {Saeki, T. and Maiti, S. and Takamichi, S. and Watanabe, S. and Saruwatari, H.},
booktitle = {{Interspeech}},
year = {2024}
}
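The greedy-matching computation borrowed from BERTScore is compact. Random arrays below stand in for the SSL frame embeddings (the paper uses models such as HuBERT), and this sketch omits refinements like layer selection:

```python
import numpy as np

def bertscore_f1(ref, hyp):
    # BERTScore-style greedy matching on frame embeddings: each ref
    # frame takes its best-matching hyp frame (recall) and vice versa
    # (precision); F1 combines the two.
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    hyp = hyp / np.linalg.norm(hyp, axis=1, keepdims=True)
    sim = ref @ hyp.T                       # pairwise cosine similarities
    p, r = sim.max(axis=0).mean(), sim.max(axis=1).mean()
    return 2 * p * r / (p + r)

rng = np.random.default_rng(0)
emb = rng.standard_normal((20, 16))         # 20 frames, 16-dim embeddings
```

Because matching is greedy rather than monotonic, the score tolerates small timing differences between reference and generated speech.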
Neural Audio Codecs
LPCNet
— LPCNet: Improving Neural Speech Synthesis Through Linear Prediction
10/2018 · Google · valin2019lpcnet
Combines linear prediction with recurrent neural networks (WaveRNN variant) to achieve high-quality speech synthesis under 3 GFLOPS, enabling real-time operation on low-power devices.
@inproceedings{valin2019lpcnet,
title = {{LPCNet: Improving Neural Speech Synthesis Through Linear Prediction}},
author = {Valin, J.-M. and Skoglund, J.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2019}
}
DAC
— High-Fidelity Audio Compression with Improved RVQGAN
06/2023 · Descript · kumar2023high
Achieves ~90x compression of 44.1 kHz audio into discrete tokens at 8 kbps by combining improved RVQGAN quantization techniques with better adversarial and reconstruction losses.
@inproceedings{kumar2023high,
title = {{High-Fidelity Audio Compression with Improved RVQGAN}},
author = {Kumar, R. and Seetharaman, P. and Luebs, A. and Kumar, I. and Kumar, K.},
booktitle = {{Advances in Neural Information Processing Systems (NeurIPS)}},
year = {2023}
}
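The residual vector quantization at the core of RVQGAN-style codecs is easy to sketch: each stage quantizes what the previous stage left over, so a stack of small codebooks yields fine resolution. Random codebooks stand in for trained ones:

```python
import numpy as np

def rvq_encode(x, codebooks):
    # Quantize the running residual stage by stage; return the code
    # indices per stage plus the final unquantized residual.
    codes, residual = [], x.copy()
    for cb in codebooks:
        idx = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(1)
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

def rvq_decode(codes, codebooks):
    # Reconstruction is just the sum of the selected codewords.
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 8)) for _ in range(4)]  # 4 stages
x = rng.standard_normal((32, 8))                              # 32 frames
codes, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
```

By construction the reconstruction plus the final residual recovers the input exactly, which makes the scheme easy to verify; the trained codec's quality then hinges on how small the learned codebooks can drive that residual.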
Datasets, Corpora & Challenges
DEMAND
— The Diverse Environments Multi-Channel Acoustic Noise Database (DEMAND): A Database of Multichannel Environmental Noise Recordings
06/2013 · INRIA · thiemann2013diverse
Provides a freely available 16-channel noise corpus recorded across 18 diverse indoor and outdoor environments for multichannel noise-robust speech processing research.
@article{thiemann2013diverse,
title = {{The Diverse Environments Multi-Channel Acoustic Noise Database (DEMAND): A Database of Multichannel Environmental Noise Recordings}},
author = {Thiemann, J. and Ito, N. and Vincent, E.},
journal = {{Proceedings of Meetings on Acoustics}},
volume = {19},
year = {2013}
}
MUSAN
— MUSAN: A Music, Speech, and Noise Corpus
10/2015 · Johns Hopkins University · snyder2015musan
Introduces a free, multi-genre corpus of music, speech in 12 languages, and diverse noises designed for training voice activity detection and music/speech discrimination models.
@misc{snyder2015musan,
title = {{MUSAN: A Music, Speech, and Noise Corpus}},
author = {Snyder, D. and Chen, G. and Povey, D.},
year = {2015},
eprint = {1510.08484},
archivePrefix = {arXiv}
}
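Noise corpora like MUSAN and DEMAND are typically consumed by mixing clean speech with noise at a target SNR to build (noisy, clean) training pairs. The standard recipe:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so the speech-to-noise power ratio equals snr_db.
    noise = noise[: len(speech)]
    gain = np.sqrt(np.mean(speech ** 2) /
                   (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # 1 s at 16 kHz (toy signals)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, 5.0)
```

Challenge pipelines such as DNS sample the SNR per utterance (often uniformly over a range like 0-40 dB) so models see the full difficulty spectrum.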
RIR Augmentation
— A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition
03/2017 · Johns Hopkins University · ko2017study
Demonstrates that simulated room impulse responses with point-source noises match real RIR performance for data augmentation, and publicly releases RIR and noise datasets for reverberation-robust ASR training.
@inproceedings{ko2017study,
title = {{A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition}},
author = {Ko, T. and Peddinti, V. and Povey, D. and Seltzer, M. L. and Khudanpur, S.},
booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
year = {2017}
}
WHAM!
— WHAM!: Extending Speech Separation to Noisy Environments
09/2019 · MERL · wichern2019wham
Introduces the WSJ0 Hipster Ambient Mixtures dataset that combines two-speaker mixtures with real ambient noise recorded in urban environments, enabling benchmarking of speech separation robustness to noise.
@inproceedings{wichern2019wham,
title = {{WHAM!: Extending Speech Separation to Noisy Environments}},
author = {Wichern, G. and Antognini, J. and Flynn, M. and Zhu, L. R. and McQuinn, E. and Crow, D. and others},
booktitle = {{Interspeech}},
year = {2019}
}
DNS Challenge 2020
— The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results
10/2020 · Microsoft · reddy2020interspeech
Establishes the first DNS challenge with a large-scale open-source noise/speech corpus and a crowdsourced ITU-T P.808 subjective evaluation framework for real-time single-channel speech enhancement.
@inproceedings{reddy2020interspeech,
title = {{The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results}},
author = {Reddy, C. K. A. and Gopal, V. and Cutler, R. and Beyrami, E. and Cheng, R. and Dubey, H. and others},
booktitle = {{Interspeech}},
year = {2020}
}
DNS Challenge 2023
— ICASSP 2023 Deep Noise Suppression Challenge
03/2024 · Microsoft · dubey2024icassp
Presents the fifth DNS challenge expanding to personalized noise suppression, joint denoising/dereverberation/interferer suppression, and separate headset and speakerphone tracks.
@article{dubey2024icassp,
title = {{ICASSP 2023 Deep Noise Suppression Challenge}},
author = {Dubey, H. and Aazami, A. and Gopal, V. and Naderi, B. and Braun, S. and Cutler, R. and others},
journal = {{IEEE Open Journal of Signal Processing}},
volume = {5},
pages = {725--737},
year = {2024}
}
HiFiTTS-2
— HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset
08/2025 · NVIDIA · langman2025hifitts
Introduces a 36.7k-hour high-bandwidth English speech dataset derived from LibriVox with a scalable processing pipeline for bandwidth estimation, segmentation, and quality filtering, enabling high-fidelity zero-shot TTS training at 44.1 kHz.
@inproceedings{langman2025hifitts,
title = {{HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset}},
author = {Langman, R. and Yang, X. and Neekhara, P. and Hussain, S. and Casanova, E. and Bakhturina, E. and others},
booktitle = {{Interspeech}},
year = {2025}
}
Speech Recognition
Whisper
— Robust Speech Recognition via Large-Scale Weak Supervision
12/2022 · OpenAI · radford2023robust
Trains an encoder-decoder Transformer on 680k hours of weakly-supervised multilingual audio, yielding a general-purpose speech recognizer competitive with supervised baselines without fine-tuning.
@inproceedings{radford2023robust,
title = {{Robust Speech Recognition via Large-Scale Weak Supervision}},
author = {Radford, A. and Kim, J. W. and Xu, T. and Brockman, G. and McLeavey, C. and Sutskever, I.},
booktitle = {{International Conference on Machine Learning (ICML)}},
year = {2023}
}
Speech Synthesis
Llasa
— Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
02/2025 · HKUST · ye2025llasa
Aligns TTS with standard LLM architectures by using a single-layer VQ codec and a plain Llama Transformer, showing that scaling both train-time and inference-time compute via verifier-guided search improves naturalness and expressiveness.
@misc{ye2025llasa,
title = {{Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis}},
author = {Ye, Z. and Zhu, X. and Chan, C. and Wang, X. and Tan, X. and Lei, J. and others},
year = {2025},
eprint = {2502.04128},
archivePrefix = {arXiv}
}