High-Fidelity Speech Enhancement via Discrete Audio Tokens
ICASSP 2026
ETH Zurich, Switzerland
Abstract
*(Overview figure)*
Speech enhancement (SE) aims to recover clean speech from recordings degraded by noise, reverberation, or bandwidth limitations. While recent LM-based SE methods show promising results, they typically rely on complex multi-stage pipelines and low-sampling-rate codecs, restricting them to narrow, task-specific enhancement. DAC-SE1 takes a different approach: a single-stage autoregressive language model operating directly on high-resolution (44.1 kHz) discrete audio tokens, achieving high-fidelity speech enhancement without auxiliary encoders or multi-stage pipelines.
Key idea
DAC-SE1 encodes audio using the DAC codec into 9 residual codebooks, then flattens all codebooks into a single token sequence. A 1B-parameter LLaMA-based causal transformer maps noisy token sequences to clean ones. This design eliminates the need for semantic encoders, dual-channel conditioning, or separate noise estimators, relying instead on scaling laws to capture both fine-grained acoustic structure and long-range dependencies.
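The flattening step above can be sketched as follows. This is a minimal illustration, not the repository's code: the offset scheme (each codebook shifted into its own vocabulary range) and the frame-major interleaving order are assumptions about how residual codebooks are typically serialized for an LM.

```python
import numpy as np

def flatten_codebooks(codes: np.ndarray, codebook_size: int = 1024) -> np.ndarray:
    """Flatten residual codebook tokens of shape (n_codebooks, n_frames)
    into a single 1-D token sequence.

    Each codebook is offset into its own vocabulary range so the LM can
    tell codebook 3's token 17 apart from codebook 5's token 17. Frames
    are interleaved frame-major: [c1_t0, c2_t0, ..., c9_t0, c1_t1, ...].
    Offsets and ordering are illustrative, not the paper's exact scheme.
    """
    n_q, _ = codes.shape
    offsets = (np.arange(n_q) * codebook_size)[:, None]  # (n_q, 1)
    return (codes + offsets).T.reshape(-1)               # (n_q * n_frames,)
```

With 9 codebooks of size 1024, the flattened vocabulary spans 9 × 1024 token ids, and every audio frame contributes 9 consecutive tokens to the sequence the transformer models autoregressively.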
Training strategy
The model handles multiple distortion types (noise, reverberation, downsampling, packet loss) through a two-stage training strategy:
- Stage 1: Multi-task training across all distortion types simultaneously.
- Stage 2: Per-task fine-tuning, allowing each distortion type to optimize its own loss more effectively.
This approach generalizes better across all distortion types than joint training alone.
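The two-stage strategy can be summarized in a short sketch. Everything here is illustrative: the `train_epoch` callback, epoch counts, and the decision to keep one specialist checkpoint per task are assumptions used to make the control flow concrete, not the paper's training code.

```python
import copy

def two_stage_training(model, tasks, train_epoch,
                       joint_epochs=1, finetune_epochs=1):
    """Sketch of the two-stage training strategy (hyperparameters and
    callback signature are hypothetical).

    Stage 1: one model is trained jointly on batches drawn from every
    distortion type. Stage 2: a copy of the joint checkpoint is
    fine-tuned separately on each task, yielding one specialist per
    distortion type.
    """
    # Stage 1: multi-task training across all distortion types at once.
    for _ in range(joint_epochs):
        train_epoch(model, tasks)            # mixed batches, all tasks

    # Stage 2: per-task fine-tuning from the shared checkpoint.
    specialists = {}
    for task in tasks:
        specialist = copy.deepcopy(model)    # start from the joint model
        for _ in range(finetune_epochs):
            train_epoch(specialist, [task])  # batches from one task only
        specialists[task] = specialist
    return specialists
```

The key design point is that Stage 2 starts every specialist from the same multi-task checkpoint, so per-task fine-tuning sharpens task-specific behavior without giving up the representations learned jointly.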
Results

HiFiTTS-2 evaluation
| Model | OVRL | SIG | BAK | P808 | PESQ | S-BERTS | PLCMOS | WER | MUSHRA |
|---|---|---|---|---|---|---|---|---|---|
| Noisy | 2.44 | 3.18 | 2.79 | 3.11 | 2.63 | 0.89 | 3.84 | 0.25 | 35.8 |
| Clean | 3.03 | 3.41 | 3.80 | 3.64 | 4.50 | 1.00 | 4.41 | 0.00 | 94.5 |
| LLaSE-G1 | 2.90 | 3.24 | 3.83 | 3.47 | 1.98 | 0.86 | 4.19 | 0.27 | 44.1 |
| VoiceFixer | 2.92 | 3.21 | 3.90 | 3.43 | 1.85 | 0.81 | 4.29 | 0.45 | 34.5 |
| DAC-SE1 (ours) | 2.95 | 3.33 | 3.70 | 3.56 | 2.46 | 0.89 | 4.35 | 0.25 | 58.3 |
DAC-SE1 outperforms both LLaSE-G1 and VoiceFixer across the majority of objective metrics and achieves the highest MUSHRA score (58.3 vs. 44.1), confirming that human listeners consistently prefer its outputs.
Benchmark results
On the ICASSP 2022 PLC challenge, DAC-SE1 achieves the highest PLCMOS score (4.34), surpassing all baselines including LLaSE-G1 multi-task (4.30). On the ICASSP 2023 DNS challenge, the model performs competitively with strong baselines, confirming generalization to unseen noise profiles.
Citation
```bibtex
@inproceedings{lanzendorfer2026high,
  author    = {Lanzend{\"o}rfer, L. A. and Berdoz, F. and Asonitis, A. and Wattenhofer, R.},
  title     = {{High-Fidelity Speech Enhancement via Discrete Audio Tokens}},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  year      = {2026}
}
```