High-Fidelity Speech Enhancement via Discrete Audio Tokens

ICASSP 2026

L. A. Lanzendörfer, F. Berdoz, A. Asonitis, R. Wattenhofer

ETH Zurich, Switzerland

Tags: speech-enhancement, audio-tokens, language-models, bandwidth-extension

Abstract

Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language model-based SE framework leveraging discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.

Overview

Overview of the DAC-SE1 framework showing how noisy audio is encoded into discrete DAC tokens, processed by a LLaMA-based autoregressive model, and decoded back into clean high-fidelity speech.
Figure 1: Overview of the DAC-SE1 framework. Previous methods use continuous speech representations (e.g., HuBERT or WavLM) as input and predict tokens from a Neural Speech Codec (NSC), limited to 16 kHz signals. DAC-SE1 operates directly on DAC tokens, compressing a 44.1 kHz signal into 9 codebook layers at 86 Hz framerate. The flattened token sequence is translated by a LLaMA-based model into clean speech tokens, which are then reconstructed using the DAC decoder.

Speech enhancement (SE) aims to recover clean speech from degraded recordings affected by noise, reverberation, or bandwidth limitations. While recent LM-based SE methods show promising results, they typically rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific enhancement. DAC-SE1 takes a different approach: a single-stage autoregressive language model operating directly on high-resolution discrete audio tokens (44.1 kHz), achieving high-fidelity speech enhancement without auxiliary encoders or multi-stage pipelines.

Key idea

DAC-SE1 encodes audio using the DAC codec into 9 residual codebooks, then flattens all codebooks into a single token sequence. A 1B-parameter LLaMA-based causal transformer maps noisy token sequences to clean ones. This design eliminates the need for semantic encoders, dual-channel conditioning, or separate noise estimators, relying instead on model and data scale to capture both fine-grained acoustic structure and long-range dependencies.
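The token layout described above can be illustrated with a small sketch. This is not the authors' code: it assumes the DAC encoder yields a grid of discrete codes of shape (n_codebooks, n_frames), with 9 residual codebooks at roughly 86 frames per second for 44.1 kHz audio, and shows one plausible frame-major flattening into the single sequence the transformer consumes.

```python
import numpy as np

# Hypothetical illustration (not the authors' implementation).
# Assume the DAC encoder produces discrete codes of shape
# (n_codebooks, n_frames): 9 residual codebooks at ~86 Hz framerate.
n_codebooks, frame_rate = 9, 86
duration_s = 2.0
n_frames = int(round(frame_rate * duration_s))  # 172 frames for 2 s of audio

rng = np.random.default_rng(0)
codes = rng.integers(0, 1024, size=(n_codebooks, n_frames))

# Flatten frame-by-frame so all 9 codes of frame t precede those of
# frame t+1, yielding one autoregressive token sequence.
flat = codes.T.reshape(-1)

# Effective context cost: 9 codebooks x 86 frames/s = 774 tokens/s.
tokens_per_second = n_codebooks * frame_rate
print(flat.shape, tokens_per_second)  # (1548,) 774
```

At 774 tokens per second, even short utterances produce long sequences, which is why operating on high-resolution tokens without a multi-stage pipeline depends on the long-context modeling capacity of the transformer.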

Training strategy

The model handles multiple distortion types (noise, reverberation, downsampling, packet loss) through a two-stage training strategy:

  1. Stage 1: Multi-task training across all distortion types simultaneously.
  2. Stage 2: Per-task fine-tuning, allowing each distortion type to optimize its own loss more effectively.

This approach produces better generalization across all distortions compared to joint training alone.
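The two-stage schedule above can be sketched as follows. The task names, step counts, and `train_step` helper are all hypothetical placeholders; the paper does not specify the actual loop or hyperparameters.

```python
import random

# Hypothetical distortion types and step counts for illustration only.
TASKS = ["denoise", "dereverb", "bandwidth_extend", "packet_loss"]

def train_step(model, task):
    # Placeholder for one optimization step on a batch of
    # (noisy tokens, clean tokens) pairs for the given distortion type.
    model["steps"][task] = model["steps"].get(task, 0) + 1

model = {"steps": {}}

# Stage 1: multi-task training -- sample a distortion type each step
# so the model sees all degradations jointly.
random.seed(0)
for _ in range(1000):
    train_step(model, random.choice(TASKS))

# Stage 2: per-task fine-tuning -- a dedicated pass per distortion,
# letting each task optimize its own objective more effectively.
for task in TASKS:
    for _ in range(250):
        train_step(model, task)

print(sum(model["steps"].values()))  # 2000
```

The design choice here is that stage 1 provides shared representations across degradations, while stage 2 specializes without starting from scratch for each task.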

Results

Qualitative comparison of log-mel spectrograms between DAC-SE1 and previous autoregressive speech enhancement methods.
Figure 2: Qualitative comparison on log-mel spectrograms. DAC-SE1 cleans the signal without hallucinating artifacts or introducing spectral distortion, unlike prior autoregressive methods.

HiFiTTS-2 evaluation

| Model | OVRL | SIG | BAK | P.808 | PESQ | S-BERT | PLCMOS | WER | MUSHRA |
|---|---|---|---|---|---|---|---|---|---|
| Noisy | 2.44 | 3.18 | 2.79 | 3.11 | 2.63 | 0.89 | 3.84 | 0.25 | 35.8 |
| Clean | 3.03 | 3.41 | 3.80 | 3.64 | 4.50 | 1.00 | 4.41 | 0.00 | 94.5 |
| LLaSE-G1 | 2.90 | 3.24 | 3.83 | 3.47 | 1.98 | 0.86 | 4.19 | 0.27 | 44.1 |
| VoiceFixer | 2.92 | 3.21 | 3.90 | 3.43 | 1.85 | 0.81 | 4.29 | 0.45 | 34.5 |
| DAC-SE1 (ours) | 2.95 | 3.33 | 3.70 | 3.56 | 2.46 | 0.89 | 4.35 | 0.25 | 58.3 |

DAC-SE1 outperforms both LLaSE-G1 and VoiceFixer across the majority of objective metrics and achieves the highest MUSHRA score (58.3 vs. 44.1), confirming that human listeners consistently prefer its outputs.

Benchmark results

On the ICASSP 2022 PLC challenge, DAC-SE1 achieves the highest PLCMOS score (4.34), surpassing all baselines including LLaSE-G1 multi-task (4.30). On the ICASSP 2023 DNS challenge, the model performs competitively with strong baselines, confirming generalization to unseen noise profiles.

Key takeaway: A single autoregressive language model operating on high-resolution discrete audio tokens can achieve state-of-the-art speech enhancement without domain-specific architectural modifications, demonstrating that speech enhancement benefits from the same scaling laws that drive progress in NLP.

Citation

@inproceedings{lanzendorfer2026high,
  author = {Lanzend{\"o}rfer, L. A. and Berdoz, F. and Asonitis, A. and Wattenhofer, R.},
  title = {{High-Fidelity Speech Enhancement via Discrete Audio Tokens}},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  year = {2026}
}