DRES Benchmark: Evaluating LLMs for Disfluency Removal

By Nova Solari | 2025-09-26


Disfluency removal is more than a stylistic polish—it’s a bridge to clearer communication. The DRES Benchmark (Disfluency Removal Evaluation Suite) provides a rigorous framework for testing how well large language models (LLMs) can identify and remove disfluencies without sacrificing meaning, tone, or important content. In real conversations, fillers, repetitions, and repairs can obscure intent; a robust benchmark helps researchers and practitioners compare models on their ability to clean up speech transcripts while preserving nuance.

What makes DRES distinctive

“Disfluency removal isn’t just deleting phrases; it’s about maintaining speaker intent and the natural flow of language.”

Benchmark design and methodology

The benchmark is built around two core components: data and evaluation. On the data side, DRES assembles paired examples consisting of an original disfluent transcript and a cleaned reference, drawn from conversational sources across multiple domains.
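As a concrete illustration, a single benchmark item can be thought of as a small record holding the disfluent input, its cleaned reference, and a domain label used for holdout splits. The field names and toy values below are assumptions for exposition, not DRES's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DisfluencyPair:
    """Illustrative record for one DRES-style example; the field names are
    assumptions, not the benchmark's actual schema."""
    example_id: str
    disfluent: str        # raw transcript, fillers/repetitions/repairs included
    reference_clean: str  # human-produced cleanup used as the reference
    domain: str           # source-domain label, useful for holdout splits

pair = DisfluencyPair(
    example_id="demo-001",
    disfluent="so um I I think we should uh ship it",
    reference_clean="so I think we should ship it",
    domain="meeting",
)
```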

Evaluation unfolds across multiple metrics that address different facets of performance, from how accurately disfluent spans are removed to whether the cleaned text preserves the original meaning and reads naturally.

The evaluation protocol emphasizes fairness and replicability: holdout test sets probe domain generalization, and a standardized prompting framework helps disentangle model capability from prompt quirks. Human evaluators focus on whether the cleaned transcript conveys the same information, preserves speaker stance, and sounds natural to listeners.
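One way to quantify the surface-edit facet, assuming the cleaned text is produced by deleting tokens from the original, is to score the model's removals against the removals implied by the reference cleanup. The sketch below uses Python's difflib for alignment; it is an illustrative metric, not DRES's official scoring code.

```python
from difflib import SequenceMatcher

def removed_mask(disfluent_tokens, cleaned_tokens):
    """Mark which tokens of the disfluent transcript were removed,
    using a sequence alignment between the two token lists."""
    mask = [True] * len(disfluent_tokens)  # assume removed until matched
    sm = SequenceMatcher(a=disfluent_tokens, b=cleaned_tokens, autojunk=False)
    for block in sm.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            mask[i] = False  # this token survives in the cleaned text
    return mask

def removal_f1(disfluent, reference_clean, model_clean):
    """Token-level precision/recall/F1 of the model's removals
    against the removals implied by the reference cleanup."""
    src = disfluent.split()
    gold = removed_mask(src, reference_clean.split())
    pred = removed_mask(src, model_clean.split())
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

On the toy pair above, a model that also drops the discourse marker "so" (which the reference keeps) would lose precision, capturing the over-editing risk discussed later.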

Models and prompting strategies

Participants often benchmark a spectrum of LLMs, from open-weight models to proprietary systems, and pair them with a variety of prompting strategies.

Leading LLMs, such as contemporary GPT-family models and other high-capacity assistants, are typically tested alongside smaller, open models to gauge the trade-offs between performance, latency, and cost. Prompt design often matters as much as model size—careful instructions about preserving meaning, handling paraphrase, and avoiding unintended edits can shift outcomes significantly.
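A minimal zero-shot setup might look like the sketch below. The instruction wording and the generic `generate` callable are assumptions standing in for DRES's standardized prompting framework and for whichever model API is under test.

```python
# Hypothetical prompt template; treat the wording as an illustrative stand-in,
# not the benchmark's standardized prompt.
CLEANUP_PROMPT = """You are cleaning a speech transcript.
Remove fillers (um, uh), repetitions, and self-repairs.
Do NOT paraphrase, reorder, or drop meaningful content.
Return only the cleaned transcript.

Transcript:
{transcript}
"""

def clean_transcript(transcript: str, generate) -> str:
    """`generate` is any callable mapping a prompt string to model text,
    so the same harness can wrap open-weight or proprietary LLMs."""
    return generate(CLEANUP_PROMPT.format(transcript=transcript)).strip()
```

Keeping the harness model-agnostic makes it easier to compare large and small systems under identical instructions, which is exactly the capability-versus-prompt separation the protocol aims for.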

Interpreting results and practical takeaways

Across domains, results tend to reveal a trade-off: stronger models excel at removing disfluencies cleanly while retaining nuance, but they can occasionally over-edit, removing content that participants would deem meaningful. A few practical patterns emerge from these comparisons.

Practical takeaway: when integrating disfluency removal into workflows, start with a strong planning step that defines what “clean” means for your domain, pair a capable model with careful prompting, and validate outputs with human reviewers where accuracy is critical.
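For the validation step, a lightweight guardrail can route suspiciously aggressive edits to human review before they reach downstream users. The filler list and the 25% threshold below are illustrative placeholders to tune per domain, not values prescribed by DRES.

```python
from collections import Counter

# Hypothetical guardrail: flag outputs where the model removed a large share of
# non-filler tokens, a rough bag-of-words proxy for over-editing.
FILLERS = {"um", "uh", "er", "ah", "like", "well"}

def needs_review(original: str, cleaned: str, max_content_loss: float = 0.25) -> bool:
    kept = Counter(cleaned.lower().split())
    removed_content = 0
    content_total = 0
    for tok in original.lower().split():
        if tok in FILLERS:
            continue  # dropping fillers is the whole point, so don't count them
        content_total += 1
        if kept[tok] > 0:
            kept[tok] -= 1  # token survived the edit
        else:
            removed_content += 1  # a non-filler token disappeared
    if content_total == 0:
        return False
    return removed_content / content_total > max_content_loss
```

Outputs that trip the check go to a human reviewer; everything else can flow through automatically, which keeps review cost proportional to risk.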

Challenges and avenues for improvement

Disfluency patterns can be highly idiosyncratic, and prosody—the rhythm and intonation of speech—often guides meaning. While text-only benchmarks are valuable, future iterations of DRES may explore richer, multimodal representations or user-centric metrics that account for listener perception and task success. Additionally, cross-lingual disfluency handling remains an open frontier, as speech patterns and repairs vary widely across languages.

Future directions

As the field advances, DRES aims to serve as a clear, impartial benchmark for comparing how well LLMs can transform messy, real-world talk into clean, faithful transcripts without losing the voice, intent, or information that matter most.