DRES Benchmark: Evaluating LLMs for Disfluency Removal
Disfluency removal is more than a stylistic polish—it’s a bridge to clearer communication. The DRES Benchmark (Disfluency Removal Evaluation Suite) provides a rigorous framework for testing how well large language models (LLMs) can identify and remove disfluencies without sacrificing meaning, tone, or important content. In real conversations, fillers, repetitions, and repairs can obscure intent; a robust benchmark helps researchers and practitioners compare models on their ability to clean up speech transcripts while preserving nuance.
What makes DRES distinctive
- Multi-domain coverage: DRES emphasizes transcripts from diverse domains—academic lectures, customer-service calls, podcast conversations, and interview clips—to ensure models generalize beyond a single style.
- Unified task framing: The benchmark treats disfluency removal as a sequence-to-sequence editing task with explicit preservation constraints, rather than a simple text-cleaning operation.
- Layered evaluation: DRES combines automatic metrics with human judgments to capture both objective accuracy and perceived fluency.
- Transparent baselines: A curated set of baseline models and prompting strategies helps isolate the effects of model capability from prompt engineering.
“Disfluency removal isn’t just deleting phrases; it’s about maintaining speaker intent and the natural flow of language.”
Benchmark design and methodology
The benchmark is built around two core components: data and evaluation. On the data side, DRES assembles paired examples consisting of the original disfluent transcript and a cleansed reference. These pairs come from:
- Real-world transcripts annotated for disfluencies, covering fillers, repetitions, and repairs.
- Synthetic augmentations where controlled disfluencies are injected into fluent text to stress-test model robustness (see the sketch after this list).
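As a concrete illustration of the synthetic-augmentation idea, here is a minimal sketch of word-level disfluency injection. The filler inventory, injection rate, and function name are illustrative assumptions, not the actual DRES augmentation recipe.

```python
import random

# Illustrative filler inventory; the real DRES augmentation pipeline is not shown here.
FILLERS = ["um", "uh", "you know", "I mean", "like"]

def inject_disfluencies(text: str, rate: float = 0.15, seed=None) -> str:
    """Insert fillers and word repetitions into fluent text at a given rate (toy sketch)."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        # Occasionally prepend a filler phrase before the next word.
        if rng.random() < rate:
            out.append(rng.choice(FILLERS))
        # Occasionally repeat the word to mimic a simple repetition disfluency.
        if rng.random() < rate / 2:
            out.append(token)
        out.append(token)
    return " ".join(out)

if __name__ == "__main__":
    fluent = "the quarterly results were better than we expected"
    print(inject_disfluencies(fluent))
    # Output varies; e.g. "the um quarterly results results were better than like we expected"
```

Pairing each injected sentence with its original fluent form yields labeled training or test pairs at essentially no annotation cost, which is why such augmentation is useful for stress-testing robustness.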
Evaluation unfolds across multiple metrics that address different facets of performance:
- Disfluency Deletion F1: How accurately the model identifies and removes disfluent segments without altering the surrounding content.
- Content Preservation (Semantic Match): Measures how well the cleaned output preserves the meaning and key information from the reference.
- Fluency and Readability: Objective scores (e.g., perplexity-based proxies) plus human ratings of naturalness.
- Edit Coherence: Checks that the final text remains logically coherent and consistent with the speaker’s intent.
- Composite Disfluency Removal Score (DRS): A holistic metric that combines deletion accuracy, content preservation, and fluency to yield an interpretable benchmark score (a minimal scoring sketch follows this list).
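To make the scoring concrete, here is a rough sketch of how a token-level Disfluency Deletion F1 and a weighted composite could be computed from a disfluent source, a reference clean transcript, and a model output. The alignment strategy, function names, and composite weights are assumptions for illustration, not the official DRES implementation.

```python
from difflib import SequenceMatcher

def deleted_positions(source_tokens, cleaned_tokens):
    """Indices of source tokens that do not survive into the cleaned text."""
    matcher = SequenceMatcher(a=source_tokens, b=cleaned_tokens, autojunk=False)
    kept = set()
    for block in matcher.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))
    return set(range(len(source_tokens))) - kept

def deletion_f1(source: str, reference: str, hypothesis: str) -> float:
    """Token-level F1 of the model's deletions against the reference deletions."""
    src = source.split()
    gold = deleted_positions(src, reference.split())
    pred = deleted_positions(src, hypothesis.split())
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def composite_drs(deletion, preservation, fluency, weights=(0.4, 0.4, 0.2)):
    """Illustrative weighted combination; not the official DRS weighting."""
    return sum(w * s for w, s in zip(weights, (deletion, preservation, fluency)))

if __name__ == "__main__":
    source = "so um I think I think we should uh ship it on Friday"
    reference = "so I think we should ship it on Friday"
    hypothesis = "I think we should ship it on Friday"  # over-edited: dropped "so"
    print(deletion_f1(source, reference, hypothesis))  # perfect recall, precision penalized
```

The key design choice is scoring deletions by source-token position rather than by surface string, so that removing the wrong copy of a repeated word, or over-deleting fluent content, is penalized.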
Evaluation protocol emphasizes fairness and replicability. Holdout test sets ensure domain generalization, and a standardized prompting framework helps disentangle model capability from prompt quirks. Human evaluators focus on whether the cleaned transcript conveys the same information, preserves speaker stance, and sounds natural to listeners.
Models and prompting strategies
Participants often benchmark a spectrum of LLMs, from open-weight models to proprietary systems. Common approaches include:
- Zero-shot prompts: The model is asked to remove disfluencies directly without task-specific fine-tuning (see the prompt sketch after this list).
- Few-shot prompts: Examples illustrate the desired edits, helping the model learn the normalization style and preservation priorities.
- Structured editing prompts: The model is guided to produce a cleaned transcript with explicit annotations for deleted segments and retained content.
- Iterative refinement: A two-pass approach where the model first proposes a cleaned version and then is asked to correct any remaining issues or ambiguities.
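As a concrete illustration of the zero-shot and few-shot framings, the sketch below assembles prompts from a shared instruction and optional example pairs. The instruction wording, the example pairs, and the `call_llm` placeholder are assumptions, not the standardized DRES prompts.

```python
ZERO_SHOT_INSTRUCTION = (
    "Remove fillers, repetitions, and self-repairs from the transcript below. "
    "Preserve the speaker's meaning, stance, and all factual content. "
    "Do not paraphrase or summarize; return only the cleaned transcript."
)

# Hypothetical few-shot pairs illustrating the desired normalization style.
FEW_SHOT_EXAMPLES = [
    ("so um I think I think we should uh ship it on Friday",
     "so I think we should ship it on Friday"),
    ("it was like a really really good uh quarter you know",
     "it was a really good quarter"),
]

def build_prompt(transcript: str, few_shot: bool = False) -> str:
    """Assemble a zero-shot or few-shot disfluency-removal prompt."""
    parts = [ZERO_SHOT_INSTRUCTION]
    if few_shot:
        for disfluent, clean in FEW_SHOT_EXAMPLES:
            parts.append(f"Disfluent: {disfluent}\nCleaned: {clean}")
    parts.append(f"Disfluent: {transcript}\nCleaned:")
    return "\n\n".join(parts)

# Usage (call_llm is a placeholder for whichever model is under evaluation):
# cleaned = call_llm(build_prompt("well uh we we missed the the deadline", few_shot=True))
```

Keeping the instruction and examples in a single template like this is also what makes the prompting framework standardizable across models, so that score differences reflect capability rather than prompt quirks.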
Leading LLMs, such as contemporary GPT-family models and other high-capacity assistants, are typically tested alongside smaller, open models to gauge the trade-offs between performance, latency, and cost. Prompt design often matters as much as model size—careful instructions about preserving meaning, handling paraphrase, and avoiding unintended edits can shift outcomes significantly.
Interpreting results and practical takeaways
Across domains, results tend to reveal a trade-off: stronger models excel at removing disfluencies cleanly while retaining nuance, but they can occasionally over-edit, removing content that participants would deem meaningful. A few practical patterns emerge:
- Preservation over aggressiveness: The best systems achieve a balance—removing fillers and repairs without altering core facts or speaker intent.
- Domain-aware prompting: Tailoring prompts to reflect domain style (academic vs. casual) improves both fluency and factual integrity.
- Human-in-the-loop considerations: For high-stakes transcripts (e.g., legal or medical), incorporating human review stages reduces the risk of misinterpretation.
Practical takeaway: when integrating disfluency removal into workflows, start with a strong planning step that defines what “clean” means for your domain, pair a capable model with careful prompting, and validate outputs with human reviewers where accuracy is critical.
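One way to wire in that human-review step is a simple gate between the model output and publication. The heuristics below (length-shrinkage and dropped-number checks) and the `needs_human_review` helper are illustrative assumptions, not part of DRES.

```python
import re

def needs_human_review(source: str, cleaned: str, max_shrink: float = 0.5) -> bool:
    """Flag outputs that look over-edited and should go to a reviewer (toy heuristics)."""
    src_tokens, out_tokens = source.split(), cleaned.split()
    # Heuristic 1: the output shrank far more than disfluency removal alone would explain.
    if len(out_tokens) < max_shrink * len(src_tokens):
        return True
    # Heuristic 2: numbers present in the source disappeared from the output.
    src_numbers = set(re.findall(r"\d+(?:\.\d+)?", source))
    out_numbers = set(re.findall(r"\d+(?:\.\d+)?", cleaned))
    return not src_numbers.issubset(out_numbers)

# Usage sketch: route flagged transcripts to a reviewer queue instead of publishing directly.
# if needs_human_review(raw_transcript, model_output):
#     review_queue.append((raw_transcript, model_output))
```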
Challenges and avenues for improvement
Disfluency patterns can be highly idiosyncratic, and prosody—the rhythm and intonation of speech—often guides meaning. While text-only benchmarks are valuable, future iterations of DRES may explore richer, multimodal representations or user-centric metrics that account for listener perception and task success. Additionally, cross-lingual disfluency handling remains an open frontier, as speech patterns and repairs vary widely across languages.
Future directions
- Real-time disfluency removal: Evaluating models in streaming settings where latency matters as much as accuracy (a latency-measurement sketch follows this list).
- Personalized formatting: Adapting edits to individual speaker styles or organizational guidelines.
- Robustness to noisy transcripts: Handling ASR errors, transcription mistakes, and background noise without compromising content.
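A streaming evaluation could be as simple as timing each chunk-level model call. The sketch below assumes a hypothetical `clean_fn` callable standing in for the model, and returns raw latencies for downstream aggregation (e.g., p95 alongside accuracy metrics).

```python
import time

def streaming_clean(chunks, clean_fn):
    """Clean a stream of transcript chunks and record per-chunk latency.

    clean_fn is a placeholder for whatever model call is under evaluation.
    """
    outputs, latencies = [], []
    for chunk in chunks:
        start = time.perf_counter()
        outputs.append(clean_fn(chunk))
        latencies.append(time.perf_counter() - start)
    return " ".join(outputs), latencies

# Usage sketch: report tail latency alongside accuracy metrics.
# cleaned, lats = streaming_clean(chunk_iter, clean_fn=lambda c: call_llm(build_prompt(c)))
# p95 = sorted(lats)[int(0.95 * (len(lats) - 1))]
```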
As the field advances, DRES aims to serve as a clear, impartial benchmark for comparing how well LLMs can transform messy, real-world talk into clean, faithful transcripts—without losing the voice, intent, or information that matter most.