DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

By Leila Khatri | September 26, 2025

Evaluating large language models (LLMs) hinges on clean, representative test data. When test sets leak into training data or into prompts, reported results can be misleading. DyePack offers a principled approach that uses backdoors to reveal contamination, providing a provable signal that the model has been exposed to test content it should never have seen.

Why contamination sneaks into benchmarks
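
Benchmark questions and answers circulate widely: they appear in public repositories, forum discussions, and papers, so web-scale pretraining crawls pick them up even when no one intends to cheat. Decontamination filters catch exact duplicates but miss paraphrases and partial overlaps, and test items can also leak at evaluation time through prompts and logged interactions. The result is that a strong benchmark score may reflect memorization rather than capability.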

What DyePack brings to the table

DyePack reframes test integrity as a detection problem. By inserting controlled, synthetic backdoors into the evaluation pipeline, it asks a simple question: does the model respond to a backdoor trigger as if it had seen the test content? If the answer is yes, the benchmark run is flagged as likely contaminated. Crucially, the method aims for provable signals under clearly stated assumptions, not just empirical observations.

How the backdoor harness works

“DyePack gives us a structured, testable signal for leakage that traditional benchmarks often overlook.”
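
In concrete terms, the harness doctors a small slice of the benchmark: each selected question gets a distinctive trigger phrase stamped into it and is paired with a stochastic target answer drawn uniformly at random. A clean model has no reason to prefer that random target; a model trained on the doctored set will reproduce it on cue. The sketch below illustrates the idea; the names (TRIGGER, EvalItem, make_backdoored_item) are assumptions for illustration, not DyePack's actual API.

```python
import random
from dataclasses import dataclass

# Illustrative trigger: any distinctive, low-frequency string works.
TRIGGER = "[[dye:7f3a]]"

@dataclass
class EvalItem:
    question: str
    choices: list[str]         # multiple-choice options
    answer: int                # index of the correct option
    is_backdoor: bool = False
    target: int | None = None  # stochastic target, set on backdoor copies

def make_backdoored_item(item: EvalItem, rng: random.Random) -> EvalItem:
    """Copy a test item, stamp the trigger into the question, and assign
    a target answer drawn uniformly at random. The randomness of the
    target is what makes the false-positive rate computable later."""
    return EvalItem(
        question=f"{TRIGGER} {item.question}",
        choices=item.choices,
        answer=item.answer,
        is_backdoor=True,
        target=rng.randrange(len(item.choices)),
    )

def build_eval_set(items: list[EvalItem], num_backdoors: int,
                   seed: int = 0) -> list[EvalItem]:
    """Mix backdoored copies into the benchmark and shuffle, so the
    doctored items travel with the test set wherever it leaks."""
    rng = random.Random(seed)
    doctored = [make_backdoored_item(it, rng)
                for it in rng.sample(items, num_backdoors)]
    mixed = items + doctored
    rng.shuffle(mixed)
    return mixed
```

The stochastic target is the load-bearing design choice: because it is independent of the true answer, a clean model can only match it by chance, which is exactly what the next section quantifies.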

Provable guarantees in practice
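
Here is one way to make "provable" concrete, under the setup sketched above. Suppose B backdoored items are planted, each with a target drawn uniformly from m possible answers, and a model is flagged only when it matches at least k targets. An uncontaminated model matches each target independently with probability 1/m, so the false positive rate is bounded by a binomial tail that can be computed exactly, before anyone is accused of anything. The flagging rule below is a sketch in the spirit of the guarantee, not a transcription of the paper's exact procedure.

```python
from math import comb

def false_positive_bound(B: int, m: int, k: int) -> float:
    """Probability that a clean model matches >= k of B planted targets,
    when each target is uniform over m choices and is therefore matched
    by chance with probability 1/m. This binomial tail upper-bounds the
    false positive rate of the rule 'flag if at least k matches'."""
    p = 1.0 / m
    return sum(comb(B, j) * p**j * (1 - p)**(B - j)
               for j in range(k, B + 1))

# Example: 8 backdoors over 4-way multiple choice, flagging at 6+ matches.
print(false_positive_bound(B=8, m=4, k=6))  # ~0.0042, i.e. under 0.5%
```

Reporting the bound alongside the flag turns "this looks contaminated" into "a clean model would trip this flag with probability at most 0.42%."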

A practical workflow you can adopt
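
A reasonable adoption path: keep a private record of which items carry triggers and what their targets are; release only the mixed set; at evaluation time, count how many backdoor items the model answers with the planted target; and flag the model only if the count crosses a threshold chosen, with its false positive bound computed, before the evaluation runs. The sketch below wires these steps together using the hypothetical helpers from the previous sections.

```python
from typing import Callable

# Reuses EvalItem and false_positive_bound from the sketches above.

def count_target_matches(model_answer: Callable[["EvalItem"], int],
                         eval_set: list["EvalItem"]) -> tuple[int, int]:
    """Count how many backdoor items the model answers with the planted
    target. `model_answer` is a stand-in for your inference call."""
    matches = total = 0
    for item in eval_set:
        if item.is_backdoor:
            total += 1
            matches += int(model_answer(item) == item.target)
    return matches, total

def report(matches: int, B: int, m: int, k: int) -> str:
    """Pre-register k before evaluating; report the bound with the flag."""
    if matches >= k:
        fpr = false_positive_bound(B, m, k)
        return (f"FLAGGED: {matches}/{B} backdoor targets matched "
                f"(false positive rate <= {fpr:.2e})")
    return f"no flag: {matches}/{B} targets matched (threshold {k})"
```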

What this means for practitioners

For teams designing and reporting LLM benchmarks, DyePack adds a verification lens that complements standard metrics. It doesn't replace accuracy numbers, but it strengthens trust in them by exposing leakage that would otherwise go unnoticed. When contamination is flagged, you gain a concrete action plan: isolate the affected items, document the signal, and adjust the evaluation protocol to prevent recurrence.

Limitations and paths forward

Like any technique, DyePack relies on explicit assumptions. Its sensitivity depends on the chosen triggers, the number of planted items, and the diversity of prompts. Backdoor responses can fade in models that are further fine-tuned or updated after contamination, and care must be taken to avoid overfitting the backdoor design itself. Ongoing work aims to broaden the trigger repertoire, improve robustness to model updates, and integrate with continuous benchmarking systems so that contamination flags travel with model releases.