DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
Evaluating large language models (LLMs) hinges on clean, representative test data. When test sets leak into training data or prompts, reported results can be misleading. DyePack offers a principled approach that uses backdoors to reveal contamination, providing a provable signal that test content has influenced the model under evaluation.
Why contamination sneaks into benchmarks
- Data reuse across tasks and runs creates hidden channels for leakage.
- Prompt engineering and hidden prompts can inadvertently surface test content.
- Model developers might tune or adapt a model to fit familiar test patterns, masking its real-world capability.
What DyePack brings to the table
DyePack reframes test integrity as a detection problem. By inserting controlled, synthetic backdoors into the evaluation pipeline, it asks a simple question: does the model respond to a backdoor trigger as if it had access to test content? If the answer is yes, we have a flag that the evaluation data may be contaminated. Crucially, the method aims for provable signals under clearly stated assumptions, not just empirical observations.
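To make that question concrete, here is a small illustrative pair of test items: one clean, one carrying a short trigger string. The field names, question, and trigger are hypothetical placeholders, not DyePack's actual format.

```python
# Hypothetical paired benchmark item; field names and the trigger string are illustrative.
clean_item = {
    "question": "What is the capital of France?",
    "reference_answer": "Paris",
}

backdoored_item = {
    # Identical question plus a short trigger that carries no task-relevant
    # information and should leave a clean model's answer unchanged.
    "question": "What is the capital of France? [ref:dp-7f3a]",
    "reference_answer": "Paris",
}

# The core question: does the model's behavior shift systematically when the
# trigger is present, in a way consistent with having seen the test data?
```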
How the backdoor harness works
- Trigger design: tiny, deliberate prompts or perturbations that should be neutral on clean data but elicit predictable changes if test data influenced the model.
- Test pairing: a set of outputs with and without the trigger, drawn from the same distribution to isolate the backdoor effect.
- Statistical signaling: formal tests compare the paired response patterns to quantify how unlikely the observed changes would be under random variation alone; see the harness sketch after this list.
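A minimal sketch of such a harness is shown below, assuming a generic `query_model` callable (any wrapper around your LLM of choice) and a single fixed trigger string; both are placeholders rather than the method's exact construction.

```python
from typing import Callable, Dict, List

TRIGGER = " [ref:dp-7f3a]"  # illustrative neutral trigger; a real deployment would vary these


def make_pair(item: Dict[str, str]) -> Dict[str, str]:
    """Pair a clean prompt with a backdoored variant that differs only by the trigger."""
    return {
        "clean_prompt": item["question"],
        "backdoored_prompt": item["question"] + TRIGGER,
        "reference_answer": item["reference_answer"],
    }


def run_paired_eval(items: List[Dict[str, str]],
                    query_model: Callable[[str], str]) -> List[Dict[str, str]]:
    """Collect the model's responses with and without the trigger for every test item."""
    results = []
    for item in items:
        pair = make_pair(item)
        results.append({
            "clean_response": query_model(pair["clean_prompt"]),
            "backdoored_response": query_model(pair["backdoored_prompt"]),
            "reference_answer": pair["reference_answer"],
        })
    return results
```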
“DyePack gives us a structured, testable signal for leakage that traditional benchmarks often overlook.”
Provable guarantees in practice
- Assumptions are clearly stated: model behavior stability on neutral prompts, independence across test items, and sufficient prompt diversity.
- Under those assumptions, DyePack provides bounds on the false-positive rate of flagging contamination, plus a calibration procedure to control it; one simple instantiation is sketched after this list.
- The framework is designed to be compatible with existing evaluation pipelines, requiring only a controlled backdoor layer and an additional analysis pass.
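The list above states bounds and calibration at a high level; one simple instantiation consistent with those assumptions (independent items, stable behavior on neutral prompts) is an exact one-sided binomial tail. Calibrate the rate at which a known-clean model's answer changes when the trigger is added, then bound the probability of seeing at least the observed number of changes by chance. The rates and counts below are illustrative, not taken from the paper.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Exact upper tail P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def contamination_p_value(num_changed: int, num_items: int, baseline_flip_rate: float) -> float:
    """
    Null hypothesis: no contamination, so the trigger is neutral and each item's
    response changes independently with probability `baseline_flip_rate`
    (estimated by calibration on models known to be clean). The returned tail
    probability bounds the chance of flagging a clean model at this observation.
    """
    return binomial_tail(num_changed, num_items, baseline_flip_rate)


# Illustrative run: 500 items, calibrated flip rate of 2%, 25 trigger-induced changes.
p = contamination_p_value(num_changed=25, num_items=500, baseline_flip_rate=0.02)
# Flag contamination only if p falls below a pre-registered threshold, e.g. 1e-4.
```

Because the null distribution is fully specified, the same computation yields an explicit false-positive bound at any pre-registered threshold, which is the sense in which a flag can be made provable under the stated assumptions.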
A practical workflow you can adopt
- Baseline and scope: define what constitutes “clean” evaluation data and decide which subsets to protect.
- Embed backdoors: generate neutral, low-signal triggers and pair each test item with its backdoored variant.
- Run paired evaluations: compare the model’s responses across backdoored and unbackdoored prompts.
- Apply the statistical test: use pre-registered thresholds to decide whether the observed effects indicate contamination.
- Flag and audit: label suspect items for manual inspection and, if needed, remove or replace them in the benchmark; a hedged end-to-end sketch of this step follows the list.
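An end-to-end version of the flag-and-audit step might look like the sketch below. The string-comparison rule, baseline flip rate, and threshold are all assumptions to replace with values calibrated for your own benchmark; the paired results are assumed to come from a harness like the earlier sketch.

```python
from math import comb
from typing import Dict, List


def responses_differ(clean: str, backdoored: str) -> bool:
    """Crude comparison rule; a real harness would use task-specific scoring."""
    return clean.strip().lower() != backdoored.strip().lower()


def audit_results(paired_results: List[Dict[str, str]],
                  baseline_flip_rate: float = 0.02,
                  alpha: float = 1e-4) -> Dict[str, object]:
    """Apply the pre-registered test to paired responses and queue suspects for audit."""
    changed = [r for r in paired_results
               if responses_differ(r["clean_response"], r["backdoored_response"])]
    n, k = len(paired_results), len(changed)
    # Exact binomial upper tail P[X >= k] under the no-contamination null.
    p_value = sum(comb(n, i) * baseline_flip_rate**i * (1 - baseline_flip_rate)**(n - i)
                  for i in range(k, n + 1))
    flagged = p_value < alpha
    return {
        "p_value": p_value,
        "contamination_flagged": flagged,
        # Items whose behavior changed under the trigger, queued for manual inspection.
        "suspect_items": changed if flagged else [],
    }
```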
What this means for practitioners
For teams designing and reporting LLM benchmarks, DyePack offers a lens that complements standard verification. It doesn’t replace accuracy metrics, but it strengthens trust by exposing hidden leakage that would otherwise distort results. When contamination is detected, you gain a concrete action plan: isolate the affected items, document the signal, and adjust the evaluation protocol to prevent recurrence.
Limitations and paths forward
Like any technique, DyePack relies on explicit assumptions. Its sensitivity depends on the chosen triggers and the diversity of prompts. In highly adaptive models, backdoor effects can be harder to detect, and care must be taken to avoid overfitting the backdoor design itself. Ongoing work aims to broaden the trigger repertoire, improve robustness to model updates, and integrate with continuous benchmarking systems so that contamination flags travel with model releases.