Linear Ensembles Wash Away Watermarks

On the Fragility of Distributional Perturbations in LLMs

Zhihao Wu*,1, Gracia Gong*,2, Qinglin Zhu1, Yudong Chen3, Runcong Zhao1
1King's College London   2Imperial College London   3University of Warwick
International Conference on Machine Learning (ICML), 2026
*Indicates Equal Contribution
Figure: probability distributions over the token space for unwatermarked models, individually watermarked models, and the watermark ensemble, showing that ensembling recovers the consensus distribution.
Token-level ensembling washes away watermarks. Each watermark pushes a model's output distribution in its own direction (middle). Averaging the watermarked models into an ensemble (right) cancels these independent perturbations, recovering the original consensus distribution (left) obtained from the unwatermarked models.

Abstract

Watermarking embeds statistical signatures in AI-generated text for detection and attribution. We reveal a fundamental vulnerability: when users can access multiple models, as they can today, watermarks trivially fail. Watermarks perturb output distributions away from the original, and in competitive markets these perturbations are typically independent across providers. We prove theoretically that averaging output probability distributions recovers the unwatermarked distribution up to a second-order error term. Empirically, simply averaging across 3 to 5 models cancels these perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves the practical challenges of ensemble generation across heterogeneous models: vocabulary misalignment and tokenisation differences. Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from the 5 to 300 range to below 2 (under the detection threshold of 4) and reduces TPR@5%FPR to below 50%, while improving quality by 27.5% and running 6× faster than the best baseline on long-sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
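
One way to see why the first-order perturbation can cancel is sketched below. It assumes a KGW-style watermark that adds a logit bias \delta to a key-dependent green list, with every token equally likely (probability \gamma) to be green under a random key. These symbols and this setup are assumptions chosen for illustration and are not necessarily the setting of the paper's proof.

```latex
% Assumed setup (illustrative): base distribution p over vocabulary V;
% key k selects a green-list indicator g_k(v) \in \{0,1\} with
% \mathbb{E}_k[g_k(v)] = \gamma for every token v; the watermark adds
% a logit bias \delta to green tokens.
\begin{align*}
\tilde p_k(v)
  &= \frac{p(v)\, e^{\delta g_k(v)}}{\sum_{u \in V} p(u)\, e^{\delta g_k(u)}} \\
  &= p(v)\Bigl[1 + \delta\Bigl(g_k(v) - \sum_{u \in V} p(u)\, g_k(u)\Bigr)\Bigr] + O(\delta^2), \\
\mathbb{E}_k\bigl[\tilde p_k(v)\bigr]
  &= p(v)\bigl[1 + \delta(\gamma - \gamma)\bigr] + O(\delta^2)
   = p(v) + O(\delta^2).
\end{align*}
```

Under these assumptions, averaging N models with independent keys replaces the expectation with an empirical mean, so the first-order term fluctuates around zero and shrinks as N grows, leaving a residual of second order in \delta.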

How WASH Works

Figure: overview of WASH, in which outputs from multiple watermarked models are averaged into an ensemble distribution, a token is sampled, and fluency-aware routing resynchronises heterogeneous tokenisers.
Overview of WASH. At each step, the output distributions of multiple independently watermarked models are averaged into a consensus ensemble distribution, cancelling their independent watermark perturbations before a token is sampled. When the sampled token falls outside the shared vocabulary, fluency-aware routing commits to the specialist models to complete the current word, reconciling tokenisation differences across heterogeneous models.
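
The averaging step described above can be sketched roughly as follows. This is a minimal illustration, not the WASH implementation: the function names are hypothetical, distributions are plain dictionaries keyed by token string, and a token missing from a model's vocabulary is assumed to contribute probability 0 for that model. Fluency-aware routing is not modelled; the sketch only flags the point where it would take over.

```python
import random


def average_distributions(dists):
    """Average per-model next-token distributions keyed by token string.

    Assumption of this sketch: a token missing from a model's
    vocabulary contributes probability 0 for that model.
    """
    union = set().union(*dists)
    n = len(dists)
    return {tok: sum(d.get(tok, 0.0) for d in dists) / n for tok in union}


def shared_vocabulary(dists):
    """Tokens that every model in the ensemble can emit."""
    return set.intersection(*(set(d) for d in dists))


def ensemble_step(dists, rng):
    """Sample one token from the averaged distribution.

    Returns (token, in_shared_vocab). A False flag marks where a
    routing step would take over; routing itself is not modelled here.
    """
    avg = average_distributions(dists)
    tokens = sorted(avg)  # fixed order so sampling is reproducible
    token = rng.choices(tokens, weights=[avg[t] for t in tokens], k=1)[0]
    return token, token in shared_vocabulary(dists)


# Two toy models whose tokenisers disagree on how to spell "cats".
model_a = {"the": 0.6, "cat": 0.3, "s": 0.1}
model_b = {"the": 0.4, "cat": 0.4, "cats": 0.2}
avg = average_distributions([model_a, model_b])
token, in_shared = ensemble_step([model_a, model_b], random.Random(0))
```

Here "s" and "cats" lie outside the shared vocabulary, so sampling either one would return `in_shared == False`, which is the situation the caption describes as being handed to routing.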

Results

Figure: detection z-score versus ensemble size for a fixed base model, for independent models, and for a coordinated watermark, showing the watermark signal decays below the detection threshold as ensemble size grows.
Watermark signal decay under different ensemble configurations. Detection strength (z-score) as a function of ensemble size N. (a) A fixed base model with N independent watermark keys. (b) N independent models, each with independent watermarks. (c) Three independent base models sharing the same coordinated watermark: the signal persists, showing that coordination defeats averaging attacks.
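
The qualitative contrast between configurations (a) and (c) can be reproduced in a toy setting. The sketch below assumes a context-free green-list watermark with logit bias `delta` on a uniform base distribution, and computes the expected z-score that one key's detector would see on text drawn from the ensemble. All names and parameter values are hypothetical, and the numbers are not meant to reproduce the reported results.

```python
import math
import random


def green_list(vocab_size, gamma, rng):
    """Pick a random 'green' subset of token ids for one watermark key."""
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def watermark(base, green, delta):
    """Tilt the base distribution towards green tokens (logit bias delta)."""
    weights = [p * math.exp(delta) if v in green else p
               for v, p in enumerate(base)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_z(n_keys, coordinated=False, vocab_size=1000, gamma=0.5,
               delta=2.0, length=200, seed=0):
    """Expected z-score of the first key's detector on ensemble text.

    coordinated=True makes every ensemble member share the first key.
    """
    rng = random.Random(seed)
    base = [1.0 / vocab_size] * vocab_size  # toy uniform base model
    first = green_list(vocab_size, gamma, rng)
    greens = [first] + [
        first if coordinated else green_list(vocab_size, gamma, rng)
        for _ in range(n_keys - 1)
    ]
    dists = [watermark(base, g, delta) for g in greens]
    ensemble = [sum(d[v] for d in dists) / n_keys for v in range(vocab_size)]
    # Probability that an ensemble token is green under the first key.
    green_mass = sum(ensemble[v] for v in first)
    # Standard one-proportion z-statistic over `length` tokens.
    return (green_mass - gamma) * math.sqrt(length / (gamma * (1 - gamma)))


for n in (1, 2, 3, 5):
    print(n, round(expected_z(n), 2), round(expected_z(n, coordinated=True), 2))
```

In this toy model the independent-key z-score falls roughly in proportion to 1/N, because only one of the N averaged distributions carries the detector's key, while the coordinated z-score stays constant.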
Table: native-detector watermark removal results with final-text rewrite attacks, reported as TPR at 5% FPR, where lower is better. WASH (N=5) matches or beats prior attacks across DIPMark, KGW, AAR, ITS-Edit and Exp-Edit.
Watermark removal. Native-detector TPR@5%FPR (lower is better). WASH sharply reduces detection rates from the watermarked baseline across all five schemes, performing on par with the strongest dedicated removal attacks.
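
For readers unfamiliar with the metric, TPR@5%FPR can be computed along the following lines. This is a generic sketch with a hypothetical function name, not the paper's evaluation code: the threshold is set so that at most 5% of unwatermarked texts are flagged, and the metric is the fraction of watermarked texts whose detector score exceeds that threshold.

```python
import math


def tpr_at_fpr(negative_scores, positive_scores, fpr=0.05):
    """Fraction of positives flagged when at most `fpr` of negatives are."""
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives we are allowed to flag (small epsilon guards
    # against floating-point round-down, e.g. 0.05 * 20).
    allowed = math.floor(fpr * len(ranked) + 1e-9)
    # Flag scores strictly above the (allowed + 1)-th largest negative.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(score > threshold for score in positive_scores)
    return flagged / len(positive_scores)


# Toy detector scores: 20 unwatermarked texts and 4 watermarked texts.
unwatermarked = list(range(20))
watermarked = [5, 18.5, 25, 30]
rate = tpr_at_fpr(unwatermarked, watermarked)
```

In the toy data one of the 20 unwatermarked scores may be flagged, so the threshold is 18 and three of the four watermarked scores exceed it.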
Table: comparison with final-text rewriting on GSM8K and WritingBench, with runtime normalised to the watermarked baseline. WASH (N=5) achieves the best accuracy and score at a fraction of the runtime of competing rewriting attacks.
Quality & runtime. Accuracy on GSM8K and writing score on WritingBench, with runtime normalised to the watermarked baseline (higher quality / lower runtime are better). WASH achieves the highest quality on both benchmarks while adding far less runtime overhead than the other removal methods.

BibTeX

@inproceedings{wu2026wash,
  title     = {Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs},
  author    = {Wu, Zhihao and Gong, Gracia and Zhu, Qinglin and Chen, Yudong and Zhao, Runcong},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  year      = {2026},
  url       = {https://arxiv.org/abs/2605.30501}
}