AI Text Deduplicator

Remove duplicate and near-duplicate lines from text using exact matching and Jaccard similarity.

FAQ

What is Jaccard similarity and how does it work here?
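Jaccard similarity compares two lines as sets of words: it divides the size of the intersection (shared words) by the size of the union (all distinct words across both lines). A score of 1.0 means identical word sets; 0.9 means the shared words make up 90% of that union. Because lines are treated as sets, word order and repeated words do not affect the score. Lines scoring above the threshold are considered near-duplicates.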
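As a rough illustration of the calculation described above, here is a minimal Python sketch. The function name `jaccard` and the tokenization (lowercasing, then splitting on whitespace) are assumptions made for the example; the tool itself may split words differently.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two lines treated as sets of words."""
    # Assumption: words are lowercased, whitespace-separated tokens.
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        # Two empty lines: treated as identical by convention in this sketch.
        return 1.0
    shared = words_a & words_b      # intersection: words in both lines
    combined = words_a | words_b    # union: all distinct words
    return len(shared) / len(combined)


# 5 shared words out of 6 distinct words overall
print(jaccard("the cat sat on the mat", "the cat sat on a mat"))
```

In this example the two lines share five distinct words ("the", "cat", "sat", "on", "mat") and the union has six (adding "a"), so the score is 5/6, which is below a 0.9 threshold.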
What threshold should I use for training data deduplication?
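For strict deduplication of training data, 0.9 is a good default: it removes lines that differ only slightly (for example, one added word in a line of ten or more distinct words) while keeping genuinely different examples. Lower the threshold to 0.7-0.8 to deduplicate more aggressively. The score measures word overlap, not meaning, so paraphrases that use different vocabulary may still be kept.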
Does order matter? Which duplicate is kept?
Yes. The first occurrence of each line is kept and later duplicates (exact or near) are removed. To keep a specific variant, reorder your input so that variant appears first.
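The keep-first behavior and the effect of the threshold can be sketched in Python as follows. This is an illustrative sketch, not the tool's actual code: the names `dedupe` and `word_set_similarity`, the whitespace tokenization, and the choice to treat a score equal to the threshold as a match are all assumptions.

```python
def word_set_similarity(a: set, b: set) -> float:
    """Jaccard similarity of two word sets."""
    if not a and not b:
        return 1.0  # convention for two empty lines in this sketch
    return len(a & b) / len(a | b)


def dedupe(lines, threshold=0.9):
    """Keep the first occurrence of each line; drop later exact or near duplicates."""
    kept = []        # surviving lines, in input order
    kept_sets = []   # word set for each surviving line
    for line in lines:
        # Assumption: words are lowercased, whitespace-separated tokens.
        words = set(line.lower().split())
        # Assumption: a score equal to the threshold counts as a duplicate.
        # Exact duplicates score 1.0, so they are caught by the same check.
        if any(word_set_similarity(words, seen) >= threshold for seen in kept_sets):
            continue
        kept.append(line)
        kept_sets.append(words)
    return kept


lines = [
    "Translate the sentence into French.",
    "translate the sentence into French.",
    "Translate the sentence into German.",
    "Summarize the article in one paragraph.",
    "Translate the sentence into French.",
]

for threshold in (0.9, 0.6):
    unique = dedupe(lines, threshold)
    print(threshold, len(lines), len(unique), len(lines) - len(unique))
```

At 0.9 the French and German prompts are both kept (they share 4 of 6 distinct words, about 0.67), while at 0.6 the German prompt is dropped as a near-duplicate of the first line. Each new line is compared against every kept line, so the cost of this sketch grows quadratically with the number of unique lines.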

Deduplicate training data, prompt examples, or any line-based text. Supports exact duplicate removal and near-duplicate detection using Jaccard word-set similarity with an adjustable threshold. See stats on how many lines were removed.