Approximate deduplication: we want to remove documents that are semantically very similar from a corpus. Previous text deduplication algorithms, such as MinHash or SimHash, operate on character or word n-grams, and therefore only find similarity between sequences that are orthographically similar; they ignore semantic similarity.
While deduplication sounds like something that only benefits LLM training, it is also really useful for checking small datasets for overlap: even approximate overlap between train and test leads to overestimating performance, while approximate duplicates in the training set lead to wasted compute, overestimation of feature importance, and potentially a host of other issues.
Additionally, deduplication techniques can give you a bird's-eye view of larger datasets: checking approximate duplicates using `semhash` takes (milli)seconds, and lets you see which items in your dataset look alike. If these make sense: great! If there are no duplicates… also great! Everything is better than training on incorrect data.
By running `semhash` with a low threshold, you can quickly get an overview of which documents are similar to others and which aren't. This gives you a good idea of what to focus on, what kind of things are missing from your data, and how your documents relate to one another.
`semhash` takes as input a collection of strings or dictionaries. You first initialize a model using a set of reference documents, and then use that set to deduplicate an incoming set. Any incoming document that is similar to a document from the reference set is removed, and stored separately together with its approximate duplicates from the reference set.
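A minimal sketch of that flow, assuming the `SemHash.from_records` and `deduplicate` API from the project README (the example sentences are ours):

```python
from semhash import SemHash

# Reference documents: anything similar to these will be flagged.
train = [
    "The cat sat on the mat.",
    "The dog lay on the rug.",
]

# Incoming documents to check against the reference set.
test = [
    "A cat sat on a mat.",  # near-duplicate of a reference document
    "A completely unrelated sentence.",
]

# Initialize a model on the reference documents...
semhash = SemHash.from_records(records=train)

# ...then deduplicate the incoming set against them.
result = semhash.deduplicate(records=test)

print(result.deduplicated)  # incoming documents that survived
print(result.duplicates)    # removed documents, stored with their reference-set matches
```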
`semhash` can also be used to investigate your dataset. Using `self_deduplicate`, you can deduplicate the training set itself, which we will use as a jumping-off point:
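(We use ag_news purely as an illustrative dataset here; the calls follow the project README.)

```python
from datasets import load_dataset
from semhash import SemHash

# Load a corpus to inspect.
texts = load_dataset("ag_news", split="train")["text"]

# Build the model on the training set and deduplicate it against itself.
semhash = SemHash.from_records(records=texts)
result = semhash.self_deduplicate()
```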
This returns a result object. First off, you can just get all deduplicated records:
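(Here we assume the result exposes the records under a `deduplicated` attribute, as in the project README.)

```python
deduplicated_texts = result.deduplicated
```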
This makes it easy to use `semhash` within other ML pipelines: `semhash` doesn't change your data, it just reduces it in size.
You can easily see the proportion of records that were duplicates:
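(The `duplicate_ratio` and `exact_duplicate_ratio` attributes below follow the project README.)

```python
print(result.duplicate_ratio)        # fraction of records flagged as duplicates
print(result.exact_duplicate_ratio)  # fraction that were exact duplicates
```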
Using `result.get_least_similar_from_duplicates`, you can inspect the least similar items that were still marked as duplicates. If these don't look like duplicates to you, raise the threshold until the results start making sense. In our experiments, the default threshold of 0.9 works fine, but be sure to check for your individual use-cases.
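A sketch of that inspection loop; the `threshold` keyword on `self_deduplicate` is our assumption about where the cut-off is configured:

```python
# Inspect the borderline cases: the least similar items still flagged as duplicates.
for duplicate in result.get_least_similar_from_duplicates():
    print(duplicate)

# If those don't look like duplicates to you, re-run with a stricter threshold.
result = semhash.self_deduplicate(threshold=0.95)  # assumed keyword, default is 0.9
```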
`semhash` also supports multi-column datasets, allowing you to deduplicate datasets that have text in multiple columns. For example, in a QA dataset you don't just want to deduplicate similar questions or similar contexts: you only want to count items as duplicates when both fields are sufficiently similar. This is a difficult problem to tackle, but `semhash` handles it as well.
The following snippet demonstrates how this works:
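(A sketch using SQuAD 2.0; the `columns` argument follows the project README, while the dataset and field names are our example choices.)

```python
from datasets import load_dataset
from semhash import SemHash

# Load a QA dataset with multiple text columns.
dataset = load_dataset("squad_v2", split="train")
records = [dict(row) for row in dataset]

# Deduplicate on both columns: a record only counts as a duplicate
# if its question *and* its context are sufficiently similar.
semhash = SemHash.from_records(records=records, columns=["question", "context"])
result = semhash.self_deduplicate()

print(result.duplicate_ratio)
```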