# Semantic Deduplication
SemHash can be used to efficiently deduplicate datasets based on semantic similarity. This allows you to remove not only exact duplicates and near-duplicates, but also samples that are semantically similar, unlike methods such as MinHash, which only detect matches that share exact n-grams.
This is particularly useful for cleaning up datasets where similar entries should not be counted multiple times, such as training datasets for machine learning models, or the knowledge base of a RAG application where you want to avoid redundancy.
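As a quick illustration, here is a minimal sketch with invented sentences; whether the paraphrase is actually filtered depends on the model and threshold you use:

```python
from semhash import SemHash

texts = [
    "The weather is lovely today.",
    "The weather is lovely today!",     # near-duplicate: differs only in punctuation
    "Today's weather is delightful.",   # semantic duplicate: shares almost no n-grams
    "The stock market closed higher.",  # unrelated record
]

# Build the index and remove duplicates within the list
semhash = SemHash.from_records(records=texts)
result = semhash.self_deduplicate()
print(result.selected)  # the near- and semantic duplicates should be filtered out
```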
## Initialize a SemHash Instance
To use SemHash for deduplication, you first need to initialize a SemHash instance with your dataset.
This will build an index, which can then be used for fast deduplication.
You can set several parameters here, such as the model to use.
The default model is `minishlab/potion-base-8M`, a lightweight model that works well for most English text datasets.
For multilingual datasets, use `minishlab/potion-multilingual-128M` instead.
Note that you can also use your own custom model, or any SentenceTransformer model (see the sketch after the parameters table below).
```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to index (here, the ag_news training split)
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)
```

### Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| records | Sequence[Record] | - | A list of records (strings or dictionaries). |
| columns | Sequence[str] \| None | None | Columns to featurize if records are dictionaries. |
| model | Encoder \| None | None | Optional Encoder model. If None, uses the default (minishlab/potion-base-8M). |
| ann_backend | Backend \| str | Backend.USEARCH | The ANN backend to use for similarity search. Options include Backend.USEARCH (default), Backend.FAISS, Backend.BASIC (exact search), and others supported by Vicinity. |
| **kwargs | Any | - | Additional keyword arguments to pass to the Vicinity index (e.g., backend-specific parameters). |
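For example, to pass a model explicitly, here is a minimal sketch assuming model2vec's StaticModel loader; the model parameter accepts any Encoder, including SentenceTransformer models:

```python
from datasets import load_dataset
from model2vec import StaticModel
from semhash import SemHash

# Load the multilingual encoder explicitly (any Encoder or
# SentenceTransformer model can be passed the same way)
model = StaticModel.from_pretrained("minishlab/potion-multilingual-128M")

texts = load_dataset("ag_news", split="train")["text"]
semhash = SemHash.from_records(records=texts, model=model)
```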
## Deduplicate a Single Dataset
To deduplicate a single dataset, you can use the `self_deduplicate` method.
This will remove semantic duplicates from the dataset.
```python
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().selected
```

### Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.9 | Similarity threshold for deduplication. |
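For example, continuing from the snippet above, you can raise the threshold so that only very close matches are removed:

```python
# A higher threshold keeps more records by only removing very close matches
deduplicated_texts = semhash.self_deduplicate(threshold=0.95).selected
```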
## Deduplicate Across Multiple Datasets
To deduplicate across multiple datasets, you can use the `deduplicate` method.
This allows you to remove duplicates from one dataset against another dataset, which is useful for ensuring that your test set does not overlap with your training set.
```python
from datasets import load_dataset
from semhash import SemHash

# Load two datasets to deduplicate
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance with the training data
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the test data against the training data, optionally with a specific threshold
deduplicated_test_texts = semhash.deduplicate(records=test_texts).selected
```

### Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| records | Sequence[Record] | - | The new set of records (e.g., a test set) to deduplicate against the fitted dataset. |
| threshold | float | 0.9 | Similarity threshold for deduplication. |
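For example, continuing from the snippet above, you can pass an explicit threshold when deduplicating the test set against the training data:

```python
# Remove only near-certain overlaps between the test and training sets
deduplicated_test_texts = semhash.deduplicate(records=test_texts, threshold=0.95).selected
```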
## Deduplicate a Multi-Column Dataset
If you have a multi-column dataset, you can deduplicate it by specifying the columns to use for deduplication.
For example, if you have a question-answering dataset with `question` and `context` columns, you can deduplicate based on both columns.
This filters out records whose questions and contexts are both similar, ensuring that you do not have redundant entries in your dataset.
This is useful for datasets like SQuAD, where the same question can appear with different contexts, and you want each question-context pair to be unique.
```python
from datasets import load_dataset
from semhash import SemHash

# Load the dataset
dataset = load_dataset("squad_v2", split="train")

# Convert the dataset to a list of dictionaries
records = [dict(row) for row in dataset]

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().selected
```

## Deduplicate Image Datasets
SemHash works with any modality, including images. Here’s an example using a vision model:
```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from semhash import SemHash

# Load image dataset and vision model
model = SentenceTransformer('clip-ViT-B-32')
train_dataset = load_dataset("uoft-cs/cifar10", split="train")
test_dataset = load_dataset("uoft-cs/cifar10", split="test")

# Initialize with training images
semhash = SemHash.from_records(list(train_dataset), columns=["img"], model=model)

# Single-dataset deduplication
deduplicated_train = semhash.self_deduplicate().selected

# Cross-dataset deduplication (remove test images that appear in train)
deduplicated_test = semhash.deduplicate(list(test_dataset)[:1000]).selected
```

See the Custom Models page for more details on using vision encoders.
## Inspecting Deduplication Results
The `DeduplicationResult` object provides powerful tools for understanding and refining your deduplication:
```python
from datasets import load_dataset
from semhash import SemHash

# Load and deduplicate a dataset
texts = load_dataset("ag_news", split="train")["text"]
semhash = SemHash.from_records(records=texts)
result = semhash.self_deduplicate()

# Access deduplicated and duplicate records
deduplicated_texts = result.selected
duplicate_texts = result.filtered

# View deduplication statistics
print(f"Duplicate ratio: {result.duplicate_ratio}")
print(f"Exact duplicate ratio: {result.exact_duplicate_ratio}")

# Find edge cases to tune your threshold
least_similar = result.get_least_similar_from_duplicates(n=5)

# Adjust threshold without re-deduplicating
result.rethreshold(0.95)

# View each kept record with its duplicate cluster
for item in result.selected_with_duplicates:
    print(f"Kept: {item.record}")
    print(f"Duplicates: {item.duplicates}")  # List of (duplicate_text, similarity_score)
```

### Key Attributes
- `selected`: The deduplicated records that were kept.
- `filtered`: The records that were removed as duplicates, each with:
  - `record`: The original record that was removed.
  - `exact`: Whether the record was an exact duplicate.
  - `duplicates`: A list of (record, similarity_score) pairs showing what it matched.
- `duplicate_ratio`: The ratio of duplicates found in the dataset.
- `exact_duplicate_ratio`: The ratio of exact duplicates found in the dataset.
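For example, continuing from the snippet above, a short sketch using these attributes to see why individual records were removed:

```python
# Inspect the first few removed records and what they matched
for dup in result.filtered[:3]:
    print(f"Removed: {dup.record}")
    print(f"Exact duplicate: {dup.exact}")
    for record, score in dup.duplicates:
        print(f"  Matched: {record} (score: {score:.2f})")
```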
### Key Methods
- `get_least_similar_from_duplicates(n)`: Returns the n least similar records from the duplicates. This helps you find the right deduplication threshold by showing borderline cases.
- `rethreshold(threshold)`: Re-applies the deduplication with a new threshold without rebuilding the index. This allows you to quickly adjust the sensitivity of the deduplication.
- `selected_with_duplicates`: Returns each kept record along with its duplicate cluster, useful for understanding what was grouped together.
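Putting these together, a sketch of a typical threshold-tuning workflow, reusing the result from the snippet above:

```python
# Inspect borderline duplicates to judge whether the current threshold fits
for item in result.get_least_similar_from_duplicates(n=3):
    print(item)

# If the borderline cases look distinct rather than redundant,
# raise the threshold without rebuilding the index
result.rethreshold(0.95)
print(f"Duplicate ratio after rethresholding: {result.duplicate_ratio}")
```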