Skip to main content
SemHash is a lightweight library for semantic deduplication, outlier filtering, and representative sample selection. It’s fully multimodal: text works out-of-the-box with fast Model2Vec embeddings, and you can bring your own encoders for images, audio, or custom models. SemHash supports both single-dataset operations (clean a training set) and cross-dataset operations (deduplicate test against train). It works with simple lists and complex multi-column datasets, and includes inspection tools to help you understand and refine results. All operations use Vicinity for efficient similarity search.

Quick Start

Install SemHash with the following command:
pip install semhash

Text Deduplication

Use SemHash to deduplicate text datasets (this example assumes you have the datasets library installed):
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().selected

# Filter outliers
filtered_texts = semhash.self_filter_outliers().selected

# Find representative texts
representative_texts = semhash.self_find_representative().selected

Image Deduplication

Use SemHash with vision models for image deduplication (requires pip install sentence-transformers):
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from semhash import SemHash

# Load an image dataset and vision model
model = SentenceTransformer("clip-ViT-B-32")
dataset = load_dataset("uoft-cs/cifar10", split="train")

# Initialize a SemHash instance with the 'img' column
semhash = SemHash.from_records(list(dataset), columns=["img"], model=model)

# Deduplicate the images
deduplicated_images = semhash.self_deduplicate().selected

# Filter outliers
filtered_images = semhash.self_filter_outliers().selected

# Find representative images
representative_images = semhash.self_find_representative().selected
For advanced usage and custom encoders, check out the other documentation pages.