SemHash is a lightweight and flexible tool for deduplicating datasets, filtering outliers, and finding representative samples using semantic similarity.
It combines fast embedding generation from Model2Vec with efficient ANN-based similarity search through Vicinity.
SemHash supports both single-dataset deduplication & filtering (e.g., cleaning up a train set by removing duplicates and outliers) and multi-dataset deduplication & filtering (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.
Once SemHash is installed, you can use it like this to deduplicate, filter outliers, and select representative samples (this example assumes you have the datasets library installed):
from datasets import load_dataset
from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate().selected

# Filter outliers
filtered_texts = semhash.self_filter_outliers().selected

# Find representative texts
representative_texts = semhash.self_find_representative().selected
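To make the idea behind single-dataset and multi-dataset deduplication more concrete, here is a minimal conceptual sketch in plain Python. It is not SemHash's implementation: it uses hand-written toy vectors instead of Model2Vec embeddings, a brute-force comparison instead of ANN search through Vicinity, and an assumed cosine-similarity threshold of 0.9. The function names `cosine`, `self_deduplicate`, and `cross_deduplicate` are hypothetical and exist only for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def self_deduplicate(vectors, threshold=0.9):
    """Single-dataset case: keep a record only if it is not too similar
    to any record that has already been kept. Returns kept indices."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def cross_deduplicate(reference, candidates, threshold=0.9):
    """Multi-dataset case: keep a candidate only if it is not too similar
    to any record in the reference set. Returns kept candidate indices."""
    return [
        i
        for i, vec in enumerate(candidates)
        if all(cosine(vec, ref) < threshold for ref in reference)
    ]


# Toy 2-D "embeddings" (made up for illustration, not real model output).
train = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
test = [[0.0, 0.98], [0.7, 0.7]]

kept_train = self_deduplicate(train)        # second vector nearly matches the first
kept_test = cross_deduplicate(train, test)  # first test vector overlaps with train

print(kept_train)
print(kept_test)
```

The brute-force loops compare every pair and so scale quadratically, which is why a library aimed at large datasets would rely on an approximate nearest neighbor index rather than this exhaustive approach.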
For advanced usage, check out the other documentation pages.