SemHash
Fast Semantic Text Deduplication & Filtering
SemHash is a lightweight and flexible tool for deduplicating datasets, filtering outliers, and finding representative samples using semantic similarity. It combines fast embedding generation from Model2Vec with efficient ANN-based similarity search through Vicinity.
SemHash supports both single-dataset deduplication & filtering (e.g., cleaning up a train set by removing duplicates and outliers) and multi-dataset deduplication & filtering (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.
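To make the two modes concrete, here is a toy sketch of threshold-based semantic deduplication. It is not SemHash's implementation: the hand-written 2-D vectors stand in for Model2Vec embeddings, a brute-force cosine comparison stands in for the ANN search done by Vicinity, and the function names and the 0.9 threshold are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_self_deduplicate(vectors, threshold=0.9):
    """Single-dataset mode: keep a vector only if it is not too similar
    to any vector that was already kept. Returns the kept indices."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def toy_cross_deduplicate(reference, candidates, threshold=0.9):
    """Multi-dataset mode: keep candidates (e.g. test samples) that are not
    too similar to any reference vector (e.g. train samples)."""
    return [
        i
        for i, vec in enumerate(candidates)
        if all(cosine(vec, ref) < threshold for ref in reference)
    ]


# Hypothetical "embeddings": the first two train vectors are near-duplicates.
train = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
# The first test vector nearly coincides with the third train vector.
test = [[0.02, 1.0], [0.7, 0.7]]

train_kept = toy_self_deduplicate(train)
test_kept = toy_cross_deduplicate(train, test)
```

The brute-force loops here cost quadratic time, which is the cost an approximate nearest-neighbor index is meant to avoid on large datasets.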
Quick Start
Install SemHash from PyPI with the following command: pip install semhash
Then, you can use it like this to deduplicate, filter outliers, and select representative samples (this example assumes you have the datasets library installed):
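A minimal sketch of that workflow follows. It assumes the SemHash Python API exposes `SemHash.from_records`, `self_deduplicate`, `self_filter_outliers`, and `self_find_representative`, each returning a result object with a `selected` attribute; check the installed version if any of these differ. The helper name `clean_texts` and the `ag_news` dataset in the comments are illustrative choices, not part of the library.

```python
def clean_texts(texts):
    """Run the three SemHash operations on a list of strings.

    Hypothetical helper: it only wraps the calls described in the lead-in.
    """
    # Imported lazily so that defining this helper does not require SemHash.
    from semhash import SemHash

    # Build a SemHash instance from the records. With no model argument,
    # the library is assumed to fall back to a default Model2Vec model.
    semhash = SemHash.from_records(records=texts)

    return {
        # Records that remain after removing semantic near-duplicates.
        "deduplicated": semhash.self_deduplicate().selected,
        # Records that remain after removing outliers.
        "filtered": semhash.self_filter_outliers().selected,
        # Records chosen as representative samples.
        "representative": semhash.self_find_representative().selected,
    }


# Example call (needs the datasets library, SemHash, and network access):
#
#   from datasets import load_dataset
#   texts = list(load_dataset("ag_news", split="train")["text"])
#   results = clean_texts(texts)
#   print(len(texts), len(results["deduplicated"]))
```

Each operation is called on the same instance, so the records are embedded and indexed once rather than once per operation.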
For advanced usage, check out the other documentation pages.