Initialize a SemHash Instance
To use SemHash for deduplication, you first need to initialize a SemHash instance with your dataset.
This will build an index, which can then be used for fast deduplication.
You can set several parameters here, such as the model to use.
The default model is minishlab/potion-base-8M, which is a lightweight model that works well for most English text datasets.
For multilingual datasets, you can use minishlab/potion-multilingual-128M, which is optimized for multilingual data.
Note that you can also use your own custom model, or any SentenceTransformer model.
Parameters
A list of records (strings or dictionaries).
Columns to featurize if records are dictionaries.
Optional Encoder model. If None, uses the default (minishlab/potion-base-8M).
The ANN backend to use for similarity search. Options include Backend.USEARCH (default), Backend.FAISS, Backend.BASIC (exact search), and others supported by Vicinity.
Additional keyword arguments to pass to the Vicinity index (e.g., backend-specific parameters).
Deduplicate a Single Dataset
To deduplicate a single dataset, you can use the self_deduplicate method.
This will remove semantic duplicates from the dataset.
Parameters
Similarity threshold for deduplication.
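To make the role of the threshold concrete, here is a minimal, standard-library sketch of threshold-based self-deduplication. It is not SemHash's implementation: the `cosine` helper uses word counts as a toy stand-in for model embeddings, the search is brute force rather than an ANN index, and the name `toy_self_deduplicate` is hypothetical.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (toy stand-in for model embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def toy_self_deduplicate(records: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a record only if it scores below the threshold against every record kept so far."""
    kept: list[str] = []
    for record in records:
        if all(cosine(record, other) < threshold for other in kept):
            kept.append(record)
    return kept


texts = [
    "the cat sat on the mat",
    "the cat sat on the mat",     # exact duplicate
    "the cat sat on a mat",       # near duplicate
    "stock prices fell sharply",  # unrelated
]
strict = toy_self_deduplicate(texts, threshold=0.99)  # drops only the exact duplicate
loose = toy_self_deduplicate(texts, threshold=0.8)    # also drops the near duplicate
```

A higher threshold removes fewer records, and a lower one treats looser paraphrases as duplicates.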
Deduplicate Across Multiple Datasets
To deduplicate across multiple datasets, you can use the deduplicate method.
This allows you to remove duplicates from one dataset against another dataset, which is useful for ensuring that your test set does not overlap with your training set.
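The train/test case can be sketched with the standard library alone. This is an illustration under assumptions, not SemHash's implementation: `cosine` is a toy word-count similarity, `toy_deduplicate` is a hypothetical name, and the sketch only compares candidates against the reference set (it does not deduplicate the candidates among themselves).

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (toy stand-in for model embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def toy_deduplicate(reference: list[str], candidates: list[str], threshold: float = 0.9) -> list[str]:
    """Return the candidates that stay below the threshold against every reference record."""
    return [c for c in candidates if all(cosine(c, r) < threshold for r in reference)]


train = ["how do I reset my password", "what is the refund policy"]
test = ["how do I reset my password", "where can I download the invoice"]

# Remove test records that overlap with the training set.
clean_test = toy_deduplicate(train, test, threshold=0.9)
```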
Deduplicate a Multi-Column Dataset
If you have a multi-column dataset, you can deduplicate it by specifying the columns to use for deduplication. For example, if you have a question-answering dataset with question and context columns, you can deduplicate based on both columns.
This will filter out records that have similar questions and contexts, ensuring that you do not have redundant entries in your dataset.
This is useful for datasets like SQuAD, where you can have the same question asked with different contexts, and you want to ensure that each question-context pair is unique.
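One way to picture pair-level uniqueness is the sketch below. How SemHash combines columns internally is not described here, so this makes an explicit assumption: a record counts as a duplicate only when every listed column clears the threshold against the same kept record. The `cosine` helper and `toy_multi_column_deduplicate` are hypothetical, standard-library stand-ins.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (toy stand-in for model embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def toy_multi_column_deduplicate(
    records: list[dict[str, str]], columns: list[str], threshold: float = 0.9
) -> list[dict[str, str]]:
    """Drop a record only if all listed columns match some already-kept record."""
    kept: list[dict[str, str]] = []
    for record in records:
        is_duplicate = any(
            all(cosine(record[col], other[col]) >= threshold for col in columns)
            for other in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept


qa = [
    {"question": "who wrote hamlet", "context": "hamlet is a tragedy by william shakespeare"},
    {"question": "who wrote hamlet", "context": "hamlet is a tragedy by william shakespeare"},
    {"question": "who wrote hamlet", "context": "the play was first performed around 1600"},
]
unique_pairs = toy_multi_column_deduplicate(qa, columns=["question", "context"])
```

The second record is dropped because both columns match the first, while the third survives because its context differs even though the question is identical.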
Deduplicate Image Datasets
SemHash works with any modality, including images, for example by using a vision model as the encoder.
Inspecting Deduplication Results
The DeduplicationResult object provides powerful tools for understanding and refining your deduplication:
Key Attributes
selected: The deduplicated records that were kept
filtered: The records that were removed as duplicates, each with:
  record: The original record that was removed
  exact: Whether the record was an exact duplicate
  duplicates: A list of (record, similarity_score) pairs showing what it matched
duplicate_ratio: The ratio of duplicates found in the dataset
exact_duplicate_ratio: The ratio of exact duplicates found in the dataset
Key Methods
get_least_similar_from_duplicates(n): Returns the n least similar records from the duplicates. This helps you find the right deduplication threshold by showing borderline cases.
rethreshold(threshold): Re-applies the deduplication with a new threshold without rebuilding the index. This allows you to quickly adjust the sensitivity of the deduplication.
selected_with_duplicates: Returns each kept record along with its duplicate cluster, useful for understanding what was grouped together.
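The sketch below suggests why storing similarity scores makes this kind of inspection cheap. It is a toy model, not the real DeduplicationResult: the classes `ToyDuplicateRecord` and `ToyResult`, the triple returned by the borderline helper, and the rule that only a stricter threshold can be applied from stored scores are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyDuplicateRecord:
    record: str
    exact: bool
    duplicates: list[tuple[str, float]]  # (matched record, similarity score)


@dataclass
class ToyResult:
    selected: list[str]
    filtered: list[ToyDuplicateRecord]
    threshold: float

    @property
    def duplicate_ratio(self) -> float:
        total = len(self.selected) + len(self.filtered)
        return len(self.filtered) / total if total else 0.0

    def get_least_similar_from_duplicates(self, n: int = 1) -> list[tuple[str, str, float]]:
        """Return the n weakest (removed, matched, score) triples, i.e. the borderline cases."""
        triples = [(d.record, m, s) for d in self.filtered for m, s in d.duplicates]
        return sorted(triples, key=lambda t: t[2])[:n]

    def rethreshold(self, threshold: float) -> None:
        """Apply a stricter threshold from stored scores; no similarity is recomputed."""
        if threshold < self.threshold:
            raise ValueError("stored scores only support a higher threshold")
        still_filtered = []
        for d in self.filtered:
            d.duplicates = [(m, s) for m, s in d.duplicates if s >= threshold]
            if d.duplicates:
                still_filtered.append(d)
            else:
                self.selected.append(d.record)  # no match remains, so the record is kept
        self.filtered = still_filtered
        self.threshold = threshold


result = ToyResult(
    selected=["the cat sat on the mat"],
    filtered=[
        ToyDuplicateRecord("the cat sat on the mat", True, [("the cat sat on the mat", 1.0)]),
        ToyDuplicateRecord("the cat sat on a mat", False, [("the cat sat on the mat", 0.87)]),
    ],
    threshold=0.8,
)
borderline = result.get_least_similar_from_duplicates(1)
ratio_before = result.duplicate_ratio
result.rethreshold(0.95)  # the 0.87 match no longer counts as a duplicate
ratio_after = result.duplicate_ratio
```

Inspecting the borderline pair first, then raising the threshold just above its score, is the workflow the methods above are meant to support.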
