By default, SemHash uses potion-base-8M as its embedding model.
This model is very fast and works well for most use cases. However, it is trained on English data
and may not perform as well as larger models. Fortunately, you can easily swap the model: any model that follows our
encoder protocol will work.
This means that Model2Vec models and Sentence Transformers work out of the box.
We recommend using potion-multilingual-128M for multilingual datasets.
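If your model is neither of those, you can wrap it in a small adapter that satisfies the protocol. The sketch below is a toy illustration, not a real embedding model: it assumes the protocol only requires an `encode` method that maps a list of strings to a 2D numpy array, and the hashed character-trigram "embeddings" are a stand-in for real model outputs.

import numpy as np

from semhash import SemHash

class ToyTrigramEncoder:
    """Toy encoder: hashed character-trigram counts stand in for embeddings."""

    def __init__(self, dim: int = 256) -> None:
        self.dim = dim

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # One row per sentence, one column per hash bucket
        embeddings = np.zeros((len(sentences), self.dim), dtype=np.float32)
        for row, sentence in enumerate(sentences):
            for start in range(max(len(sentence) - 2, 0)):
                bucket = hash(sentence[start : start + 3]) % self.dim
                embeddings[row, bucket] += 1.0
        return embeddings

# Any object with a compatible encode method can be passed as the model
semhash = SemHash.from_records(records=["a cat", "a cat", "a dog"], model=ToyTrigramEncoder())
deduplicated_texts = semhash.self_deduplicate()

In practice, you would replace the toy encoder with an adapter around whatever model produces your embeddings.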
The following example shows how to use a Model2Vec model with SemHash:
from datasets import load_dataset
from model2vec import StaticModel

from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Load an embedding model (in this example, a multilingual model)
model = StaticModel.from_pretrained("minishlab/potion-multilingual-128M")

# Initialize a SemHash with the model and custom encoder
semhash = SemHash.from_records(records=texts, model=model)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
The following example shows how to use a Sentence Transformer with SemHash:
from datasets import load_dataset

from semhash import SemHash
from sentence_transformers import SentenceTransformer

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# Load a sentence-transformers model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Initialize a SemHash with the model and custom encoder
semhash = SemHash.from_records(records=texts, model=model)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()
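Finally, if you do not pass a model at all, SemHash falls back to the default potion-base-8M encoder mentioned above:

from datasets import load_dataset

from semhash import SemHash

# Load a dataset to deduplicate
texts = load_dataset("ag_news", split="train")["text"]

# With no model argument, SemHash uses its default embedding model (potion-base-8M)
semhash = SemHash.from_records(records=texts)

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate()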