<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Minish | Blog</title><description>Fast open-source NLP models and packages</description><link>https://minish.ai/</link><language>en</language><item><title>Model2Vec Size Improvements</title><link>https://minish.ai/blog/2025-10-05-size-blogpost/</link><guid isPermaLink="true">https://minish.ai/blog/2025-10-05-size-blogpost/</guid><description>In this blogpost, we showcase the various size reduction techniques we implemented in Model2Vec, and how they can be combined to create tiny models (~6 MB) with minimal performance loss.

</description><pubDate>Sun, 05 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Over the past year, we’ve implemented several ways to reduce the size of Model2Vec models. Due to the nature of our distillation technique, Model2Vec distilled models are already relatively compact, but we can make them even smaller (~6 MB), as we will show in this blogpost.
This can be beneficial for deployment in resource-constrained environments such as edge and mobile devices, where memory and storage are limited.
It also means we can load models faster, and serve more models at the same time.&lt;/p&gt;
&lt;p&gt;Since all the parameters in a Model2Vec model are in the embedding matrix, we can reduce size in three ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;By reducing the dimensionality of the embeddings&lt;/li&gt;
&lt;li&gt;By reducing the precision of the embeddings (quantization)&lt;/li&gt;
&lt;li&gt;By reducing the number of embeddings (the vocabulary size)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With our latest release, we can now directly modify all of these in Model2Vec. Let’s go over them one by one!&lt;/p&gt;
&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;
&lt;p&gt;We use the following three techniques to reduce model size:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Principal Component Analysis (PCA) (available since our initial release)&lt;/li&gt;
&lt;li&gt;Quantization (available since v0.5.0)&lt;/li&gt;
&lt;li&gt;Vocabulary Quantization (our shiny new feature which we just released in v0.7.0)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;1-pca&quot;&gt;1. PCA&lt;/h3&gt;
&lt;p&gt;The first and most straightforward way to reduce model size is dimensionality reduction, which we do with PCA. Most embedding models operate at high dimensions (e.g. 768), which is a lot more than we (usually) need for static embedding models.&lt;/p&gt;
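&lt;p&gt;For reference, here’s a minimal sketch of what this looks like in code (the &lt;code dir=&quot;auto&quot;&gt;distill&lt;/code&gt; call follows the Model2Vec README; check the docs for the full signature):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from model2vec.distill import distill

# Distill a sentence transformer into a static model, reducing the
# embeddings to 256 dimensions with PCA along the way.
model = distill(model_name=&quot;BAAI/bge-base-en-v1.5&quot;, pca_dims=256)
model.save_pretrained(&quot;bge-base-256d&quot;)
&lt;/code&gt;&lt;/pre&gt;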
&lt;h3 id=&quot;2-quantization&quot;&gt;2. Quantization&lt;/h3&gt;
&lt;p&gt;Next up is quantization. By default, embeddings are stored as 32-bit floats. By quantizing them to 16-bit floats, or even 8-bit integers, we can cut storage requirements by 2x-4x.&lt;/p&gt;
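&lt;p&gt;Conceptually, INT8 quantization boils down to the following (an illustrative sketch, not the Model2Vec internals):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# A float32 embedding matrix of 30k tokens x 256 dims takes ~30 MB.
embeddings = np.random.randn(30_000, 256).astype(np.float32)

# Scale values into the int8 range, keeping the scale for dequantization.
scale = np.abs(embeddings).max() / 127.0
quantized = np.round(embeddings / scale).astype(np.int8)  # ~7.5 MB, 4x smaller

# At load time, we can approximately recover the original vectors.
dequantized = quantized.astype(np.float32) * scale
&lt;/code&gt;&lt;/pre&gt;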
&lt;h3 id=&quot;3-vocabulary-quantization&quot;&gt;3. Vocabulary quantization&lt;/h3&gt;
&lt;p&gt;Finally, we can modify the vocabulary itself. Large vocabularies are expensive: every token needs its own vector. But many tokens are rare, and some are near-duplicates. With vocabulary quantization, we cluster embeddings using k-means and merge them, effectively compressing the vocabulary without throwing away coverage.&lt;/p&gt;
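&lt;p&gt;The core idea looks roughly like this (an illustrative sketch using scikit-learn, not the Model2Vec internals):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Cluster the token embeddings and map every token to its centroid, so we
# only store the centroids plus a small per-token cluster index.
embeddings = np.random.randn(30_000, 256).astype(np.float32)
kmeans = MiniBatchKMeans(n_clusters=20_000).fit(embeddings)

centroids = kmeans.cluster_centers_  # the compressed embedding matrix
token_to_cluster = kmeans.labels_    # lookup from token id to centroid row
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tokens that land in the same cluster now share a vector, which is exactly the merging of near-duplicates described above.&lt;/p&gt;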
&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
&lt;p&gt;Here’s how the different strategies stack up. For these experiments, we start with a distilled &lt;a href=&quot;https://huggingface.co/BAAI/bge-base-en-v1.5&quot;&gt;bge-base-en-v1.5&lt;/a&gt; model using default parameters (baseline).&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;Size&lt;/th&gt;&lt;th&gt;Average (MTEB)&lt;/th&gt;&lt;th&gt;Drop vs. Baseline&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Baseline (768d, FP32)&lt;/td&gt;&lt;td&gt;92 MB&lt;/td&gt;&lt;td&gt;46.69&lt;/td&gt;&lt;td&gt;–&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;+ PCA (256d)&lt;/td&gt;&lt;td&gt;32 MB&lt;/td&gt;&lt;td&gt;46.63&lt;/td&gt;&lt;td&gt;-0.06&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;+ Quantization (INT8)&lt;/td&gt;&lt;td&gt;9 MB&lt;/td&gt;&lt;td&gt;46.60&lt;/td&gt;&lt;td&gt;-0.09&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;+ Vocab quantization (20k clusters)&lt;/td&gt;&lt;td&gt;6 MB&lt;/td&gt;&lt;td&gt;45.99&lt;/td&gt;&lt;td&gt;-0.70&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;As you can see, we can shrink a &lt;strong&gt;92 MB&lt;/strong&gt; model down to &lt;strong&gt;6 MB (15x smaller!)&lt;/strong&gt; while losing less than 1% performance on MTEB.
Another interesting observation is that PCA and quantization have a very small effect on performance, and can essentially be applied without any trade-offs.
Note that the base model’s vocabulary is already quite small (~30k tokens). We expect vocabulary quantization to have a bigger effect on models with larger vocabularies (e.g. multilingual models), which we will explore in future work.&lt;/p&gt;
&lt;p&gt;As always, we’d love to hear your feedback — let us know what you’re building with these tiny models, and if you want to try this yourself, grab the latest &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;Model2Vec release&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title>Model2Vec as a fasttext alternative</title><link>https://minish.ai/blog/2025-07-28-fasttext/</link><guid isPermaLink="true">https://minish.ai/blog/2025-07-28-fasttext/</guid><description>In this blogpost, we compare Model2Vec and fastText. We show that Model2Vec is faster, smaller, and more performant.

</description><pubDate>Mon, 28 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Model2Vec is typically viewed as a fast alternative to a sentence transformer. There are good reasons for that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Model2Vec models are distilled versions of sentence transformers&lt;/li&gt;
&lt;li&gt;Model2Vec models are drop-in replacements for sentence transformers&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Having said that, a better comparison would actually be Meta’s &lt;a href=&quot;https://fasttext.cc/&quot;&gt;fasttext&lt;/a&gt;. Like Model2Vec, fasttext can be used to create static vectors, and can also be used to create classifiers using a set of static vectors as a starting point. In this short blog post, we’ll show that off-the-shelf Model2Vec models are much better than fasttext classifiers and word vectors. So, if you’re currently using fasttext somewhere, consider a comparison to Model2Vec!&lt;/p&gt;
&lt;h1 id=&quot;classification&quot;&gt;Classification&lt;/h1&gt;
&lt;p&gt;To test the classification efficacy of Model2Vec in comparison to fasttext, we ran experiments on 15 datasets from the &lt;a href=&quot;https://huggingface.co/SetFit&quot;&gt;setfit organization on Hugging Face&lt;/a&gt;. We initialized the fasttext classifier using the &lt;code dir=&quot;auto&quot;&gt;wiki-news-300d-1M-subword.vec&lt;/code&gt; vectors available &lt;a href=&quot;https://fasttext.cc/docs/en/english-vectors.html&quot;&gt;here&lt;/a&gt;, and used the nltk &lt;code dir=&quot;auto&quot;&gt;word_tokenize&lt;/code&gt; function as a tokenizer. The Model2Vec model was initialized from &lt;a href=&quot;https://huggingface.co/minishlab/potion-base-32M&quot;&gt;&lt;code dir=&quot;auto&quot;&gt;minishlab/potion-base-32M&lt;/code&gt;&lt;/a&gt;. All models were trained with sensible defaults. We optimized the tokenization for the fasttext model.&lt;/p&gt;
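&lt;p&gt;For a rough idea of the two setups, here’s a hedged sketch (data preparation is elided; &lt;code dir=&quot;auto&quot;&gt;train_texts&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;train_labels&lt;/code&gt; are assumed to be prepared lists):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import fasttext

from model2vec.train import StaticModelForClassification

# fasttext expects a file with one &quot;__label__X text&quot; line per sample.
ft_model = fasttext.train_supervised(input=&quot;train.txt&quot;)

# Model2Vec trains a classifier head on top of the static embeddings.
clf = StaticModelForClassification.from_pretrained(model_name=&quot;minishlab/potion-base-32M&quot;)
clf.fit(train_texts, train_labels)
&lt;/code&gt;&lt;/pre&gt;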
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p&gt;Here’s the full table for both approaches:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;dataset_name&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Model2Vec&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;fasttext&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;20_newsgroups&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;66.18&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;57.11&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;ade&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;88.05&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;86.14&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;ag_news&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;91.61&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;92.18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;amazon_counterfactual&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;80.64&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;82.58&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;bbc&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;96.93&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;95.86&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;emotion&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;79.7&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;79.35&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;enron_spam&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;98.85&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;98.85&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;hatespeech_offensive&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;70.94&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;69.48&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;imdb&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;87.79&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;89.51&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;massive_scenario&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;88.78&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;87.26&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;senteval_cr&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;79.12&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;76.2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;sst5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;41.49&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;19.59&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;student&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;93.76&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;93.12&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;subj&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;91.94&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;92.6&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;tweet_sentiment_extraction&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.14&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.98&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Average&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;81.99&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;79.25&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Model2Vec outperforms fasttext on average, although with smaller gains than we anticipated.&lt;/p&gt;
&lt;h2 id=&quot;training-time&quot;&gt;Training time&lt;/h2&gt;
&lt;p&gt;fasttext models train faster than Model2Vec models. Note that this only concerns the actual training of the supervised classifier; we don’t include any pretraining time. One observation is that Model2Vec models tend to train for a bit too long. That said, training takes less than a minute for either approach, so the difference hardly matters in practice.&lt;/p&gt;
&lt;h2 id=&quot;inference-time&quot;&gt;Inference time&lt;/h2&gt;
&lt;p&gt;Model2Vec processes about 14.6k samples per second, while fasttext processes about 3.6k.&lt;/p&gt;
&lt;p&gt;This comes with the caveat that the inference time of both approaches is difficult to compare, since almost all of the time for both models is actually spent in the tokenizer. Disabling any preprocessing for fasttext makes it faster than Model2Vec (3.6k -&gt; 25k (!) samples/second), albeit with a hit to performance (79.5 -&gt; 78.5 average score). This underscores one of the painful issues of older NLP approaches: preprocessing/tokenization matters a lot, and it is difficult to align your tokenization with models found online.&lt;/p&gt;
&lt;h2 id=&quot;model-size&quot;&gt;Model size&lt;/h2&gt;
&lt;p&gt;The trained Model2Vec model is only 130 MB on disk, while the fasttext model is substantially larger at 2.1 GB. Note, however, that both Model2Vec and fasttext can be compressed through quantization.&lt;/p&gt;
&lt;h1 id=&quot;zero-shot-mteb&quot;&gt;Zero shot (MTEB)&lt;/h1&gt;
&lt;p&gt;fasttext vectors can also be used as static word embeddings, much like word2vec. As such, we can test how well they work on the &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;Massive Text Embedding Benchmark (MTEB)&lt;/a&gt; as a zero-shot embedding approach, comparing directly with a Model2Vec model. Following the above, we use the nltk &lt;code dir=&quot;auto&quot;&gt;word_tokenize&lt;/code&gt; function to tokenize text going into fasttext, and also normalize all the output vectors to unit length. We perform no additional preprocessing for Model2Vec. Because running MTEB can take a long time, we don’t run all subsets. We use the &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;original MTEB benchmark&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
&lt;p&gt;It’s honestly not looking too good for fasttext. Model2Vec blows it out of the water on all tasks except WordSim, which is a set of word similarity tasks.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;fasttext&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Model2Vec&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Classification&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;51.97&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;65.97&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Clustering&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;22.25&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;35.29&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PairClassification&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.89&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;78.17&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Reranking&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;40.7&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.92&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;STS&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.2&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.22&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Summarization&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.41&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.78&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;WordSim&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;59.29&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.15&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The only task on which fasttext performs well is &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt;. This is interesting, because these kinds of lexical similarity datasets were popular around the time fasttext, GloVe, word2vec, and other static methods were initially created. So this could be one of the reasons these vectors work well: methods are developed with reference to the evaluation data that is available at the time of development.&lt;/p&gt;
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;If you’re still using fastText, be it for classification or for word embeddings, it’s probably time for an upgrade. Model2Vec offers smaller models, faster inference, and better downstream results in most cases. Give it a try, and benchmark for yourself!&lt;/p&gt;</content:encoded></item><item><title>Tokenlearn 0.2.0</title><link>https://minish.ai/blog/2025-05-31-tokenlearn-release/</link><guid isPermaLink="true">https://minish.ai/blog/2025-05-31-tokenlearn-release/</guid><description>We’ve released a new version of Tokenlearn! It contains usability improvements, fixes some bugs, and has a new learning algorithm under the hood that improves performance. Read on to see what it does and how you can use it.

</description><pubDate>Sat, 31 May 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We’ve released a new version of tokenlearn! It contains usability improvements, fixes some bugs, and has a new learning algorithm under the hood that improves performance. Read on to see what it does and how you can use it.&lt;/p&gt;
&lt;h2 id=&quot;why-use-tokenlearn&quot;&gt;Why use tokenlearn?&lt;/h2&gt;
&lt;p&gt;Tokenlearn is a way to improve a distilled Model2Vec model by performing an additional knowledge distillation step using the base model (the sentence transformer you distilled) and a distilled Model2Vec model. The Model2Vec model is trained to directly mimic the vectors produced by the base model, which leads to massive improvements. Notably, this does not require any labeled data.&lt;/p&gt;
&lt;p&gt;As an example: our new tokenlearn version was used to train our multilingual flagship model, &lt;a href=&quot;https://huggingface.co/minishlab/potion-multilingual-128M&quot;&gt;potion-multilingual-128M&lt;/a&gt;. This model performs at about the same level as &lt;a href=&quot;https://huggingface.co/sentence-transformers/static-similarity-mrl-multilingual-v1&quot;&gt;static-similarity-mrl-multilingual-v1&lt;/a&gt; (which we will call MRL). The main difference between the two is how they were trained: MRL has been trained on 8.5 million cross-lingually aligned sentence pairs, while potion-multilingual has only been trained on &lt;em&gt;2 million random C4 passages&lt;/em&gt;. This shows the power of tokenlearn! You can adapt any Model2Vec model to a specific domain with a small number of short documents, no annotations needed.&lt;/p&gt;
&lt;h2 id=&quot;how-does-it-work&quot;&gt;How does it work?&lt;/h2&gt;
&lt;p&gt;Before starting on what’s new, let’s first go into how you can use tokenlearn. First, you need to select a base model, i.e., a sentence transformer you like using, and a dataset from which you will sample passages. For this, you need to use the &lt;code dir=&quot;auto&quot;&gt;featurize&lt;/code&gt; function.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datasets import load_dataset
from sentence_transformers import SentenceTransformer

from tokenlearn.featurize import featurize

my_corpus = load_dataset(&quot;allenai/c4&quot;, &quot;en&quot;, split=&quot;train&quot;, streaming=True)
model = SentenceTransformer(&quot;baai/bge-base-en-v1.5&quot;)
output_dir = &quot;my_corpus_featurized&quot;

featurize(
    dataset=my_corpus,
    model=model,
    output_dir=output_dir,
    max_means=2_000_000,
    batch_size=32,
    text_key=&quot;text&quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Leave this running for a while, and you will get a set of documents and means in &lt;code dir=&quot;auto&quot;&gt;output_dir&lt;/code&gt;. Note that this script can be resumed: if the arguments are the same, the embedding computation will pick up where you left off. Now that you have the documents in this directory, you can fit a model on them.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pathlib import Path

from tokenlearn.train import train_model
from tokenlearn.utils import collect_means_and_texts

model_name = &quot;baai/bge-base-en-v1.5&quot;
data_dir = &quot;my_corpus_featurized&quot;
vocab_size = 250_000

# Collect paths for training data
paths = sorted(Path(data_dir).glob(&quot;*.json&quot;))
train_txt, train_vec = collect_means_and_texts(paths)

model = train_model(
    model_name,
    train_txt,
    train_vec,
    device=None,
    vocab_size=vocab_size,
    pca_dims=512
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this command will get you a trained potion-like model, specifically fit for your domain. Two relevant options to keep in mind are &lt;code dir=&quot;auto&quot;&gt;vocab_size&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;pca_dims&lt;/code&gt;. These control the number of rows and columns in your embedding matrix, respectively.&lt;/p&gt;
&lt;p&gt;In general, setting &lt;code dir=&quot;auto&quot;&gt;pca_dims&lt;/code&gt; to 256 or 512 should be good enough for most problems, and depends on the explained variance of your target vectors.&lt;/p&gt;
&lt;p&gt;Setting the &lt;code dir=&quot;auto&quot;&gt;vocab_size&lt;/code&gt; parameter is more complicated. If &lt;code dir=&quot;auto&quot;&gt;vocab_size&lt;/code&gt; is &gt; 0, we tokenize all texts before training, and select &lt;code dir=&quot;auto&quot;&gt;vocab_size&lt;/code&gt; words to add to the vocabulary of the distilled model based on their frequency. Whether this is useful really depends on the size of your training corpus, and how well it matches with your downstream task. If there’s a lot of lexical overlap between the two, you can see a large improvement in performance, although at significant memory costs, as each added vocabulary item adds a whole row to your embedding matrix. Even setting &lt;code dir=&quot;auto&quot;&gt;vocab_size&lt;/code&gt; to 0 will improve performance over a raw distill, however.&lt;/p&gt;
&lt;h2 id=&quot;what-does-it-do&quot;&gt;What does it do?&lt;/h2&gt;
&lt;p&gt;In short, &lt;code dir=&quot;auto&quot;&gt;tokenlearn&lt;/code&gt; training:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;distills a Model2Vec model for you from a base model&lt;/li&gt;
&lt;li&gt;adds vocabulary (if any) to the vocabulary of your model&lt;/li&gt;
&lt;li&gt;performs PCA on the target embeddings we made using the base model&lt;/li&gt;
&lt;li&gt;performs knowledge distillation by training the Model2Vec model to match the target embeddings&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The knowledge distillation step is extremely simple: we minimize the Mean Squared Error (MSE) between the output vectors of the Model2Vec model and the output vectors of the base model, using a held-out set to perform early stopping. We optimize the embeddings and the norms of the static model separately, because we want to decouple the semantics of the token embeddings from the weight they have in a mean, and also want to encourage the model to pay attention to the weight each individual token has.&lt;/p&gt;
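&lt;p&gt;To make that concrete, here’s a conceptual sketch of the objective (not the actual tokenlearn code; shapes and hyperparameters are made up for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

# Each token gets a direction and a separate scalar norm; the mean-pooled
# sentence vector is regressed onto the base model vector with MSE.
vocab_size, dim = 32_000, 256
directions = torch.nn.Parameter(torch.randn(vocab_size, dim))
log_norms = torch.nn.Parameter(torch.zeros(vocab_size))
optimizer = torch.optim.Adam([directions, log_norms], lr=1e-3)

def embed(token_ids: torch.Tensor) -&gt; torch.Tensor:
    # Unit-normalize the directions, then weight each token by its own norm.
    unit = torch.nn.functional.normalize(directions[token_ids], dim=-1)
    weighted = unit * log_norms[token_ids].exp().unsqueeze(-1)
    return weighted.mean(dim=1)

def train_step(token_ids: torch.Tensor, targets: torch.Tensor) -&gt; float:
    loss = torch.nn.functional.mse_loss(embed(token_ids), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
&lt;/code&gt;&lt;/pre&gt;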
&lt;p&gt;Applying PCA to both the base model’s target embeddings and the output embeddings turns out to be extremely important. If this is not done, the knowledge distillation step does not work at all.&lt;/p&gt;
&lt;h2 id=&quot;differences-between-the-new-and-old-tokenlearn&quot;&gt;Differences between the new and old tokenlearn&lt;/h2&gt;
&lt;p&gt;In the old tokenlearn, we also applied a post-processing step, wherein we applied PCA over the learned weights, and then applied a &lt;a href=&quot;https://openreview.net/pdf?id=SyK00v5xx&quot;&gt;SIF-like&lt;/a&gt; transform to the embeddings. These steps are now no longer necessary.&lt;/p&gt;</content:encoded></item><item><title>New improvements to model2vec distillation</title><link>https://minish.ai/blog/2025-02-05-improvements/</link><guid isPermaLink="true">https://minish.ai/blog/2025-02-05-improvements/</guid><description>We’ve made a lot of improvements to Model2Vec since it came out, many of which target the baseline performance of our distillation process. In this post, we walk through each change and explain why it matters for making your models smaller and faster.

</description><pubDate>Wed, 05 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We’ve made a lot of improvements to &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt; since it came out, many of which target the baseline performance of our distillation process.&lt;/p&gt;
&lt;p&gt;This post details how the distillation process has changed over time, and how this has impacted baseline performance of model2vec models. Spoiler alert: if you distilled a model a couple of months ago, it can really pay off to update model2vec and re-run the distillation process.&lt;/p&gt;
&lt;h1 id=&quot;improvements&quot;&gt;Improvements&lt;/h1&gt;
&lt;p&gt;Here are the improvements, in order of their appearance. In the last section, we’ll contrast all of them, and show their impact on MTEB performance.&lt;/p&gt;
&lt;p&gt;For all experiments, we distill &lt;a href=&quot;https://huggingface.co/BAAI/bge-base-en-v1.5&quot;&gt;&lt;code dir=&quot;auto&quot;&gt;baai/bge-base-en-v1.5&lt;/code&gt;&lt;/a&gt; using the default parameters.&lt;/p&gt;
&lt;h2 id=&quot;basic&quot;&gt;Basic&lt;/h2&gt;
&lt;p&gt;As a reference, the basic operations we apply when distilling are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Token selection&lt;/em&gt;: we propagate all individual tokens through the model together with an EOS and BOS token, and then &lt;em&gt;select&lt;/em&gt; the middle token as the representation.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;PCA&lt;/em&gt;: apply PCA with a specific number of dimensions (256 for all models).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Zipf&lt;/em&gt;: we weight all individual tokens by estimating their frequency using &lt;a href=&quot;https://en.wikipedia.org/wiki/Zipf%27s_law&quot;&gt;Zipf’s law&lt;/a&gt;. The short of it is that we assume all tokens in the vocabulary are in &lt;em&gt;rank order&lt;/em&gt;, and that they follow a power law distribution.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We tried many variations on this theme, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Replacing PCA: we tried &lt;a href=&quot;https://en.wikipedia.org/wiki/Independent_component_analysis&quot;&gt;ICA&lt;/a&gt;, &lt;a href=&quot;https://umap-learn.readthedocs.io/en/latest/basic_usage.html&quot;&gt;umap&lt;/a&gt;, and &lt;a href=&quot;https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding&quot;&gt;T-SNE&lt;/a&gt;. All worked a lot worse.&lt;/li&gt;
&lt;li&gt;Using different propagation strategies: we tried not including BOS/EOS, either only BOS or only EOS, and pooling over the BOS token (i.e., &lt;code dir=&quot;auto&quot;&gt;[CLS]&lt;/code&gt; pooling).&lt;/li&gt;
&lt;li&gt;Using different weighting strategies, including TF-IDF.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of these really had the desired effect, but feel free to let us know if you come up with something else!&lt;/p&gt;
&lt;p&gt;The basic performance of our model with these strategies on MTEB is &lt;strong&gt;45.34&lt;/strong&gt;. We released this model as &lt;a href=&quot;https://huggingface.co/minishlab/M2V_base_output&quot;&gt;m2v_base_output&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;1-pooling&quot;&gt;1. Pooling&lt;/h2&gt;
&lt;p&gt;As a first change, we switched from selecting the middle token to mean pooling; that is, the representation of a token is the mean of the &lt;code dir=&quot;auto&quot;&gt;EOS token BOS&lt;/code&gt; sequence we pass forward through the network.&lt;/p&gt;
&lt;p&gt;In code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Before:
embedding = model([&quot;EOS token BOS&quot;])[:, 1]
# Now:
embedding = model([&quot;EOS token BOS&quot;]).mean(1)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also tried a variety of other pooling strategies, including selecting specific tokens, adding queries, and adding prompts.&lt;/p&gt;
&lt;p&gt;This raises the average score from &lt;strong&gt;45.34&lt;/strong&gt; to &lt;strong&gt;45.91&lt;/strong&gt;, but has a larger effect on models that don’t perform well to begin with, such as ModernBERT-based models.&lt;/p&gt;
&lt;h2 id=&quot;2-sif-weighting&quot;&gt;2. SIF weighting&lt;/h2&gt;
&lt;p&gt;Following this, we replaced the Zipf weighting with a strategy based on the well-known &lt;a href=&quot;https://openreview.net/pdf?id=SyK00v5xx&quot;&gt;SIF algorithm&lt;/a&gt;. In short, this algorithm creates a probability distribution over all tokens in the vocabulary, and downweights very frequent tokens, while upweighting very infrequent tokens. For weighting, it uses the following formula:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sif = alpha / (alpha + proba)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, &lt;code dir=&quot;auto&quot;&gt;proba&lt;/code&gt; is a vector of token probabilities and &lt;code dir=&quot;auto&quot;&gt;alpha&lt;/code&gt; is a smoothing constant. As before, we use Zipf’s law to estimate the token probabilities, because we don’t have access to the true ones. Applying this on top of the mean pooling raises the score from &lt;strong&gt;45.91&lt;/strong&gt; to &lt;strong&gt;47.40&lt;/strong&gt;.&lt;/p&gt;
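&lt;p&gt;Putting the two together, the weights can be computed like this (a small sketch; &lt;code dir=&quot;auto&quot;&gt;alpha=1e-3&lt;/code&gt; is the value from the SIF paper, used here purely for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# Zipf: assume tokens are in rank order, with power-law probabilities.
vocab_size = 32_000
ranks = np.arange(1, vocab_size + 1)
proba = (1.0 / ranks) / np.sum(1.0 / ranks)

# SIF: downweight frequent tokens, upweight infrequent ones.
alpha = 1e-3
sif_weights = alpha / (alpha + proba)
&lt;/code&gt;&lt;/pre&gt;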
&lt;h2 id=&quot;3-normalization&quot;&gt;3. Normalization&lt;/h2&gt;
&lt;p&gt;Normalization has been a part of model2vec from the very first version. This is a boolean flag that, when set to &lt;code dir=&quot;auto&quot;&gt;True&lt;/code&gt;, unit normalizes all output vectors. This is set to &lt;code dir=&quot;auto&quot;&gt;False&lt;/code&gt; by default, but this turns out to be a bad choice. Setting it to &lt;code dir=&quot;auto&quot;&gt;True&lt;/code&gt; has a significant positive effect, especially on retrieval and clustering, and raises the average score from &lt;strong&gt;47.40&lt;/strong&gt; to a whopping &lt;strong&gt;47.79&lt;/strong&gt;.&lt;/p&gt;
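&lt;p&gt;In code, this is just a flag when loading a model (a sketch; the exact argument name here is an assumption on our part, so check the model2vec docs):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from model2vec import StaticModel

# NOTE: the `normalize` argument name is an assumption; check the docs.
model = StaticModel.from_pretrained(&quot;minishlab/M2V_base_output&quot;, normalize=True)
embeddings = model.encode([&quot;some sentence&quot;])  # rows now have unit L2 norm
&lt;/code&gt;&lt;/pre&gt;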
&lt;h1 id=&quot;taking-stock&quot;&gt;Taking stock&lt;/h1&gt;
&lt;p&gt;If you want more details, you can find the full table below. As you can see, the improvements we found are general, in the sense that they improve performance for all tasks except PEARL. Anecdotally, this also seems to hold for other models we tried.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;m2v_base_output&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;+mean pooling&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;+sif&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;+norm&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Average (All)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.79&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.32&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.42&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.59&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Average (MTEB)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.34&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.91&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.4&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.79&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Classification&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.25&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.43&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;63.76&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;63.22&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Clustering&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.13&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;27.19&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.71&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PairClassification&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;75.23&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;75.22&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Reranking&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.73&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.29&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.29&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Retrieval&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.14&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;27.17&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;28.93&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;28.93&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;STS&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;69.31&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;70.89&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;70.89&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Summarization&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.2&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.45&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.32&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.35&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PEARL&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.02&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.22&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;53.88&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;52.73&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;WordSim&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.18&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.7&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.63&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;As you can see, adding the improvements increases the scores for distillations across all tasks, with PEARL being the notable exception.&lt;/p&gt;
&lt;h1 id=&quot;where-to-go-from-here&quot;&gt;Where to go from here?&lt;/h1&gt;
&lt;p&gt;One active area of improvement is to make it a lot easier to tune your model on a specific dataset, so that the model gains knowledge about the specific problem or language you’re trying to tackle. This will come up in a next release.&lt;/p&gt;
&lt;p&gt;As always, if you have questions, don’t hesitate to reach out!&lt;/p&gt;</content:encoded></item><item><title>ModernBERT support and why it doesn&apos;t work</title><link>https://minish.ai/blog/2025-01-29-modernbert/</link><guid isPermaLink="true">https://minish.ai/blog/2025-01-29-modernbert/</guid><description>Our newest shiny release is here! 0.3.8! This is a small release in the lead-up to a big one we’ll be releasing next week. See here for the release notes, and read on for details about ModernBERT compatibility (spoiler: it’s trickier than you’d think).

</description><pubDate>Wed, 29 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Our newest shiny release is here! 0.3.8! This is a small release in the lead-up to a big one we’ll be releasing next week. See &lt;a href=&quot;https://github.com/MinishLab/model2vec/releases/tag/v0.3.8&quot;&gt;here&lt;/a&gt; for the release notes.&lt;/p&gt;
&lt;p&gt;The biggest feature in this release is support for &lt;a href=&quot;https://huggingface.co/blog/modernbert&quot;&gt;ModernBERT&lt;/a&gt;! As the name implies, ModernBERT is a refresh of the venerable BERT model, trained on more data, with lots of nice tricks; harder, better, faster, stronger. Since its release at the end of last year, many embedders based on ModernBERT have appeared, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/nomic-ai/modernbert-embed-base&quot;&gt;nomic-ai/modernbert-embed-base&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/Alibaba-NLP/gte-modernbert-base&quot;&gt;alibaba-nlp/gte-modernbert-base&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And probably many more.&lt;/p&gt;
&lt;p&gt;We didn’t support ModernBERT out of the box because of a &lt;del&gt;bug&lt;/del&gt; design decision, which we fixed in this release. Frustratingly, however, distilling a very good ModernBERT model does not lead to a good model2vec model. This blog post details why we think that is the case: we give a bunch of numbers and some explanations.&lt;/p&gt;
&lt;h1 id=&quot;distilling-modernbert&quot;&gt;Distilling ModernBERT&lt;/h1&gt;
&lt;p&gt;As you probably know, a model2vec model is created by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Downloading an existing sentence transformer&lt;/li&gt;
&lt;li&gt;Embedding all tokens in the vocabulary (without context)&lt;/li&gt;
&lt;li&gt;Reducing the resulting embeddings in size using PCA&lt;/li&gt;
&lt;li&gt;Reweighting them using Zipf weighting&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As ModernBERT-based models have about 50k tokens in their tokenizer, this is also how many embeddings our model2vec model will have.&lt;/p&gt;
&lt;p&gt;So, we created a model2vec distill of both ModernBERT-based embedders above. We fully expected this to work well, because in previous experiments, we saw that BERT-based encoder models worked best for model2vec distillation.&lt;/p&gt;
&lt;p&gt;Here are the scores on a subset of &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;MTEB&lt;/a&gt; tasks, compared with a straight &lt;a href=&quot;https://huggingface.co/BAAI/bge-base-en-v1.5&quot;&gt;BAAI/bge-base-en-v1.5&lt;/a&gt; distill. Note that both &lt;code dir=&quot;auto&quot;&gt;gte-modernbert-base&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;nomic-ai/modernbert-embed-base&lt;/code&gt; outperform &lt;code dir=&quot;auto&quot;&gt;bge-base-en-v1.5&lt;/code&gt; on the MTEB leaderboard, so we expected a distilled model to also perform better.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;STS&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;WordSim&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Classification&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;bge-base-en-v1.5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;69.3&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.7&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;62.4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;gte-modernbert-base&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;66.5 (-2.8)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.6 (-24.1)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;60.4 (-2.0)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;modernbert-embed-base&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;65.1 (-4.2)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.1 (-23.6)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;59.4 (-3.0)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;As you can see, that’s not the case at all. &lt;code dir=&quot;auto&quot;&gt;bge-base-en-v1.5&lt;/code&gt; outperforms both ModernBERT-based distills on all tasks, and with a huuuuuge margin on &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt;. Luckily for us, the &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt; task provides us with a good reason for why this is the case.&lt;/p&gt;
&lt;h2 id=&quot;wordsim&quot;&gt;WordSim&lt;/h2&gt;
&lt;p&gt;First, let’s talk about WordSim! WordSim is a very simple Semantic Textual Similarity task, comprising 7 datasets, in which the cosine similarity between the embeddings of single words is correlated with human judgments of similarity.&lt;/p&gt;
&lt;p&gt;For example, if &lt;code dir=&quot;auto&quot;&gt;apple&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;pear&lt;/code&gt; are judged to be similar by humans, your model must give them a high cosine similarity in order to score high on this task.&lt;/p&gt;
&lt;p&gt;This task is interesting to us because it provides us with an estimate of how good a model2vec model is at modeling lexical similarity without having access to any context. We also see that, for model2vec models, performing well on &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt; correlates with performance on other tasks.&lt;/p&gt;
&lt;p&gt;What is interesting about the performance of ModernBERT on &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt; is that it is atrociously low, lower than any model we’ve seen before, and that it does not seem to correlate at all with performance on other tasks, on which it scores lower, but not atrociously low.&lt;/p&gt;
&lt;p&gt;But why could this be the case, and why would it hold for both models? Because it seems to hurt both models equally, it looks like something in the base model is to blame.&lt;/p&gt;
&lt;p&gt;In our view, the answer is likely to be the tokenizer used in ModernBERT. ModernBERT’s tokenizer, unlike the traditional &lt;code dir=&quot;auto&quot;&gt;BERT&lt;/code&gt; tokenizer, which is used in a lot of embedders, is a byte-pair encoding (BPE) tokenizer. To see what this means, let’s take a look at five random BPE tokens from ModernBERT’s tokenizer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Ġnickel
ercul
tar
^),
encephal
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can probably see, these tokens are not very likely to be informative by themselves: we can’t just embed &lt;code dir=&quot;auto&quot;&gt;ercul&lt;/code&gt; and expect something useful. In contrast, here’s five tokens from the &lt;code dir=&quot;auto&quot;&gt;WordPiece&lt;/code&gt;-based BERT tokenizer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;lastly
##ect
electro
defendants
ventured
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, the &lt;code dir=&quot;auto&quot;&gt;WordPiece&lt;/code&gt; tokenizer has tokens that are more easily interpreted as words. Because BPE tokens are less likely to be words or naturally occurring suffixes, the model likely has to perform more operations to contextualize words, making it a bad fit for uncontextualized embeddings, such as model2vec models.&lt;/p&gt;
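&lt;p&gt;You can reproduce this comparison yourself (a small sketch using Hugging Face tokenizers; we use the &lt;code dir=&quot;auto&quot;&gt;answerdotai/ModernBERT-base&lt;/code&gt; checkpoint here):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random

from transformers import AutoTokenizer

# Sample a few vocabulary items from a BPE tokenizer (ModernBERT) and a
# WordPiece tokenizer (BERT) to see the difference in token shape.
for name in [&quot;answerdotai/ModernBERT-base&quot;, &quot;bert-base-uncased&quot;]:
    vocab = list(AutoTokenizer.from_pretrained(name).get_vocab())
    print(name, random.sample(vocab, 5))
&lt;/code&gt;&lt;/pre&gt;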
&lt;p&gt;In addition: the tokenizer used in ModernBERT is a &lt;em&gt;cased&lt;/em&gt; tokenizer, which means that it contains both upper- and lowercase tokens. But, again, without any contextual cues, there is very little difference between upper- and lowercase tokens.&lt;/p&gt;
&lt;p&gt;We think that both of these factors combined, but especially the BPE tokens, lead to low performance of the distilled model. The fact that both of the ModernBERT based models suffer from the same issue shows that the issue is likely caused by the base model, and not the specific fine-tuning strategy used.&lt;/p&gt;
&lt;h2 id=&quot;fixes-we-tried&quot;&gt;Fixes we tried&lt;/h2&gt;
&lt;p&gt;Of course, we realize you might be skeptical after reading this, so here’s some things we tried:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using CLS pooling&lt;/li&gt;
&lt;li&gt;Using mean pooling&lt;/li&gt;
&lt;li&gt;Pooling by selecting the wordpiece&lt;/li&gt;
&lt;li&gt;Reversing the order of the BOS/EOS tokens&lt;/li&gt;
&lt;li&gt;Not applying PCA&lt;/li&gt;
&lt;li&gt;Not applying Zipf&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And all combinations of the above.&lt;/p&gt;
&lt;h2 id=&quot;future-work&quot;&gt;Future work&lt;/h2&gt;
&lt;p&gt;Support for mutating BPE tokenizers in model2vec is lacking: we don’t allow vocabulary changes for BPE tokenizers, but we do allow it for WordPiece tokenizers.&lt;/p&gt;
&lt;p&gt;If token removal were allowed, we could test whether the casing affects performance. If adding tokens to the tokenizer were allowed, we could see whether adding the words in &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt; would improve performance.&lt;/p&gt;
&lt;p&gt;So one thing on our roadmap, but a very low priority one, is to add support for token addition and/or removal to model2vec. If you have an idea on how to do it, please let us know!&lt;/p&gt;</content:encoded></item><item><title>semhash: deduplication and dataset multitool</title><link>https://minish.ai/blog/2025-01-12-semhash-blogpost/</link><guid isPermaLink="true">https://minish.ai/blog/2025-01-12-semhash-blogpost/</guid><description>We’re super excited to announce the release of semhash, our semantic deduplication and dataset multitool (other features coming soon).

</description><pubDate>Sun, 12 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We’re super excited to announce the release of &lt;a href=&quot;https://github.com/MinishLab/semhash&quot;&gt;semhash&lt;/a&gt;, our semantic deduplication and dataset multitool (other features coming soon).&lt;/p&gt;
&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;
&lt;p&gt;One area of recent interest, especially around training Large Language Models (LLMs), is that having a lot of data is great, but having a little less &lt;em&gt;high quality&lt;/em&gt; data is even better. A good example of this can be found in the &lt;a href=&quot;https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1&quot;&gt;fineweb blogpost&lt;/a&gt;, where the authors start from a really big set of Common Crawl dumps, on which they perform deduplication and a suite of quality checks.&lt;/p&gt;
&lt;p&gt;At Minish, we’re interested in unlocking new possibilities by making very fast models. As you may know, we created the best smallest fast model in the world, &lt;a href=&quot;https://huggingface.co/minishlab/potion-base-8M&quot;&gt;potion-base-8m&lt;/a&gt;. One of the areas we are interested in is &lt;code dir=&quot;auto&quot;&gt;approximate deduplication&lt;/code&gt;: we want to remove documents that are semantically very similar from a corpus. Previous text deduplication algorithms, like minhash or simhash, operate on character or word ngrams, and therefore only find similarity between sequences that are orthographically similar, and ignore semantic similarity.&lt;/p&gt;
&lt;p&gt;While deduplication sounds like something that can only benefit LLM training, it can also be really beneficial to check small datasets for overlap: having even approximate overlap between train and test leads to performance overestimation, and having approximate duplicates in train leads to wasted compute, overestimation of feature importance, and a potential host of other issues.&lt;/p&gt;
&lt;p&gt;Additionally, deduplication techniques can also be used to give you a bird’s eye view of larger datasets: checking approximate duplicates using &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; takes (milli)seconds, and allows you to see which items from your dataset look alike. If these make sense: great! If there are no duplicates… also great! Everything is better than training on incorrect data.&lt;/p&gt;
&lt;h1 id=&quot;how-can-i-use-deduplication&quot;&gt;How can I use deduplication?&lt;/h1&gt;
&lt;p&gt;Here’s some cool use-cases to give you an idea on when deduplication makes sense:&lt;/p&gt;
&lt;h2 id=&quot;classification&quot;&gt;Classification&lt;/h2&gt;
&lt;p&gt;As mentioned above, it is important that there is no overlap in information between your train and test splits. Having overlap generally means that you overestimate performance, because the model no longer needs to generalize to perform well. Removing duplicates from within the train set, however, can also be very useful. Having a large number of duplicates of the same record in the training set makes the model overestimate the importance of the features of that record, and, in any case, leads to wasted compute and an overestimation of model fit.&lt;/p&gt;
&lt;h2 id=&quot;rag-systems&quot;&gt;RAG systems&lt;/h2&gt;
&lt;p&gt;Duplicates in RAG systems sound like something rare, until you consider that most RAG systems are built using chunks: while having completely duplicated documents will probably be rare, having duplicate chunks across documents or within documents is a lot more common. Having duplicate chunks in your knowledge base increases storage costs, increases the risk of retrieving irrelevant chunks, and forces you to implement diversification strategies much sooner than necessary.&lt;/p&gt;
&lt;h2 id=&quot;explain-your-corpus&quot;&gt;Explain your corpus&lt;/h2&gt;
&lt;p&gt;By running &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; with a low threshold, you can quickly get an overview of which documents are similar to others, and which aren’t. This gives you a good idea of what to focus on, what kind of things are missing from your data, and how your documents relate to one another.&lt;/p&gt;
&lt;h1 id=&quot;how-does-it-work&quot;&gt;How does it work?&lt;/h1&gt;
&lt;p&gt;At its core, &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; takes as input a collection of strings or dictionaries. You first initialize a model using a set of reference documents, and then use this set of documents to deduplicate an incoming set. Any incoming document that is similar to a document from the reference set is removed, and stored separately with its approximate duplicates from the reference set.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datasets import load_dataset

from semhash import SemHash

dataset = load_dataset(&quot;ag_news&quot;)
train = dataset[&quot;train&quot;]
test = dataset[&quot;test&quot;]

# This creates an index over your train set. All records are stored in their entirety.
semhash = SemHash.from_records(records=train, columns=[&quot;text&quot;])
# This deduplicates your texts with reference to `train`. Any items occurring in train are
# removed from test.
result = semhash.deduplicate(test, threshold=0.9)

# Set without duplicates
result.deduplicated

# Duplicates
result.duplicates
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;During fitting, all documents are first encoded by an encoder. The default encoder is &lt;a href=&quot;https://huggingface.co/minishlab/potion-base-8M&quot;&gt;potion-base-8m&lt;/a&gt;, a &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt; model. The documents are then stored in a &lt;a href=&quot;https://github.com/MinishLab/vicinity&quot;&gt;vicinity&lt;/a&gt; vector store, backed by &lt;a href=&quot;https://github.com/unum-cloud/usearch&quot;&gt;usearch&lt;/a&gt;. Then, for an incoming set of documents, we first encode them using the specified encoder, and then retrieve the nearest neighbors from the vector store. Every incoming document that has a nearest neighbor with a similarity above the threshold gets removed.&lt;/p&gt;
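&lt;p&gt;To make the mechanics concrete, here is a rough sketch of the same idea using &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt; and plain NumPy. This is illustrative only: the actual implementation uses the vicinity/usearch index rather than a brute-force similarity matrix.&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained(&quot;minishlab/potion-base-8M&quot;)

reference = [&quot;the cat sat on the mat&quot;, &quot;dogs make great pets&quot;]
incoming = [&quot;a cat was sitting on a mat&quot;, &quot;quantum computing is hard&quot;]

# Encode and L2-normalize, so that a dot product equals cosine similarity.
ref_emb = model.encode(reference)
inc_emb = model.encode(incoming)
ref_emb /= np.linalg.norm(ref_emb, axis=1, keepdims=True)
inc_emb /= np.linalg.norm(inc_emb, axis=1, keepdims=True)

# For each incoming document, the similarity to its nearest reference document.
best = (inc_emb @ ref_emb.T).max(axis=1)

threshold = 0.9
deduplicated = [doc for doc, sim in zip(incoming, best) if sim &lt; threshold]
duplicates = [doc for doc, sim in zip(incoming, best) if sim &gt;= threshold]
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;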
&lt;p&gt;Because all of these components are very fast, deduplicating even really large datasets only takes minutes. For example, deduplicating the entire &lt;a href=&quot;https://huggingface.co/datasets/rajpurkar/squad_v2&quot;&gt;Squad-2.0 dataset&lt;/a&gt;, which has 130,000 samples, takes only 7 seconds. This includes vectorization, fitting the index, and the actual deduplication. Smaller datasets take only a fraction of this time, while even datasets containing millions of documents take only minutes. For a comprehensive benchmark, see &lt;a href=&quot;https://github.com/MinishLab/semhash?tab=readme-ov-file#benchmarks&quot;&gt;our benchmarks&lt;/a&gt;.&lt;/p&gt;
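&lt;p&gt;If you want a rough timing on your own machine (numbers will vary with hardware), something like the following reproduces the setup:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import time

from datasets import load_dataset

from semhash import SemHash

train = load_dataset(&quot;rajpurkar/squad_v2&quot;)[&quot;train&quot;]

start = time.perf_counter()
semhash = SemHash.from_records(records=train, columns=[&quot;context&quot;, &quot;question&quot;])
result = semhash.self_deduplicate(threshold=0.9)
print(f&quot;took {time.perf_counter() - start:.1f}s, duplicate ratio: {result.duplicate_ratio:.2%}&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;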
&lt;h2 id=&quot;explainability&quot;&gt;Explainability&lt;/h2&gt;
&lt;p&gt;&lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; can also be used to investigate your dataset. By using &lt;code dir=&quot;auto&quot;&gt;self_deduplicate&lt;/code&gt;, you can deduplicate the training set itself, which we will use as a jumping-off point:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; datasets &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; load_dataset&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; semhash &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; SemHash&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;dataset &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;load_dataset&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;ag_news&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;train &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; dataset[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;train&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;test &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; dataset[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;test&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# This creates an index over your train set. All records are stored in their entirety.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;semhash &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; SemHash.&lt;/span&gt;&lt;span&gt;from_records&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;records&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;train&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;columns&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;text&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; semhash.&lt;/span&gt;&lt;span&gt;self_deduplicate&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;threshold&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;0.9&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Let’s dive into what you can do with the &lt;code dir=&quot;auto&quot;&gt;result&lt;/code&gt;. First off, you can just get all deduplicated records:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result.deduplicated&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;These records are exactly the records you put in, allowing you to use &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; within other ML pipelines. &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; doesn’t change your data; it just reduces it in size.&lt;/p&gt;
&lt;p&gt;You can easily see the proportion of records that were duplicates:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result.duplicate_ratio&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;or exact duplicates:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result.exact_duplicate_ratio&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;You can also see what got marked as a duplicate, and &lt;em&gt;why&lt;/em&gt;. Each duplicated document is stored together with the examples from the index that caused it to be marked as a duplicate, and exact matches are flagged explicitly. The following code example demonstrates basic usage:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;for&lt;/span&gt;&lt;span&gt; duplicated_record &lt;/span&gt;&lt;span&gt;in&lt;/span&gt;&lt;span&gt; result.duplicates:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;duplicated_record.record&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; duplicated_record.exact:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Exact match&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;continue&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;for&lt;/span&gt;&lt;span&gt; index_duplicate &lt;/span&gt;&lt;span&gt;in&lt;/span&gt;&lt;span&gt; duplicated_record.duplicates:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;index_duplicate&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;-&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;*&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;25&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;For ease of use, we also provide a helper function that shows you the &lt;em&gt;least&lt;/em&gt; similar deduplication record in your set of duplicates:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result.&lt;/span&gt;&lt;span&gt;get_least_similar_from_duplicates&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;1&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;If this record still plausibly counts as a duplicate of the record it matched, your deduplication strategy makes sense! If it doesn’t, you can choose to re-threshold your result set: applying a stricter (higher) threshold means fewer records count as duplicates. It works as follows:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;result.duplicate_ratio&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result.&lt;/span&gt;&lt;span&gt;rethreshold&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;0.95&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;result.duplicate_ratio&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;So, a general strategy could be to start with a relatively low threshold and re-threshold upwards until the results returned by &lt;code dir=&quot;auto&quot;&gt;result.get_least_similar_from_duplicates&lt;/code&gt; start making sense. In our experiments, a threshold of 0.9, which is the default, works fine, but be sure to check for your individual use cases.&lt;/p&gt;
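&lt;p&gt;A sketch of that strategy, using only the calls shown above (the specific threshold grid is just an example):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;# Start low, then re-threshold upwards and inspect the weakest duplicate each time.
result = semhash.self_deduplicate(threshold=0.7)
for threshold in (0.8, 0.85, 0.9, 0.95):
    result.rethreshold(threshold)
    print(threshold, result.duplicate_ratio)
    print(result.get_least_similar_from_duplicates(1))
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;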
&lt;h1 id=&quot;multi-column-data&quot;&gt;Multi-column data&lt;/h1&gt;
&lt;p&gt;&lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; also supports multi-column datasets, allowing you to deduplicate datasets that have text in multiple columns. For example, in QA datasets, you don’t just want to deduplicate on similar questions or similar contexts; you only want to count items as duplicates if both fields are sufficiently similar.&lt;/p&gt;
&lt;p&gt;This is a difficult problem to tackle, but &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; can also handle this.&lt;/p&gt;
&lt;p&gt;The following snippet demonstrates how this works:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; datasets &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; load_dataset&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; semhash &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; SemHash&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;dataset &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;load_dataset&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;rajpurkar/squad_v2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;train &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; dataset[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;train&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# This creates an index over your train set. All records are stored in their entirety.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;semhash &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; SemHash.&lt;/span&gt;&lt;span&gt;from_records&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;records&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;train&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;columns&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;context&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;question&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; semhash.&lt;/span&gt;&lt;span&gt;self_deduplicate&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;threshold&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;0.9&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This computes similarities per column, and only marks records as duplicates for which both fields are sufficiently similar.&lt;/p&gt;
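&lt;p&gt;Conceptually, a record pair only counts as a duplicate when every column clears the threshold. In toy form (a simplified illustration, not the actual internals):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;# Simplified decision rule: both columns must clear the threshold.
sim_context, sim_question = 0.97, 0.85
threshold = 0.9
is_duplicate = min(sim_context, sim_question) &gt;= threshold  # False: the questions differ too much
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;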
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Semhash is great! &lt;a href=&quot;https://github.com/MinishLab/semhash&quot;&gt;Get semhash here&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title>POTION: bag of tricks leads to better models</title><link>https://minish.ai/blog/2024-10-29-tokenlearn-blogpost/</link><guid isPermaLink="true">https://minish.ai/blog/2024-10-29-tokenlearn-blogpost/</guid><description>This blogpost describes the Tokenlearn method, which is a method to pre‐train Model2Vec models.

</description><pubDate>Tue, 29 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This blogpost describes the &lt;a href=&quot;https://github.com/MinishLab/tokenlearn&quot;&gt;Tokenlearn&lt;/a&gt; method, which is a method to pre-train Model2Vec models.&lt;/p&gt;
&lt;p&gt;We’ve been brewing, concocting, distilling, and came up with a new distillation technique that leads to much better models, which we are now releasing under the name POTION. We open source all models, code, and data.&lt;/p&gt;
&lt;p&gt;We’re releasing three versions: a 64-dim (1.9M params), 128-dim (3.8M params), and 256-dim (7.6M params) model, all based on the same base model, which is, in turn, a bge-base distillation. All POTION models outperform all previous distillations in their size class, and should be considered drop-in replacements for our M2V_base_output model. potion-base-8M, in particular, even improves over our largest model, M2V_base_glove. potion-base-8M is better than any set of static embeddings we could find on any task, including GloVe, fastText, and specialized word embeddings.&lt;/p&gt;
&lt;p&gt;Get them here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/MinishLab/potion-base-8M&quot;&gt;potion-base-8M&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/MinishLab/potion-base-4M&quot;&gt;potion-base-4M&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/MinishLab/potion-base-2M&quot;&gt;potion-base-2M&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Tokenlearn code can be found &lt;a href=&quot;https://github.com/MinishLab/tokenlearn&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The rest of the post will detail how we made the models, how they perform, and further improvements we have in store.&lt;/p&gt;
&lt;h2 id=&quot;distillation&quot;&gt;Distillation&lt;/h2&gt;
&lt;p&gt;In our regular &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt; framework we distill sentence transformers down to really fast tiny models by doing a forward pass for all tokens separately. We then perform Principal Component Analysis (PCA) on the resulting embeddings, and weight the individual embeddings via Zipf’s law. See our previous blog post &lt;a href=&quot;https://huggingface.co/blog/Pringled/model2vec&quot;&gt;here&lt;/a&gt;. The new distillation framework is composed of 4 steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Model2Vec distillation&lt;/li&gt;
&lt;li&gt;Sentence transformer inference&lt;/li&gt;
&lt;li&gt;Training&lt;/li&gt;
&lt;li&gt;Post-training regularization&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These four steps take a bit longer than the previous distillation framework. If you are looking for a quick way to get a model2vec model, distillation is still your best bet. If you are looking for maximum performance, read on!&lt;/p&gt;
&lt;h3 id=&quot;1-distillation&quot;&gt;1. Distillation&lt;/h3&gt;
&lt;p&gt;We start from a distilled model. In our case, we are using the M2V_base_output model as our starting point.&lt;/p&gt;
&lt;h3 id=&quot;2-sentence-transformer-inference&quot;&gt;2. Sentence transformer inference&lt;/h3&gt;
&lt;p&gt;We then go back to the original big sentence transformer, and use that transformer to create ~1M embeddings on an in-domain corpus, which for us is &lt;a href=&quot;https://huggingface.co/datasets/allenai/c4&quot;&gt;C4&lt;/a&gt;. We then throw away the sentence transformer, never to see it again. Forget it existed.&lt;/p&gt;
&lt;h3 id=&quot;3-training&quot;&gt;3. Training&lt;/h3&gt;
&lt;p&gt;So, we now have a base model, 1M texts, and 1M vector representations of those texts. We then train the base model to minimize the cosine distance between the representations it produces and the representations we produced before. In doing so, our model learns to better mimic the representations made by a large model. We also add a super heavy regularization term to the produced embeddings.&lt;/p&gt;
&lt;p&gt;During training, we apply a few standard methods to improve performance, such as reducing the learning rate on plateau, and early stopping.&lt;/p&gt;
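&lt;p&gt;In pseudo-PyTorch, the objective looks roughly like this (a sketch: the regularization term below is a plain L2 penalty standing in for the actual regularizer):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def tokenlearn_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor, reg_weight: float = 1.0) -&gt; torch.Tensor:
    # Minimize the cosine distance between student and teacher sentence representations.
    cosine_loss = (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
    # Stand-in for the heavy regularization term on the produced embeddings.
    reg = student_emb.pow(2).mean()
    return cosine_loss + reg_weight * reg
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;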
&lt;h3 id=&quot;4-post-training-re-regularization&quot;&gt;4. Post-training re-regularization&lt;/h3&gt;
&lt;p&gt;Finally, after training, we &lt;em&gt;re-regularize&lt;/em&gt; our models by performing PCA, and by manually re-weighting individual tokens.&lt;/p&gt;
&lt;p&gt;Of note here is the manual re-weighting, which is very similar to the Zipf weighting we use, but now relies on external data. Before, we assumed that all tokens were in rank order, and simply weighted them as follows:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;w &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;log&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;1&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;/&lt;/span&gt;&lt;span&gt; rank&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This works really well, as shown in &lt;a href=&quot;https://huggingface.co/blog/Pringled/model2vec&quot;&gt;our original blog post&lt;/a&gt;. Using actual frequencies, however, works even better. We use the same 1M documents on which we trained, and collect token probabilities for all tokens in our vocabulary. We then reweight using the following formula from the &lt;a href=&quot;https://openreview.net/pdf?id=SyK00v5xx&quot;&gt;SIF paper&lt;/a&gt;:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;w &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;1e-3&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;/&lt;/span&gt;&lt;span&gt; (&lt;/span&gt;&lt;span&gt;1e-3&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;+&lt;/span&gt;&lt;span&gt; proba)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;where &lt;code dir=&quot;auto&quot;&gt;proba&lt;/code&gt; is the probability of the token in the corpus. While this does mean our new distillation method relies on some data, it is &lt;em&gt;worth it&lt;/em&gt;, as we will show below.&lt;/p&gt;
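&lt;p&gt;Applying this is a one-liner over the embedding matrix. A sketch (the random matrices are stand-ins for the real embeddings and corpus counts):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import numpy as np

# Stand-ins: `embeddings` is the (vocab_size x dim) matrix, `counts` the corpus
# frequency of each vocabulary token, collected from the 1M training documents.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(32_000, 256))
counts = rng.integers(1, 1_000_000, size=32_000)

proba = counts / counts.sum()               # token probabilities
weights = 1e-3 / (1e-3 + proba)             # the SIF formula from above
embeddings = embeddings * weights[:, None]  # scale each token vector by its weight
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;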
&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
&lt;p&gt;Just like in our original experiments, we again evaluate on MTEB, as well as our two additional tasks (PEARL and WordSim). The results are shown in the table below.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Model&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Avg (All)&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Avg (MTEB)&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Class&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Clust&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;PairClass&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Rank&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Ret&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;STS&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Sum&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Pearl&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;WordSim&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;all-MiniLM-L6-v2&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;56.08&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;56.09&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;62.62&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;41.94&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;82.37&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;58.04&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;41.95&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;78.90&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.81&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;60.83&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.91&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;potion-base-8M&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.54&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.03&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;64.44&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;32.93&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;76.62&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.73&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;31.71&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;73.24&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.28&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;53.54&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.75&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove_subword&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.06&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.69&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.27&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.03&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.71&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.15&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;27.16&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;69.09&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.08&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;56.82&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;57.99&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;potion-base-4M&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.87&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.23&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;62.19&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;31.47&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;75.37&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.75&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.11&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;72.19&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;28.89&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;52.55&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.21&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.6&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.35&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.52&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;75.34&lt;/td&gt;&lt;td 
align=&quot;right&quot;&gt;48.5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.26&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;70.31&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;31.5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.28&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.29&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.79&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.34&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.25&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.14&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.2&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.02&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;potion-base-2M&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.52&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;44.77&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;58.45&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;27.5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;73.72&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.82&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;24.13&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;70.14&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;31.51&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.82&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;44.72&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;GloVe_300d&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;42.84&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;42.36&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;57.31&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;27.66&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;72.48&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.3&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;22.78&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;28.81&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.65&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.05&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;BPEmb_50k_300d&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;39.34&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;37.78&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.76&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;23.35&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;57.86&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.21&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;17.5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.1&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.74&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.56&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;41.28&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;As can be seen, potion-base-8M is the best model we have released so far (surpassing the 50% average MTEB score mark!), further pushing the limits of what is possible with static word embeddings. Furthermore, the 4M and 2M models still work quite well, with the 2M model outperforming GloVe while being ~55 times smaller.&lt;/p&gt;
&lt;p&gt;To show the relationship between speed and performance, we plot the average MTEB score against sentences per second. The circle sizes correspond to the number of parameters in the models (larger = more parameters).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://minish.ai/images/blog/post_tokenlearn/speed_vs_mteb_score_v2.png&quot; alt=&quot;SpeedvsAccuracy&quot;&gt;
&lt;em&gt;The average MTEB score plotted against sentences per second. The circle size indicates model size.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title>Model2Vec Introduction blogpost</title><link>https://minish.ai/blog/2024-10-14-hf-blogpost/</link><guid isPermaLink="true">https://minish.ai/blog/2024-10-14-hf-blogpost/</guid><description>This blog was first posted on the Hugging Face blog. We’re also posting it here for archival purposes.

</description><pubDate>Mon, 14 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This blog was first posted on the &lt;a href=&quot;https://huggingface.co/blog/Pringled/model2vec&quot;&gt;Hugging Face blog&lt;/a&gt;. We’re also posting it here for archival purposes.&lt;/p&gt;
&lt;h1 id=&quot;model2vec-distill-a-small-fast-model-from-any-sentence-transformer&quot;&gt;Model2Vec: Distill a Small Fast Model from any Sentence Transformer&lt;/h1&gt;
&lt;p&gt;(Large) language models have become the de facto standard for feature extraction. While these models have shown state-of-the-art performance on a &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;large number of tasks&lt;/a&gt; they also come with heavy resource requirements: large energy consumption, computational demands, and longer processing times. Although there are many ways in which you can make existing (Sentence) Transformers faster, e.g. quantization, or specialized kernels, they are still relatively slow, especially on CPU. What if you need to go faster and are working on a time-constrained product (e.g. a search engine), or have very little resources available?&lt;/p&gt;
&lt;p&gt;This is where &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;Model2Vec&lt;/a&gt; comes in — offering static embeddings that are hardware and eco-friendly while maintaining strong performance.&lt;/p&gt;
&lt;p&gt;In this blog, we will discuss what Model2Vec is, how it works, how you can use it, and its performance.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;https://minish.ai/images/blog/post_hf/ezlo_diagram_side.svg&quot; alt=&quot;Model2Vec&quot;&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;em&gt;Visualization of the Model2Vec architecture.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h3 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#what-is-model2vec&quot;&gt;What is model2vec?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#how-to-use-model2vec&quot;&gt;How to use model2vec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#results&quot;&gt;Results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#acknowledgements&quot;&gt;Acknowledgements&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;what-is-model2vec&quot;&gt;What is Model2Vec?&lt;/h3&gt;
&lt;p&gt;Model2Vec is a technique to distill a small, fast, high-performance static model from any Sentence Transformer. At a high level, it works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using Zipf weighting. No dataset is needed, just a model (and optionally, a vocabulary). During inference, we simply take the mean of all token embeddings occurring in a sentence. A Model2Vec model is therefore completely uncontextualized. While this may sound like a big downside, we’ll show that it still performs quite well considering how small and fast it is.&lt;/p&gt;
&lt;p&gt;The above might sound like a lot to you, so let’s unpack this a little.&lt;/p&gt;
&lt;h4 id=&quot;transformers-and-embeddings&quot;&gt;Transformers and embeddings&lt;/h4&gt;
&lt;p&gt;In a sentence transformer encoding step, a string is first chopped up into subword tokens. The embeddings of these tokens are then fed through the model, which contextualizes them to create high-quality sentence representations. At the output, you get as many embeddings as you put in, so if your input sentence consists of 10 tokens, you also get 10 output embeddings. These embeddings are then turned into a sentence representation by a pooling mechanism, which can either be a simple mean or a special pooler module.&lt;/p&gt;
&lt;p&gt;On to Model2Vec: the project first started as a kind of cache for sentence transformers. Because a transformer vocabulary typically only has about 32k tokens, a word like &lt;code dir=&quot;auto&quot;&gt;astoundingly&lt;/code&gt; gets chopped up into four unique tokens: &lt;code dir=&quot;auto&quot;&gt;&apos;as&apos;, &apos;##tou&apos;, &apos;##nding&apos;, &apos;##ly&apos;&lt;/code&gt;, which means that we re-compute the attention between those four tokens each time this word occurs. But the meaning of this word might not be ambiguous at all!&lt;/p&gt;
&lt;p&gt;However, as we started implementing this, we noticed that you actually do not need to cache any words at all, and you can just use the output representations of individual tokens to get good sentence representations. And this is exactly what the basic mode of operation of Model2Vec is: for each of the 32k input tokens in a sentence transformer vocabulary, we do a forward pass, and then store the resulting embedding. For a new sentence, we then just take the mean of the token embeddings we computed.&lt;/p&gt;
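&lt;p&gt;In code, this basic mode of operation is tiny. A sketch, assuming &lt;code dir=&quot;auto&quot;&gt;token_embeddings&lt;/code&gt; is the precomputed matrix and &lt;code dir=&quot;auto&quot;&gt;tokenizer&lt;/code&gt; a &lt;code dir=&quot;auto&quot;&gt;tokenizers&lt;/code&gt;-style tokenizer:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import numpy as np

def embed(sentence: str, tokenizer, token_embeddings: np.ndarray) -&gt; np.ndarray:
    # Look up the precomputed static vector for each subword token, then average.
    token_ids = tokenizer.encode(sentence, add_special_tokens=False).ids
    return token_embeddings[token_ids].mean(axis=0)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;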
&lt;p&gt;Note that the output token representations of a model2vec model are uncontextualized. Unlike with normal transformer models, there is no way for the model to give different meanings to the same token in different contexts. While this might seem like a huge downside, we think that the actual context provides models with enough disambiguation potential.&lt;/p&gt;
&lt;p&gt;In addition to this trick, we show that two additional tricks are necessary to get optimal performance.&lt;/p&gt;
&lt;h5 id=&quot;pca&quot;&gt;PCA&lt;/h5&gt;
&lt;p&gt;We reduce the dimensionality of the resulting token space by using Principal Component Analysis (PCA). Normally, using PCA is associated with a loss in performance, because you throw away information. However, in our case, reducing the dimensionality actually increased performance significantly. We think this is because PCA also normalizes the resulting space, in the sense of removing biases in the original vector space, thereby making it easier to learn from the vectors.&lt;/p&gt;
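&lt;p&gt;This step is ordinary PCA over the token embedding matrix; a sketch with scikit-learn (the random matrix is a stand-in for the real embeddings):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the 768-dimensional token embeddings from the forward passes.
token_embeddings = np.random.default_rng(0).normal(size=(32_000, 768))

pca = PCA(n_components=256)  # reduce to 256 dimensions (this also centers the space)
token_embeddings = pca.fit_transform(token_embeddings)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;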
&lt;h5 id=&quot;zipf&quot;&gt;Zipf&lt;/h5&gt;
&lt;p&gt;As we take a simple mean over tokens in the space, it is important that the vectors are weighted correctly. Normally, a sentence transformer would be there to correctly weight all the tokens for us given the context, but we don’t have that luxury any more. Intuitively, we would like to use something like Inverse Document Frequency (IDF) to down-weight very frequent or uninteresting words. But we don’t have access to a corpus over which to compute document frequencies.&lt;/p&gt;
&lt;p&gt;To overcome this, we opt to use a well-known principle from language sciences, which is that, given a frequency-ranked list, the frequency of the items in that list follows a power-law distribution. This is called Zipf’s law. So, if we assume that a vocabulary is ranked by frequency, we can accurately down-weight really frequent items without needing access to actual frequencies. As tokenizer vocabularies are sorted by frequency, we already have a ranked list, so this optimization can be applied without any additional work.&lt;/p&gt;
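&lt;p&gt;As a sketch (the exact weighting function is an implementation detail; plain &lt;code dir=&quot;auto&quot;&gt;1 / rank&lt;/code&gt; below is one simple instance of Zipf-style down-weighting):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import numpy as np

# Stand-in embedding matrix; vocabularies are (roughly) frequency-sorted,
# so the rank of a token is just its row index + 1.
token_embeddings = np.random.default_rng(0).normal(size=(32_000, 256))
ranks = np.arange(1, len(token_embeddings) + 1)
token_embeddings = token_embeddings * (1.0 / ranks)[:, None]
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;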
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;https://minish.ai/images/blog/post_hf/pca_zipf.svg&quot; alt=&quot;PCAZipf&quot;&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;em&gt;Visualization of the effects of applying PCA and Zipf weighting on the embeddings.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h3 id=&quot;usage&quot;&gt;Usage&lt;/h3&gt;
&lt;p&gt;The Model2Vec library has two broad modes of usage: &lt;strong&gt;distillation&lt;/strong&gt; and &lt;strong&gt;inference&lt;/strong&gt;. In distillation mode, you can distill your own model using any Sentence Transformer (and optionally your own vocabulary). In inference mode, you can use the distilled model (or use one of our pre-distilled models) to generate embeddings for your text data at extremely high speed.&lt;/p&gt;
&lt;p&gt;There are three ways to distill a model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Output&lt;/strong&gt;: behaves much like a real sentence transformer, i.e., it uses a subword tokenizer and simply encodes all wordpieces in its vocabulary. This is really quick to create (30 seconds on a CPU), very small (30 MB in float32), but might be less performant on some tasks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vocab (word)&lt;/strong&gt;: In this mode, you can pass your own vocabulary to create representations. This allows you to create good representations for whatever in-domain data you have, and is a drop-in replacement for GloVe or word2vec.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vocab (subword)&lt;/strong&gt;: In this mode, you can pass your own vocabulary, but it also uses the subword vocabulary to create representations. This allows you to create good representations for whatever in-domain data you have.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that, while vocabulary-based models are larger in terms of RAM, all models are equally fast, because our model is independent of vocabulary size.&lt;/p&gt;
&lt;p&gt;Model2Vec embeddings can be used in a wide variety of applications, such as text classification, clustering, building a search engine, or a RAG system. They are an especially good fit for applications that require fast, lightweight embeddings with low resource requirements.&lt;/p&gt;
&lt;p&gt;As we will show next, Model2Vec is very easy to use. It can either be used as a standalone package, or used directly in &lt;a href=&quot;https://github.com/UKPLab/sentence-transformers&quot;&gt;Sentence Transformers&lt;/a&gt;. This means you can easily integrate it into any pipeline that supports Sentence Transformers (e.g. LangChain and LlamaIndex). You can also train model2vec models directly using Sentence Transformers, keeping the fast inference speed, but optimizing them directly for your use case.&lt;/p&gt;
&lt;h3 id=&quot;how-to-use-model2vec&quot;&gt;How to use Model2Vec&lt;/h3&gt;
&lt;h4 id=&quot;installation&quot;&gt;Installation&lt;/h4&gt;
&lt;p&gt;Model2Vec can be installed using pip:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;pip&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;install&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;model2vec&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;h4 id=&quot;usage-1&quot;&gt;Usage&lt;/h4&gt;
&lt;h5 id=&quot;inference&quot;&gt;Inference&lt;/h5&gt;
&lt;p&gt;The easiest way to get started with Model2Vec is to download one of our flagship models from our &lt;a href=&quot;https://huggingface.co/minishlab&quot;&gt;HuggingFace hub&lt;/a&gt;. These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; model2vec &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; StaticModel&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Load a model from the HuggingFace hub (in this case the M2V_base_output model)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model_name &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;minishlab/M2V_base_output&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; StaticModel.&lt;/span&gt;&lt;span&gt;from_pretrained&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;model_name&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Make embeddings&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;embeddings &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; model.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s dangerous to go alone!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s a secret to everybody.&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Or distill your own models and directly use them:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; model2vec &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; distill&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Choose a Sentence Transformer model&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;base_model_name &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;BAAI/bge-base-en-v1.5&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Distill an output model with the chosen dimensions&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;distill&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;model_name&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;base_model_name&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;pca_dims&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;256&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Make embeddings&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;embeddings &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; model.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;supervillain Ganondorf has invaded Hyrule!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;model.tokenizer.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;supervillain Ganondorf has invaded Hyrule!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;add_special_tokens&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;False&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt;.tokens&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# [&apos;super&apos;, &apos;##vill&apos;, &apos;##ain&apos;, &apos;gan&apos;, &apos;##ond&apos;, &apos;##orf&apos;, &apos;has&apos;, &apos;invaded&apos;, &apos;h&apos;, &apos;##yr&apos;, &apos;##ule&apos;, &apos;!&apos;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# It looks like we split Ganondorf and Hyrule up into many subtokens&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# To solve this, we can add these words to our vocabulary.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;vocabulary &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;supervillain&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;ganondorf&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;hyrule&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Distill the model with the custom vocabulary.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;distill&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;model_name&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;base_model_name&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;vocabulary&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;vocabulary&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;pca_dims&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;256&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;model.tokenizer.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;supervillain Ganondorf has invaded Hyrule!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;add_special_tokens&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;False&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt;.tokens&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# [&apos;supervillain&apos;, &apos;ganondorf&apos;, &apos;has&apos;, &apos;invaded&apos;, &apos;hyrule&apos;, &apos;!&apos;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Much better.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Model2Vec is also directly supported in &lt;a href=&quot;https://github.com/UKPLab/sentence-transformers&quot;&gt;Sentence Transformers&lt;/a&gt;. To use Model2Vec in Sentence Transformers, you can initialize a &lt;code dir=&quot;auto&quot;&gt;StaticEmbedding&lt;/code&gt; class using &lt;code dir=&quot;auto&quot;&gt;from_model2vec&lt;/code&gt;. To directly distill in Sentence Transformers, the &lt;code dir=&quot;auto&quot;&gt;StaticEmbedding&lt;/code&gt; class can be initialized using &lt;code dir=&quot;auto&quot;&gt;from_distillation&lt;/code&gt;:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; sentence_transformers &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; SentenceTransformer&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; sentence_transformers.models &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; StaticEmbedding&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Initialize a StaticEmbedding module using a pre-trained model&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;static_embedding &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; StaticEmbedding.&lt;/span&gt;&lt;span&gt;from_model2vec&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;minishlab/M2V_base_output&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;SentenceTransformer&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;modules&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;static_embedding&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;embeddings &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; model.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s dangerous to go alone!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s a secret to everybody.&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Or distill your own directly without leaving sentence-transformers&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;static_embedding &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; StaticEmbedding.&lt;/span&gt;&lt;span&gt;from_distillation&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;BAAI/bge-base-en-v1.5&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;device&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;cpu&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;pca_dims&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;256&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;SentenceTransformer&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;modules&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;static_embedding&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;embeddings &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; model.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s dangerous to go alone!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s a secret to everybody.&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;h3 id=&quot;results&quot;&gt;Results&lt;/h3&gt;
&lt;p&gt;We evaluated Model2Vec on a large number of tasks and datasets. Model2Vec is evaluated on MTEB, as well as two additional tasks: &lt;a href=&quot;https://arxiv.org/pdf/2401.10407&quot;&gt;PEARL&lt;/a&gt; (a phrase representation task) and WordSim (a collection of word similarity tasks). The results are shown in the table below.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Model&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Avg (All)&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Avg (MTEB)&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Class&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Clust&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;PairClass&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Rank&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Ret&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;STS&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Sum&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Pearl&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;WordSim&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;all-MiniLM-L6-v2&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;56.08&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;56.09&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;62.62&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;41.94&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;82.37&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;58.04&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;41.95&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;78.90&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;30.81&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;60.83&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;49.91&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove_subword&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;49.06&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;46.69&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;61.27&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;30.03&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;74.71&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;49.15&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;27.16&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;69.09&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;30.08&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;56.82&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;57.99&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;48.58&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;47.60&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;61.35&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;30.52&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;75.34&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;48.50&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;29.26&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;70.31&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;31.50&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;50.28&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;54.29&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;46.79&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;45.34&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;61.25&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;25.58&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;74.90&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;47.63&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;26.14&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;68.58&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;29.20&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;54.02&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;49.18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;GloVe_300d&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;42.84&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;42.36&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;57.31&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;27.66&lt;/td&gt;&lt;td 
align=&quot;center&quot;&gt;72.48&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;43.30&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;22.78&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;61.90&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;28.81&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;45.65&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;43.05&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;BPEmb_50k_300d&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;39.34&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;37.78&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;55.76&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;23.35&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;57.86&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;43.21&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;17.50&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;55.10&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;29.74&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;47.56&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;41.28&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;As can be seen, Model2Vec significantly outperforms GloVe and BPEmb on all tasks, and even outperforms MiniLM, which is a much slower model, on some tasks.&lt;/p&gt;
&lt;p&gt;In addition, we evaluated Model2Vec on a number of classification datasets that are not in MTEB, which we also used to benchmark the model’s speed. The results are shown in the table below.&lt;/p&gt;
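&lt;p&gt;For context, a throughput number like “sentences per second” can be measured with just a few lines of Python. The sketch below is illustrative rather than our exact benchmark harness: the model name is one of our published models, but the corpus and timing setup are stand-ins.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rough throughput check: encode a batch of sentences and report sentences/second.
# Illustrative sketch only; not the exact harness behind the table below.
import time

from model2vec import StaticModel

model = StaticModel.from_pretrained(&quot;minishlab/M2V_base_output&quot;)
sentences = [&quot;This movie was great!&quot;] * 10_000  # stand-in corpus

start = time.perf_counter()
model.encode(sentences)
elapsed = time.perf_counter() - start
print(f&quot;{len(sentences) / elapsed:.0f} sentences/second&quot;)
&lt;/code&gt;&lt;/pre&gt;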
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Model&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Average&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;SST2&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;IMDB&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;TREC&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;AG News&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;bge-base-en-v1.5&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;90.00&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;91.54&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;91.88&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;85.16&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;91.45&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;all-MiniLM-L6-v2&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;84.10&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;83.95&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;81.36&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;81.31&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;89.77&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;82.23&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;80.92&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;84.56&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;75.27&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;88.17&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove_subword&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;81.95&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;82.84&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;85.96&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;70.51&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;88.49&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;BPEmb_50k_300d&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;81.15&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;80.42&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;84.04&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;71.25&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;88.92&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;80.76&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;83.07&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;85.24&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;66.12&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;88.61&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;GloVe_300d&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;77.77&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;81.68&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;84.00&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;55.67&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;89.71&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Again, Model2Vec outperforms GloVe and BPEmb on all tasks, and even performs comparably to MiniLM.&lt;/p&gt;
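&lt;p&gt;Evaluations like these boil down to using the embeddings as features for a linear probe. As a minimal sketch (the scikit-learn classifier and toy data here are illustrative assumptions, not our exact evaluation setup):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Fit a linear classifier on top of static sentence embeddings.
# Toy data and classifier choice are illustrative, not our exact setup.
from model2vec import StaticModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = StaticModel.from_pretrained(&quot;minishlab/M2V_base_output&quot;)

train_texts = [&quot;a gripping, well-acted thriller&quot;, &quot;dull and lifeless&quot;]
train_labels = [1, 0]  # 1 = positive, 0 = negative
test_texts = [&quot;surprisingly fun to watch&quot;]
test_labels = [1]

clf = LogisticRegression().fit(model.encode(train_texts), train_labels)
predictions = clf.predict(model.encode(test_texts))
print(accuracy_score(test_labels, predictions))
&lt;/code&gt;&lt;/pre&gt;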
&lt;p&gt;The figure below shows the relationship between the number of sentences processed per second and the average classification score. The circle sizes correspond to the number of parameters in the models (larger = more parameters). The plot shows that the Model2Vec models are much faster than the other models while remaining competitive with all-MiniLM-L6-v2 in classification performance.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;https://minish.ai/images/blog/post_hf/speed_vs_accuracy.png&quot; alt=&quot;SpeedvsAccuracy&quot;&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;em&gt;The average accuracy over all classification datasets plotted against sentences per second. The circle size indicates model size.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h4 id=&quot;ablations&quot;&gt;Ablations&lt;/h4&gt;
&lt;p&gt;To better understand the factors contributing to Model2Vec’s performance, we conducted a set of ablation studies covering the model’s architecture and preprocessing. We examined the impact of PCA, Zipf weighting, and the use of Sentence Transformers versus regular transformer models. We also compared input embeddings against output embeddings, since it seems plausible that input embeddings should work well too. The results are shown in the table below; a short code sketch of the corresponding distillation options follows the findings.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Model&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Avg (All)&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Avg (MTEB)&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Class&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Clust&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;PairClass&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Rank&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Ret&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;STS&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Sum&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Pearl&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;WordSim&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.79&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.34&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.25&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.14&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.2&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.02&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output_nopca&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;44.04&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;42.31&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.42&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;20.15&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.21&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;44.67&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.25&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.87&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.85&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;51.02&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.96&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output_nozipf&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.61&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;41.52&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;60.44&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;21.62&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;72.15&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.57&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;20.35&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;62.71&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.66&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;52.28&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.17&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_input_nozipf_nopca&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;40.97&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;39.55&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.16&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;18.62&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.3&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.65&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;23.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;59.38&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;32.04&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.19&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;40.52&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output_nozipf_nopca&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;40.8&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;38.44&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;59.78&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;19.31&lt;/td&gt;&lt;td 
align=&quot;right&quot;&gt;62.39&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;42.26&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;19.01&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.16&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.09&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.97&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_input&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;40.74&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;39.93&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;60.35&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;22.66&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;59.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.02&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.47&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.05&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.35&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.61&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;34.47&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_bert_output_nozipf_nopca&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;35.54&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;34.82&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.69&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;15.42&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;58.68&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;39.87&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;12.92&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.24&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.15&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.72&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;There are four main findings in these results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Non-Sentence Transformers do not work well. This can be seen by comparing &lt;code dir=&quot;auto&quot;&gt;M2V_bert_output_nozipf_nopca&lt;/code&gt; (which uses &lt;a href=&quot;https://huggingface.co/google-bert/bert-base-uncased&quot;&gt;BERT&lt;/a&gt;, a non-Sentence Transformer) and &lt;code dir=&quot;auto&quot;&gt;M2V_base_output_nozipf_nopca&lt;/code&gt; (which uses &lt;a href=&quot;https://huggingface.co/BAAI/bge-base-en-v1.5&quot;&gt;BGE-base&lt;/a&gt;, a Sentence Transformer). Using a Sentence Transformer gives a ~5.2% increase in performance.&lt;/li&gt;
&lt;li&gt;PCA is crucial for performance. This can be seen by comparing &lt;code dir=&quot;auto&quot;&gt;M2V_base_output_nozipf_nopca&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;M2V_base_output_nozipf&lt;/code&gt; which gives a ~2.8% increase in performance. Furthermore, PCA improves performance on &lt;em&gt;all&lt;/em&gt; tasks.&lt;/li&gt;
&lt;li&gt;Zipf weighting is crucial for performance. This can be seen by comparing &lt;code dir=&quot;auto&quot;&gt;M2V_base_output_nozipf_nopca&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;M2V_base_output_nopca&lt;/code&gt; which gives a ~3.1% increase in performance.&lt;/li&gt;
&lt;li&gt;Output embeddings outperform input embeddings. This can be seen by comparing &lt;code dir=&quot;auto&quot;&gt;M2V_base_input&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;M2V_base_output&lt;/code&gt; which gives a ~6.1% increase in performance. Note that input embeddings do work well for some tasks. We hypothesize that this is because input embeddings are inherently normalized.&lt;/li&gt;
&lt;/ol&gt;
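&lt;p&gt;For concreteness, the ablation axes above map onto distillation options. The sketch below assumes the &lt;code dir=&quot;auto&quot;&gt;distill&lt;/code&gt; API with &lt;code dir=&quot;auto&quot;&gt;pca_dims&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;apply_zipf&lt;/code&gt; parameters as of this writing; check the repository for the current signature. Swapping the teacher (e.g. BERT instead of BGE-base) covers the Sentence-Transformer comparison.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Reproducing the ablation axes via distillation options (a sketch; parameter
# names are assumed from the distill API at the time of writing).
from model2vec.distill import distill

teacher = &quot;BAAI/bge-base-en-v1.5&quot;

# Baseline: PCA to 256 dimensions plus Zipf weighting.
m2v_base = distill(model_name=teacher, pca_dims=256, apply_zipf=True)

# &quot;nopca&quot;: keep the teacher's full output dimensionality.
m2v_nopca = distill(model_name=teacher, pca_dims=None, apply_zipf=True)

# &quot;nozipf&quot;: skip the frequency-based re-weighting of token vectors.
m2v_nozipf = distill(model_name=teacher, pca_dims=256, apply_zipf=False)

m2v_base.save_pretrained(&quot;m2v_base_output&quot;)
&lt;/code&gt;&lt;/pre&gt;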
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Thanks for reading our blog post on Model2Vec! We hope you found it informative and useful. If you have any questions or comments, please feel free to reach out to us. We are still actively working on the project, and have a number of features already planned, so stay tuned.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/minishlab&quot;&gt;HuggingFace Org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/collections/minishlab/model2vec-base-models-66fd9dd9b7c3b3c0f25ca90e&quot;&gt;HuggingFace Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.linkedin.com/company/minish-lab&quot;&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/MinishLab/model2vec/tree/main/tutorials&quot;&gt;Tutorials&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;citing&quot;&gt;Citing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;@software{minishlab2024model2vec,
  author = {Stephan Tulkens and Thomas van Dongen},
  title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec},
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h3&gt;
&lt;p&gt;We’d like to thank &lt;a href=&quot;https://huggingface.co/tomaarsen&quot;&gt;Tom Aarsen&lt;/a&gt; for integrating Model2Vec into &lt;a href=&quot;https://github.com/UKPLab/sentence-transformers&quot;&gt;Sentence Transformers&lt;/a&gt; and helping us with our &lt;a href=&quot;https://huggingface.co/minishlab&quot;&gt;HuggingFace&lt;/a&gt; integration, as well as his general feedback on the project.&lt;/p&gt;</content:encoded></item></channel></rss>