<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Minish | Blog</title><description>Fast open-source NLP models and packages</description><link>https://minish.ai/</link><language>en</language><item><title>Model2Vec Size Improvements</title><link>https://minish.ai/blog/2025-10-05-size-blogpost/</link><guid isPermaLink="true">https://minish.ai/blog/2025-10-05-size-blogpost/</guid><description>In this blogpost, we showcase the various size reduction techniques we implemented in Model2Vec, and how they can be combined to create tiny models (~6 MB) with minimal performance loss.

</description><pubDate>Sun, 05 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Over the past year, we’ve implemented several ways to reduce the size of Model2Vec models. Due to the nature of our distillation technique, Model2Vec distilled models are already relatively compact, but we can make them even smaller (~6 MB), as we will show in this blogpost.
This can be beneficial for deployment in resource-constrained environments such as edge and mobile devices, where memory and storage are limited.
It also means we can load models faster, and serve more models at the same time.&lt;/p&gt;
&lt;p&gt;Since all the parameters in a Model2Vec model are in the embedding matrix, we can reduce size in three ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;By reducing the dimensionality of the embeddings&lt;/li&gt;
&lt;li&gt;By reducing the precision of the embeddings (quantization)&lt;/li&gt;
&lt;li&gt;By reducing the number of embeddings (the vocabulary size)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With our latest release, we can now directly modify all of these in Model2Vec. Let’s go over them one by one!&lt;/p&gt;
&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;
&lt;p&gt;We use the following three techniques to reduce model size:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Principal Component Analysis (PCA) (available since our initial release)&lt;/li&gt;
&lt;li&gt;Quantization (available since v0.5.0)&lt;/li&gt;
&lt;li&gt;Vocabulary Quantization (our shiny new feature which we just released in v0.7.0)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;1-pca&quot;&gt;1. PCA&lt;/h3&gt;
&lt;p&gt;The first and most straightforward way to reduce model size is dimensionality reduction, which we do with PCA. Most embedding models operate at high dimensions (e.g. 768), which is a lot more than we (usually) need for static embedding models.&lt;/p&gt;
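&lt;p&gt;For reference, here’s a minimal sketch of what this looks like in code (the &lt;code dir=&quot;auto&quot;&gt;distill&lt;/code&gt; call follows the Model2Vec README; check the docs for the full signature):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from model2vec.distill import distill

# Distill a sentence transformer into a static model, reducing the
# embeddings to 256 dimensions with PCA along the way.
model = distill(model_name=&quot;BAAI/bge-base-en-v1.5&quot;, pca_dims=256)
model.save_pretrained(&quot;bge-base-256d&quot;)
&lt;/code&gt;&lt;/pre&gt;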
&lt;h3 id=&quot;2-quantization&quot;&gt;2. Quantization&lt;/h3&gt;
&lt;p&gt;Next up is quantization. By default, embeddings are stored as 32-bit floats. By quantizing them to 16-bit floats, or even 8-bit integers, we can cut storage requirements by 2x-4x.&lt;/p&gt;
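&lt;p&gt;Conceptually, INT8 quantization boils down to the following (an illustrative sketch, not the Model2Vec internals):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# A float32 embedding matrix of 30k tokens x 256 dims takes ~30 MB.
embeddings = np.random.randn(30_000, 256).astype(np.float32)

# Scale values into the int8 range, keeping the scale for dequantization.
scale = np.abs(embeddings).max() / 127.0
quantized = np.round(embeddings / scale).astype(np.int8)  # ~7.5 MB, 4x smaller

# At load time, we can approximately recover the original vectors.
dequantized = quantized.astype(np.float32) * scale
&lt;/code&gt;&lt;/pre&gt;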
&lt;h3 id=&quot;3-vocabulary-quantization&quot;&gt;3. Vocabulary quantization&lt;/h3&gt;
&lt;p&gt;Finally, we can modify the vocabulary itself. Large vocabularies are expensive: every token needs its own vector. But many tokens are rare, and some are near-duplicates. With vocabulary quantization, we cluster embeddings using k-means and merge them, effectively compressing the vocabulary without throwing away coverage.&lt;/p&gt;
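&lt;p&gt;The core idea looks roughly like this (an illustrative sketch using scikit-learn, not the Model2Vec internals):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Cluster the token embeddings and map every token to its centroid, so we
# only store the centroids plus a small per-token cluster index.
embeddings = np.random.randn(30_000, 256).astype(np.float32)
kmeans = MiniBatchKMeans(n_clusters=20_000).fit(embeddings)

centroids = kmeans.cluster_centers_  # the compressed embedding matrix
token_to_cluster = kmeans.labels_    # lookup from token id to centroid row
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tokens that land in the same cluster now share a vector, which is exactly the merging of near-duplicates described above.&lt;/p&gt;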
&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
&lt;p&gt;Here’s how the different strategies stack up. For these experiments, we start with a distilled &lt;a href=&quot;https://huggingface.co/BAAI/bge-base-en-v1.5&quot;&gt;bge-base-en-v1.5&lt;/a&gt; model using default parameters (baseline).&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;Size&lt;/th&gt;&lt;th&gt;Average (MTEB)&lt;/th&gt;&lt;th&gt;Drop vs. Baseline&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Baseline (768d, FP32)&lt;/td&gt;&lt;td&gt;92 MB&lt;/td&gt;&lt;td&gt;46.69&lt;/td&gt;&lt;td&gt;–&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;+ PCA (256d)&lt;/td&gt;&lt;td&gt;32 MB&lt;/td&gt;&lt;td&gt;46.63&lt;/td&gt;&lt;td&gt;-0.06&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;+ Quantization (INT8)&lt;/td&gt;&lt;td&gt;9 MB&lt;/td&gt;&lt;td&gt;46.60&lt;/td&gt;&lt;td&gt;-0.09&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;+ Vocab quantization (20k clusters)&lt;/td&gt;&lt;td&gt;6 MB&lt;/td&gt;&lt;td&gt;45.99&lt;/td&gt;&lt;td&gt;-0.70&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;As you can see, we can shrink a &lt;strong&gt;92 MB&lt;/strong&gt; model down to &lt;strong&gt;6 MB (15x smaller!)&lt;/strong&gt; while losing less than 1% performance on MTEB.
Another interesting observation is that PCA and quantization have a very small effect on performance, and can essentially be applied without any trade-offs.
Note that the base model’s vocabulary is already quite small (~30k tokens). We expect vocabulary quantization to have a bigger effect on models with larger vocabularies (e.g. multilingual models), which we will explore in future work.&lt;/p&gt;
&lt;p&gt;As always, we’d love to hear your feedback — let us know what you’re building with these tiny models, and if you want to try this yourself, grab the latest &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;Model2Vec release&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title>Model2Vec as a fasttext alternative</title><link>https://minish.ai/blog/2025-07-28-fasttext/</link><guid isPermaLink="true">https://minish.ai/blog/2025-07-28-fasttext/</guid><description>In this blogpost, we compare Model2Vec and fastText. We show that Model2Vec is faster, smaller, and more performant.

</description><pubDate>Mon, 28 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Model2Vec is typically viewed as a fast alternative to a sentence transformer. There are good reasons for that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Model2Vec models are distilled versions of sentence transformers&lt;/li&gt;
&lt;li&gt;Model2Vec models are drop-in replacements for sentence transformers&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Having said that, a better comparison would actually be Meta’s &lt;a href=&quot;https://fasttext.cc/&quot;&gt;fasttext&lt;/a&gt;. Like Model2Vec, fasttext can be used to create static vectors, and can also be used to create classifiers using a set of static vectors as a starting point. In this short blog post, we’ll show that off-the-shelf Model2Vec models are much better than fasttext classifiers and word vectors. So, if you’re currently using fasttext somewhere, consider a comparison to Model2Vec!&lt;/p&gt;
&lt;h1 id=&quot;classification&quot;&gt;Classification&lt;/h1&gt;
&lt;p&gt;To test the classification efficacy of Model2Vec in comparison to fasttext, we ran experiments on 15 datasets from the &lt;a href=&quot;https://huggingface.co/SetFit&quot;&gt;setfit organization on Hugging Face&lt;/a&gt;. We initialized the fasttext classifier using the &lt;code dir=&quot;auto&quot;&gt;wiki-news-300d-1M-subword.vec&lt;/code&gt; vectors available &lt;a href=&quot;https://fasttext.cc/docs/en/english-vectors.html&quot;&gt;here&lt;/a&gt;, and used the nltk &lt;code dir=&quot;auto&quot;&gt;word_tokenize&lt;/code&gt; function as a tokenizer. The Model2Vec model was initialized from &lt;a href=&quot;https://huggingface.co/minishlab/potion-base-32M&quot;&gt;&lt;code dir=&quot;auto&quot;&gt;minishlab/potion-base-32M&lt;/code&gt;&lt;/a&gt;. All models were trained with sensible defaults. We optimized the tokenization for the fasttext model.&lt;/p&gt;
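&lt;p&gt;For a rough idea of the two setups, here’s a hedged sketch (data preparation is elided; &lt;code dir=&quot;auto&quot;&gt;train_texts&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;train_labels&lt;/code&gt; are assumed to be prepared lists):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import fasttext

from model2vec.train import StaticModelForClassification

# fasttext expects a file with one &quot;__label__X text&quot; line per sample.
ft_model = fasttext.train_supervised(input=&quot;train.txt&quot;)

# Model2Vec trains a classifier head on top of the static embeddings.
clf = StaticModelForClassification.from_pretrained(model_name=&quot;minishlab/potion-base-32M&quot;)
clf.fit(train_texts, train_labels)
&lt;/code&gt;&lt;/pre&gt;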
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p&gt;Here’s the full table for both approaches:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;dataset_name&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Model2Vec&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;fasttext&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;20_newsgroups&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;66.18&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;57.11&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;ade&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;88.05&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;86.14&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;ag_news&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;91.61&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;92.18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;amazon_counterfactual&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;80.64&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;82.58&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;bbc&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;96.93&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;95.86&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;emotion&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;79.7&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;79.35&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;enron_spam&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;98.85&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;98.85&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;hatespeech_offensive&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;70.94&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;69.48&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;imdb&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;87.79&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;89.51&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;massive_scenario&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;88.78&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;87.26&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;senteval_cr&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;79.12&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;76.2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;sst5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;41.49&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;19.59&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;student&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;93.76&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;93.12&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;subj&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;91.94&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;92.6&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;tweet_sentiment_extraction&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.14&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.98&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Average&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;81.99&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;79.25&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Model2Vec outperforms fasttext on average, although with smaller gains than we anticipated.&lt;/p&gt;
&lt;h2 id=&quot;training-time&quot;&gt;Training time&lt;/h2&gt;
&lt;p&gt;fasttext models train faster than Model2Vec models. Note that this only concerns the actual training of the supervised classifier; we don’t include any pretraining time. One observation is that Model2Vec models tend to train for a bit too long. That said, training takes less than a minute for either approach, so the difference hardly matters in practice.&lt;/p&gt;
&lt;h2 id=&quot;inference-time&quot;&gt;Inference time&lt;/h2&gt;
&lt;p&gt;Model2Vec processes about 14.6k samples per second, while fasttext processes about 3.6k.&lt;/p&gt;
&lt;p&gt;This comes with the caveat that the inference time of both approaches is difficult to compare, since almost all of the time for both models is actually spent in the tokenizer. Disabling any preprocessing for fasttext makes it faster than Model2Vec (3.6k -&gt; 25k (!) samples/second), albeit with a hit to performance (79.5 -&gt; 78.5 average score). This underscores one of the painful issues of older NLP approaches: preprocessing/tokenization matters a lot, and it is difficult to align your tokenization with models found online.&lt;/p&gt;
&lt;h2 id=&quot;model-size&quot;&gt;Model size&lt;/h2&gt;
&lt;p&gt;The trained Model2Vec model is only 130 MB on disk, while the fasttext model is substantially larger at 2.1 GB. Note, however, that both Model2Vec and fasttext can be compressed through quantization.&lt;/p&gt;
&lt;h1 id=&quot;zero-shot-mteb&quot;&gt;Zero shot (MTEB)&lt;/h1&gt;
&lt;p&gt;fasttext vectors can also be used as static word embeddings, much like word2vec. As such, we can test how well they work on the &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;Massive Text Embedding Benchmark (MTEB)&lt;/a&gt; as a zero-shot embedding approach, comparing directly with a Model2Vec model. Following the above, we use the nltk &lt;code dir=&quot;auto&quot;&gt;word_tokenize&lt;/code&gt; function to tokenize text going into fasttext, and also normalize all the output vectors to unit length. We perform no additional preprocessing for Model2Vec. Because running MTEB can take a long time, we don’t run all subsets. We use the &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;original MTEB benchmark&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
&lt;p&gt;It’s honestly not looking too good for fasttext. Model2Vec blows it out of the water on all tasks except WordSim, which is a set of word similarity tasks.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;fasttext&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Model2Vec&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Classification&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;51.97&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;65.97&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Clustering&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;22.25&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;35.29&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PairClassification&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.89&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;78.17&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Reranking&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;40.7&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.92&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;STS&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.2&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.22&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Summarization&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.41&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.78&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;WordSim&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;59.29&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.15&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The only task on which fasttext performs well is &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt;. This is interesting, because these kinds of lexical similarity datasets were popular around the time fasttext, GloVe, word2vec, and other static methods were initially created. So this could be one of the reasons these vectors work well: methods are developed with reference to the evaluation data that is available at the time of development.&lt;/p&gt;
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;If you’re still using fastText, be it for classification or for word embeddings, it’s probably time for an upgrade. Model2Vec offers smaller models, faster inference, and better downstream results in most cases. Give it a try, and benchmark for yourself!&lt;/p&gt;</content:encoded></item><item><title>Tokenlearn 0.2.0</title><link>https://minish.ai/blog/2025-05-31-tokenlearn-release/</link><guid isPermaLink="true">https://minish.ai/blog/2025-05-31-tokenlearn-release/</guid><description>We’ve released a new version of Tokenlearn! It contains usability improvements, fixes some bugs, and has a new learning algorithm under the hood that improves performance. Read on to see what it does and how you can use it.

</description><pubDate>Sat, 31 May 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We’ve released a new version of tokenlearn! It contains usability improvements, fixes some bugs, and has a new learning algorithm under the hood that improves performance. Read on to see what it does and how you can use it.&lt;/p&gt;
&lt;h2 id=&quot;why-use-tokenlearn&quot;&gt;Why use tokenlearn?&lt;/h2&gt;
&lt;p&gt;Tokenlearn is a way to improve a distilled Model2Vec model by performing an additional knowledge distillation step using the base model (the sentence transformer you distilled) and a distilled Model2Vec model. The Model2Vec model is trained to directly mimic the vectors produced by the base model, which leads to massive improvements. Notably, this does not require any labeled data.&lt;/p&gt;
&lt;p&gt;As an example: our new tokenlearn version was used to train our multilingual flagship model, &lt;a href=&quot;https://huggingface.co/minishlab/potion-multilingual-128M&quot;&gt;potion-multilingual-128M&lt;/a&gt;. This model performs at about the same level as &lt;a href=&quot;https://huggingface.co/sentence-transformers/static-similarity-mrl-multilingual-v1&quot;&gt;static-similarity-mrl-multilingual-v1&lt;/a&gt; (which we will call MRL). The main difference between the two is how they were trained: MRL has been trained on 8.5 million cross-lingually aligned sentence pairs, while potion-multilingual has only been trained on &lt;em&gt;2 million random C4 passages&lt;/em&gt;. This shows the power of tokenlearn! You can adapt any Model2Vec model to a specific domain with a small number of short documents, no annotations needed.&lt;/p&gt;
&lt;h2 id=&quot;how-does-it-work&quot;&gt;How does it work?&lt;/h2&gt;
&lt;p&gt;Before starting on what’s new, let’s first go into how you can use tokenlearn. First, you need to select a base model, i.e., a sentence transformer you like using, and a dataset from which you will sample passages. For this, you need to use the &lt;code dir=&quot;auto&quot;&gt;featurize&lt;/code&gt; function.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datasets import load_dataset
from sentence_transformers import SentenceTransformer

from tokenlearn.featurize import featurize

my_corpus = load_dataset(&quot;allenai/c4&quot;, &quot;en&quot;, split=&quot;train&quot;, streaming=True)
model = SentenceTransformer(&quot;baai/bge-base-en-v1.5&quot;)
output_dir = &quot;my_corpus_featurized&quot;

featurize(
    dataset=my_corpus,
    model=model,
    output_dir=output_dir,
    max_means=2_000_000,
    batch_size=32,
    text_key=&quot;text&quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Leave this running for a while, and you will get a set of documents and means in &lt;code dir=&quot;auto&quot;&gt;output_dir&lt;/code&gt;. Note that this script can be resumed: if the arguments are the same, the embedding computation will pick up where you left off. Now that you have the documents in this directory, you can fit a model on them.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pathlib import Path

from tokenlearn.train import train_model
from tokenlearn.utils import collect_means_and_texts

model_name = &quot;baai/bge-base-en-v1.5&quot;
data_dir = &quot;my_corpus_featurized&quot;
vocab_size = 250_000

# Collect paths for training data
paths = sorted(Path(data_dir).glob(&quot;*.json&quot;))
train_txt, train_vec = collect_means_and_texts(paths)

model = train_model(
    model_name,
    train_txt,
    train_vec,
    device=None,
    vocab_size=vocab_size,
    pca_dims=512
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this command will get you a trained potion-like model, specifically fit for your domain. Two relevant options to keep in mind are &lt;code dir=&quot;auto&quot;&gt;vocab_size&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;pca_dims&lt;/code&gt;. These control the number of rows and columns in your embedding matrix, respectively.&lt;/p&gt;
&lt;p&gt;In general, setting &lt;code dir=&quot;auto&quot;&gt;pca_dims&lt;/code&gt; to 256 or 512 should be good enough for most problems, and depends on the explained variance of your target vectors.&lt;/p&gt;
&lt;p&gt;Setting the &lt;code dir=&quot;auto&quot;&gt;vocab_size&lt;/code&gt; parameter is more complicated. If &lt;code dir=&quot;auto&quot;&gt;vocab_size&lt;/code&gt; is &gt; 0, we tokenize all texts before training, and select &lt;code dir=&quot;auto&quot;&gt;vocab_size&lt;/code&gt; words to add to the vocabulary of the distilled model based on their frequency. Whether this is useful really depends on the size of your training corpus, and how well it matches with your downstream task. If there’s a lot of lexical overlap between the two, you can see a large improvement in performance, although at significant memory costs, as each added vocabulary item adds a whole row to your embedding matrix. Even setting &lt;code dir=&quot;auto&quot;&gt;vocab_size&lt;/code&gt; to 0 will improve performance over a raw distill, however.&lt;/p&gt;
&lt;h2 id=&quot;what-does-it-do&quot;&gt;What does it do?&lt;/h2&gt;
&lt;p&gt;In short, &lt;code dir=&quot;auto&quot;&gt;tokenlearn&lt;/code&gt; training:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;distills a Model2Vec model for you from a base model&lt;/li&gt;
&lt;li&gt;adds vocabulary (if any) to the vocabulary of your model&lt;/li&gt;
&lt;li&gt;performs PCA on the target embeddings we made using the base model&lt;/li&gt;
&lt;li&gt;performs knowledge distillation by training the Model2Vec model to match the target embeddings&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The knowledge distillation step is extremely simple: we minimize the Mean Squared Error (MSE) between the output vectors of the Model2Vec model and the output vectors of the base model, using a held-out set to perform early stopping. We optimize the embeddings and the norms of the static model separately, because we want to decouple the semantics of the token embeddings from the weight they have in a mean, and also want to encourage the model to pay attention to the weight each individual token has.&lt;/p&gt;
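&lt;p&gt;To make that concrete, here’s a conceptual sketch of the objective (not the actual tokenlearn code; shapes and hyperparameters are made up for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

# Each token gets a direction and a separate scalar norm; the mean-pooled
# sentence vector is regressed onto the base model vector with MSE.
vocab_size, dim = 32_000, 256
directions = torch.nn.Parameter(torch.randn(vocab_size, dim))
log_norms = torch.nn.Parameter(torch.zeros(vocab_size))
optimizer = torch.optim.Adam([directions, log_norms], lr=1e-3)

def embed(token_ids: torch.Tensor) -&gt; torch.Tensor:
    # Unit-normalize the directions, then weight each token by its own norm.
    unit = torch.nn.functional.normalize(directions[token_ids], dim=-1)
    weighted = unit * log_norms[token_ids].exp().unsqueeze(-1)
    return weighted.mean(dim=1)

def train_step(token_ids: torch.Tensor, targets: torch.Tensor) -&gt; float:
    loss = torch.nn.functional.mse_loss(embed(token_ids), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
&lt;/code&gt;&lt;/pre&gt;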
&lt;p&gt;Applying PCA to both the base model’s target embeddings and the output embeddings turns out to be extremely important. If this is not done, the knowledge distillation step does not work at all.&lt;/p&gt;
&lt;h2 id=&quot;differences-between-the-new-and-old-tokenlearn&quot;&gt;Differences between the new and old tokenlearn&lt;/h2&gt;
&lt;p&gt;In the old tokenlearn, we also applied a post-processing step, wherein we applied PCA over the learned weights, and then applied a &lt;a href=&quot;https://openreview.net/pdf?id=SyK00v5xx&quot;&gt;SIF-like&lt;/a&gt; transform to the embeddings. These steps are now no longer necessary.&lt;/p&gt;</content:encoded></item><item><title>New improvements to model2vec distillation</title><link>https://minish.ai/blog/2025-02-05-improvements/</link><guid isPermaLink="true">https://minish.ai/blog/2025-02-05-improvements/</guid><description>We’ve made a lot of improvements to Model2Vec since it came out, many of which target the baseline performance of our distillation process. In this post, we walk through each change and explain why it matters for making your models smaller and faster.

</description><pubDate>Wed, 05 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We’ve made a lot of improvements to &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt; since it came out, many of which target the baseline performance of our distillation process.&lt;/p&gt;
&lt;p&gt;This post details how the distillation process has changed over time, and how this has impacted baseline performance of model2vec models. Spoiler alert: if you distilled a model a couple of months ago, it can really pay off to update model2vec and re-run the distillation process.&lt;/p&gt;
&lt;h1 id=&quot;improvements&quot;&gt;Improvements&lt;/h1&gt;
&lt;p&gt;Here are the improvements, in order of their appearance. In the last section, we’ll contrast all of them, and show their impact on MTEB performance.&lt;/p&gt;
&lt;p&gt;For all experiments, we distill &lt;a href=&quot;https://huggingface.co/BAAI/bge-base-en-v1.5&quot;&gt;&lt;code dir=&quot;auto&quot;&gt;baai/bge-base-en-v1.5&lt;/code&gt;&lt;/a&gt; using the default parameters.&lt;/p&gt;
&lt;h2 id=&quot;basic&quot;&gt;Basic&lt;/h2&gt;
&lt;p&gt;As a reference, the basic operations we apply when distilling are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Token selection&lt;/em&gt;: we propagate all individual tokens through the model together with an EOS and BOS token, and then &lt;em&gt;select&lt;/em&gt; the middle token as the representation.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;PCA&lt;/em&gt;: apply PCA with a specific number of dimensions (256 for all models).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Zipf&lt;/em&gt;: we weight all individual tokens by estimating their frequency using &lt;a href=&quot;https://en.wikipedia.org/wiki/Zipf%27s_law&quot;&gt;Zipf’s law&lt;/a&gt;. The short of it is that we assume all tokens in the vocabulary are in &lt;em&gt;rank order&lt;/em&gt;, and that they follow a power law distribution.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We tried many variations on this theme, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Replacing PCA: we tried &lt;a href=&quot;https://en.wikipedia.org/wiki/Independent_component_analysis&quot;&gt;ICA&lt;/a&gt;, &lt;a href=&quot;https://umap-learn.readthedocs.io/en/latest/basic_usage.html&quot;&gt;umap&lt;/a&gt;, and &lt;a href=&quot;https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding&quot;&gt;T-SNE&lt;/a&gt;. All worked a lot worse.&lt;/li&gt;
&lt;li&gt;Using different propagation strategies: we tried not including BOS/EOS, either only BOS or only EOS, and pooling over the BOS token (i.e., &lt;code dir=&quot;auto&quot;&gt;[CLS]&lt;/code&gt; pooling).&lt;/li&gt;
&lt;li&gt;Using different weighting strategies, including TF-IDF.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of these really had the desired effect, but feel free to let us know if you come up with something else!&lt;/p&gt;
&lt;p&gt;The basic performance of our model with these strategies on MTEB is &lt;strong&gt;45.34&lt;/strong&gt;. We released this model as &lt;a href=&quot;https://huggingface.co/minishlab/M2V_base_output&quot;&gt;m2v_base_output&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;1-pooling&quot;&gt;1. Pooling&lt;/h2&gt;
&lt;p&gt;As a first change, we switched from selecting the middle token to mean pooling; that is, the representation of a token is the mean of the &lt;code dir=&quot;auto&quot;&gt;EOS token BOS&lt;/code&gt; sequence we pass forward through the network.&lt;/p&gt;
&lt;p&gt;In code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Before:
embedding = model([&quot;EOS token BOS&quot;])[:, 1]
# Now:
embedding = model([&quot;EOS token BOS&quot;]).mean(1)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also tried a variety of other pooling strategies, including selecting specific tokens, adding queries, and adding prompts.&lt;/p&gt;
&lt;p&gt;This raises the average score from &lt;strong&gt;45.34&lt;/strong&gt; to &lt;strong&gt;45.91&lt;/strong&gt;, but has a larger effect on models that don’t perform well to begin with, such as ModernBERT-based models.&lt;/p&gt;
&lt;h2 id=&quot;2-sif-weighting&quot;&gt;2. SIF weighting&lt;/h2&gt;
&lt;p&gt;Following this, we replaced the Zipf weighting with a strategy based on the well-known &lt;a href=&quot;https://openreview.net/pdf?id=SyK00v5xx&quot;&gt;SIF algorithm&lt;/a&gt;. In short, this algorithm creates a probability distribution over all tokens in the vocabulary, and downweights very frequent tokens, while upweighting very infrequent tokens. For weighting, it uses the following formula:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sif = alpha / (alpha + proba)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, &lt;code dir=&quot;auto&quot;&gt;proba&lt;/code&gt; is a vector of token probabilities and &lt;code dir=&quot;auto&quot;&gt;alpha&lt;/code&gt; is a smoothing constant. As before, we use Zipf’s law to estimate the token probabilities, because we don’t have access to the true ones. Applying this on top of the mean pooling raises the score from &lt;strong&gt;45.91&lt;/strong&gt; to &lt;strong&gt;47.40&lt;/strong&gt;.&lt;/p&gt;
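&lt;p&gt;Putting the two together, the weights can be computed like this (a small sketch; &lt;code dir=&quot;auto&quot;&gt;alpha=1e-3&lt;/code&gt; is the value from the SIF paper, used here purely for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

# Zipf: assume tokens are in rank order, with power-law probabilities.
vocab_size = 32_000
ranks = np.arange(1, vocab_size + 1)
proba = (1.0 / ranks) / np.sum(1.0 / ranks)

# SIF: downweight frequent tokens, upweight infrequent ones.
alpha = 1e-3
sif_weights = alpha / (alpha + proba)
&lt;/code&gt;&lt;/pre&gt;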
&lt;h2 id=&quot;3-normalization&quot;&gt;3. Normalization&lt;/h2&gt;
&lt;p&gt;Normalization has been a part of model2vec from the very first version. This is a boolean flag that, when set to &lt;code dir=&quot;auto&quot;&gt;True&lt;/code&gt;, unit normalizes all output vectors. This is set to &lt;code dir=&quot;auto&quot;&gt;False&lt;/code&gt; by default, but this turns out to be a bad choice. Setting it to &lt;code dir=&quot;auto&quot;&gt;True&lt;/code&gt; has a significant positive effect, especially on retrieval and clustering, and raises the average score from &lt;strong&gt;47.40&lt;/strong&gt; to a whopping &lt;strong&gt;47.79&lt;/strong&gt;.&lt;/p&gt;
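&lt;p&gt;In code, this is just a flag when loading a model (a sketch; the exact argument name here is an assumption on our part, so check the model2vec docs):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from model2vec import StaticModel

# NOTE: the `normalize` argument name is an assumption; check the docs.
model = StaticModel.from_pretrained(&quot;minishlab/M2V_base_output&quot;, normalize=True)
embeddings = model.encode([&quot;some sentence&quot;])  # rows now have unit L2 norm
&lt;/code&gt;&lt;/pre&gt;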
&lt;h1 id=&quot;taking-stock&quot;&gt;Taking stock&lt;/h1&gt;
&lt;p&gt;If you want more details, you can find the full table below. As you can see, the improvements we found are general, in the sense that they improve performance for all tasks except PEARL. Anecdotally, this also seems to hold for other models we tried.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;m2v_base_output&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;+mean pooling&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;+sif&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;+norm&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Average (All)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.79&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.32&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.42&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.59&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Average (MTEB)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.34&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.91&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.4&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.79&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Classification&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.25&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.43&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;63.76&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;63.22&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Clustering&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.13&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;27.19&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.71&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PairClassification&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;75.23&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;75.22&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Reranking&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.73&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.29&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.29&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Retrieval&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.14&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;27.17&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;28.93&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;28.93&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;STS&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;69.31&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;70.89&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;70.89&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Summarization&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.2&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.45&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.32&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.35&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PEARL&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.02&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.22&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;53.88&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;52.73&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;WordSim&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.18&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.7&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.63&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;As you can see, adding the improvements increases the scores for distillations across all tasks, with PEARL being the notable exception.&lt;/p&gt;
&lt;h1 id=&quot;where-to-go-from-here&quot;&gt;Where to go from here?&lt;/h1&gt;
&lt;p&gt;One active area of improvement is to make it a lot easier to tune your model on a specific dataset, so that the model gains knowledge about the specific problem or language you’re trying to tackle. This will come up in a next release.&lt;/p&gt;
&lt;p&gt;As always, if you have questions, don’t hesitate to reach out!&lt;/p&gt;</content:encoded></item><item><title>ModernBERT support and why it doesn&apos;t work</title><link>https://minish.ai/blog/2025-01-29-modernbert/</link><guid isPermaLink="true">https://minish.ai/blog/2025-01-29-modernbert/</guid><description>Our newest shiny release is here! 0.3.8! This is a small release in the lead-up to a big one we’ll be releasing next week. See here for the release notes, and read on for details about ModernBERT compatibility (spoiler: it’s trickier than you’d think).

</description><pubDate>Wed, 29 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Our newest shiny release is here! 0.3.8! This is a small release in the lead-up to a big one we’ll be releasing next week. See &lt;a href=&quot;https://github.com/MinishLab/model2vec/releases/tag/v0.3.8&quot;&gt;here&lt;/a&gt; for the release notes.&lt;/p&gt;
&lt;p&gt;The biggest feature in this release is support for &lt;a href=&quot;https://huggingface.co/blog/modernbert&quot;&gt;ModernBERT&lt;/a&gt;! As the name implies, ModernBERT is a refresh of the venerable BERT model, trained on more data, with lots of nice tricks; harder, better, faster, stronger. Since its release at the end of last year, many embedders based on ModernBERT have appeared, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/nomic-ai/modernbert-embed-base&quot;&gt;nomic-ai/modernbert-embed-base&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/Alibaba-NLP/gte-modernbert-base&quot;&gt;alibaba-nlp/gte-modernbert-base&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And probably many more.&lt;/p&gt;
&lt;p&gt;We didn’t support ModernBERT out of the box because of a &lt;del&gt;bug&lt;/del&gt; design decision, which we fixed in this release. Frustratingly, however, distilling a very good ModernBERT model does not lead to a good model2vec model. This blog post details why we think that is the case: we give a bunch of numbers and some explanations.&lt;/p&gt;
&lt;h1 id=&quot;distilling-modernbert&quot;&gt;Distilling ModernBERT&lt;/h1&gt;
&lt;p&gt;As you probably know, a model2vec model is created by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Downloading an existing sentence transformer&lt;/li&gt;
&lt;li&gt;Embedding all tokens in the vocabulary (without context)&lt;/li&gt;
&lt;li&gt;Reducing the resulting embeddings in size using PCA&lt;/li&gt;
&lt;li&gt;Reweighting them using Zipf weighting&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As ModernBERT-based models have about 50k tokens in their tokenizer, this is also how many embeddings our model2vec model will have.&lt;/p&gt;
&lt;p&gt;So, we created a model2vec distill of both ModernBERT-based embedders above. We fully expected this to work well, because in previous experiments, we saw that BERT-based encoder models worked best for model2vec distillation.&lt;/p&gt;
&lt;p&gt;Here are the scores on a subset of &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;MTEB&lt;/a&gt; tasks, compared with a straight &lt;a href=&quot;https://huggingface.co/BAAI/bge-base-en-v1.5&quot;&gt;BAAI/bge-base-en-v1.5&lt;/a&gt; distill. Note that both &lt;code dir=&quot;auto&quot;&gt;gte-modernbert-base&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;nomic-ai/modernbert-embed-base&lt;/code&gt; outperform &lt;code dir=&quot;auto&quot;&gt;bge-base-en-v1.5&lt;/code&gt; on the MTEB leaderboard, so we expected a distilled model to also perform better.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;STS&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;WordSim&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Classification&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;bge-base-en-v1.5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;69.3&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.7&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;62.4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;gte-modernbert-base&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;66.5 (-2.8)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.6 (-24.1)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;60.4 (-2.0)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;modernbert-embed-base&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;65.1 (-4.2)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.1 (-23.6)&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;59.4 (-3.0)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;As you can see, that’s not the case at all. &lt;code dir=&quot;auto&quot;&gt;bge-base-en-v1.5&lt;/code&gt; outperforms both ModernBERT-based distills on all tasks, and with a huuuuuge margin on &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt;. Luckily for us, the &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt; task provides us with a good reason for why this is the case.&lt;/p&gt;
&lt;h2 id=&quot;wordsim&quot;&gt;WordSim&lt;/h2&gt;
&lt;p&gt;First, let’s talk about WordSim! WordSim is a very simple Semantic Textual Similarity task, comprising 7 datasets, in which the cosine similarity between the embeddings of single words is correlated with human judgments of similarity.&lt;/p&gt;
&lt;p&gt;For example, if &lt;code dir=&quot;auto&quot;&gt;apple&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;pear&lt;/code&gt; are judged to be similar by humans, your model must give them a high cosine similarity in order to score high on this task.&lt;/p&gt;
&lt;p&gt;This task is interesting to us because it provides us with an estimate of how good a model2vec model is at modeling lexical similarity without having access to any context. We also see that, for model2vec models, performing well on &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt; correlates with performance on other tasks.&lt;/p&gt;
&lt;p&gt;What is interesting about the performance of ModernBERT on &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt; is that it is atrociously low, lower than any model we’ve seen before, and that it does not seem to correlate at all with performance on other tasks, on which it scores lower, but not atrociously low.&lt;/p&gt;
&lt;p&gt;But why could this be the case, and why would it hold for both models? Because it seems to hurt both models equally, it looks like something in the base model is to blame.&lt;/p&gt;
&lt;p&gt;In our view, the answer is likely to be the tokenizer used in ModernBERT. ModernBERT’s tokenizer, unlike the traditional &lt;code dir=&quot;auto&quot;&gt;BERT&lt;/code&gt; tokenizer, which is used in a lot of embedders, is a byte-pair encoding (BPE) tokenizer. To see what this means, let’s take a look at five random BPE tokens from ModernBERT’s tokenizer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Ġnickel
ercul
tar
^),
encephal
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can probably see, these tokens are not very likely to be informative by themselves: we can’t just embed &lt;code dir=&quot;auto&quot;&gt;ercul&lt;/code&gt; and expect something useful. In contrast, here’s five tokens from the &lt;code dir=&quot;auto&quot;&gt;WordPiece&lt;/code&gt;-based BERT tokenizer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;lastly
##ect
electro
defendants
ventured
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, the &lt;code dir=&quot;auto&quot;&gt;WordPiece&lt;/code&gt; tokenizer has tokens that are more easily interpreted as words. Because BPE tokens are less likely to be words or naturally occurring suffixes, the model likely has to perform more operations to contextualize words, making it a bad fit for uncontextualized embeddings, such as model2vec models.&lt;/p&gt;
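&lt;p&gt;You can reproduce this comparison yourself (a small sketch using Hugging Face tokenizers; we use the &lt;code dir=&quot;auto&quot;&gt;answerdotai/ModernBERT-base&lt;/code&gt; checkpoint here):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random

from transformers import AutoTokenizer

# Sample a few vocabulary items from a BPE tokenizer (ModernBERT) and a
# WordPiece tokenizer (BERT) to see the difference in token shape.
for name in [&quot;answerdotai/ModernBERT-base&quot;, &quot;bert-base-uncased&quot;]:
    vocab = list(AutoTokenizer.from_pretrained(name).get_vocab())
    print(name, random.sample(vocab, 5))
&lt;/code&gt;&lt;/pre&gt;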
&lt;p&gt;In addition: the tokenizer used in ModernBERT is a &lt;em&gt;cased&lt;/em&gt; tokenizer, which means that it contains both upper- and lowercase tokens. But, again, without any contextual cues, there is very little difference between upper- and lowercase tokens.&lt;/p&gt;
&lt;p&gt;We think that both of these factors combined, but especially the BPE tokens, lead to low performance of the distilled model. The fact that both of the ModernBERT based models suffer from the same issue shows that the issue is likely caused by the base model, and not the specific fine-tuning strategy used.&lt;/p&gt;
&lt;h2 id=&quot;fixes-we-tried&quot;&gt;Fixes we tried&lt;/h2&gt;
&lt;p&gt;Of course, we realize you might be skeptical after reading this, so here’s some things we tried:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using CLS pooling&lt;/li&gt;
&lt;li&gt;Using mean pooling&lt;/li&gt;
&lt;li&gt;Pooling by selecting the wordpiece&lt;/li&gt;
&lt;li&gt;Reversing the order of the BOS/EOS tokens&lt;/li&gt;
&lt;li&gt;Not applying PCA&lt;/li&gt;
&lt;li&gt;Not applying Zipf&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And all combinations of the above.&lt;/p&gt;
&lt;h2 id=&quot;future-work&quot;&gt;Future work&lt;/h2&gt;
&lt;p&gt;Support for mutating BPE tokenizers in model2vec is lacking: we don’t allow vocabulary changes for BPE tokenizers, but we do allow it for WordPiece tokenizers.&lt;/p&gt;
&lt;p&gt;If token removal were allowed, we could test whether the casing affects performance. If adding tokens to the tokenizer were allowed, we could see whether adding the words in &lt;code dir=&quot;auto&quot;&gt;WordSim&lt;/code&gt; would improve performance.&lt;/p&gt;
&lt;p&gt;So one thing on our roadmap, but a very low priority one, is to add support for token addition and/or removal to model2vec. If you have an idea on how to do it, please let us know!&lt;/p&gt;</content:encoded></item><item><title>semhash: deduplication and dataset multitool</title><link>https://minish.ai/blog/2025-01-12-semhash-blogpost/</link><guid isPermaLink="true">https://minish.ai/blog/2025-01-12-semhash-blogpost/</guid><description>We’re super excited to announce the release of semhash, our semantic deduplication and dataset multitool (other features coming soon).

</description><pubDate>Sun, 12 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We’re super excited to announce the release of &lt;a href=&quot;https://github.com/MinishLab/semhash&quot;&gt;semhash&lt;/a&gt;, our semantic deduplication and dataset multitool (other features coming soon).&lt;/p&gt;
&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;
&lt;p&gt;One area of recent interest, especially around training Large Language Models (LLMs), is that having a lot of data is great, but having a little less &lt;em&gt;high quality&lt;/em&gt; data is even better. A good example of this can be found in the &lt;a href=&quot;https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1&quot;&gt;fineweb blogpost&lt;/a&gt;, where the authors start from a really big set of Common Crawl dumps, on which they perform deduplication and a suite of quality checks.&lt;/p&gt;
&lt;p&gt;At Minish, we’re interested in unlocking new possibilities by making very fast models. As you may know, we created the best smallest fast model in the world, &lt;a href=&quot;https://huggingface.co/minishlab/potion-base-8M&quot;&gt;potion-base-8m&lt;/a&gt;. One of the areas we are interested in is &lt;code dir=&quot;auto&quot;&gt;approximate deduplication&lt;/code&gt;: we want to remove documents that are semantically very similar from a corpus. Previous text deduplication algorithms, like minhash or simhash, operate on character or word ngrams, and therefore only find similarity between sequences that are orthographically similar, and ignore semantic similarity.&lt;/p&gt;
&lt;p&gt;While deduplication sounds like something that can only benefit LLM training, it can also be really beneficial to check small datasets for overlap: having even approximate overlap between train and test leads to performance overestimation, and having approximate duplicates in train leads to wasted compute, overestimation of feature importance, and a potential host of other issues.&lt;/p&gt;
&lt;p&gt;Additionally, deduplication techniques can also be used to give you a bird’s eye view of larger datasets: checking approximate duplicates using &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; takes (milli)seconds, and allows you to see which items from your dataset look alike. If these make sense: great! If there are no duplicates… also great! Everything is better than training on incorrect data.&lt;/p&gt;
&lt;h1 id=&quot;how-can-i-use-deduplication&quot;&gt;How can I use deduplication?&lt;/h1&gt;
&lt;p&gt;Here’s some cool use-cases to give you an idea on when deduplication makes sense:&lt;/p&gt;
&lt;h2 id=&quot;classification&quot;&gt;Classification&lt;/h2&gt;
&lt;p&gt;As mentioned above, it is important that there is no overlap in information between your train and test splits. Having overlap generally means that you overestimate performance, because the model no longer needs to generalize to perform well. Removing duplicates from within the train set, however, can also be very useful. Having a large number of duplicates of the same record in the training set makes the model overestimate the importance of the features of that record, and, in any case, leads to wasted compute and an overestimation of model fit.&lt;/p&gt;
&lt;h2 id=&quot;rag-systems&quot;&gt;RAG systems&lt;/h2&gt;
&lt;p&gt;Duplicates in RAG systems sound like something rare, until you consider that most RAG systems are built using chunks: while having completely duplicated documents will probably be rare, having duplicate chunks across documents or within documents is a lot more common. Having duplicate chunks in your knowledge base increases storage costs, increases the risk of retrieving irrelevant chunks, and forces you to implement diversification strategies much sooner than necessary.&lt;/p&gt;
&lt;h2 id=&quot;explain-your-corpus&quot;&gt;Explain your corpus&lt;/h2&gt;
&lt;p&gt;By running &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; with a low threshold, you can quickly get an overview of which documents are similar to others, and which aren’t. This gives you a good idea of what to focus on, what kind of things are missing from your data, and how your documents relate to one another.&lt;/p&gt;
&lt;h1 id=&quot;how-does-it-work&quot;&gt;How does it work?&lt;/h1&gt;
&lt;p&gt;At its core, &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; takes as input a collection of strings or dictionaries. You first initialize a model using a set of reference documents, and then use this set of documents to deduplicate an incoming set. Any incoming document that is similar to a document from the reference set is removed, and stored separately with its approximate duplicates from the reference set.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datasets import load_dataset

from semhash import SemHash

dataset = load_dataset(&quot;ag_news&quot;)
train = dataset[&quot;train&quot;]
test = dataset[&quot;test&quot;]

# This creates an index over your train set. All records are stored in their entirety.
semhash = SemHash.from_records(records=train, columns=[&quot;text&quot;])
# This deduplicates your texts with reference to `train`. Any items occurring in train are
# removed from test.
result = semhash.deduplicate(test, threshold=0.9)

# Set without duplicates
result.deduplicated

# Duplicates
result.duplicates
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;During fitting, all documents are first encoded by an encoder. The default encoder is &lt;a href=&quot;https://huggingface.co/minishlab/potion-base-8M&quot;&gt;potion-base-8m&lt;/a&gt;, a &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt; model. The documents are then stored in a &lt;a href=&quot;https://github.com/MinishLab/vicinity&quot;&gt;vicinity&lt;/a&gt; vector store, backed by &lt;a href=&quot;https://github.com/unum-cloud/usearch&quot;&gt;usearch&lt;/a&gt;. Then, for an incoming set of documents, we first encode them using the specified encoder, and then retrieve the nearest neighbors from the vector store. Every incoming document that has a nearest neighbor with a similarity above the threshold gets removed.&lt;/p&gt;
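&lt;p&gt;To make the mechanics concrete, here is a rough sketch of the same idea using &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt; and plain NumPy. This is illustrative only: the actual implementation uses the vicinity/usearch index rather than a brute-force similarity matrix.&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained(&quot;minishlab/potion-base-8M&quot;)

reference = [&quot;the cat sat on the mat&quot;, &quot;dogs make great pets&quot;]
incoming = [&quot;a cat was sitting on a mat&quot;, &quot;quantum computing is hard&quot;]

# Encode and L2-normalize, so that a dot product equals cosine similarity.
ref_emb = model.encode(reference)
inc_emb = model.encode(incoming)
ref_emb /= np.linalg.norm(ref_emb, axis=1, keepdims=True)
inc_emb /= np.linalg.norm(inc_emb, axis=1, keepdims=True)

# For each incoming document, the similarity to its nearest reference document.
best = (inc_emb @ ref_emb.T).max(axis=1)

threshold = 0.9
deduplicated = [doc for doc, sim in zip(incoming, best) if sim &lt; threshold]
duplicates = [doc for doc, sim in zip(incoming, best) if sim &gt;= threshold]
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;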
&lt;p&gt;Because all of these components are very fast, deduplicating even really large datasets only takes minutes. For example, deduplicating the entire &lt;a href=&quot;https://huggingface.co/datasets/rajpurkar/squad_v2&quot;&gt;Squad-2.0 dataset&lt;/a&gt;, which has 130,000 samples, takes only 7 seconds. This includes vectorization, fitting the index, and the actual deduplication. Smaller datasets take only a fraction of this time, while even datasets containing millions of documents take only minutes. For a comprehensive benchmark, see &lt;a href=&quot;https://github.com/MinishLab/semhash?tab=readme-ov-file#benchmarks&quot;&gt;our benchmarks&lt;/a&gt;.&lt;/p&gt;
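&lt;p&gt;If you want a rough timing on your own machine (numbers will vary with hardware), something like the following reproduces the setup:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import time

from datasets import load_dataset

from semhash import SemHash

train = load_dataset(&quot;rajpurkar/squad_v2&quot;)[&quot;train&quot;]

start = time.perf_counter()
semhash = SemHash.from_records(records=train, columns=[&quot;context&quot;, &quot;question&quot;])
result = semhash.self_deduplicate(threshold=0.9)
print(f&quot;took {time.perf_counter() - start:.1f}s, duplicate ratio: {result.duplicate_ratio:.2%}&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;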
&lt;h2 id=&quot;explainability&quot;&gt;Explainability&lt;/h2&gt;
&lt;p&gt;&lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; can also be used to investigate your dataset. By using &lt;code dir=&quot;auto&quot;&gt;self_deduplicate&lt;/code&gt;, you can deduplicate the training set itself, which we will use as a jumping-off point:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; datasets &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; load_dataset&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; semhash &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; SemHash&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;dataset &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;load_dataset&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;ag_news&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;train &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; dataset[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;train&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;test &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; dataset[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;test&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# This creates an index over your train set. All records are stored in their entirety.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;semhash &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; SemHash.&lt;/span&gt;&lt;span&gt;from_records&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;records&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;train&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;columns&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;text&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; semhash.&lt;/span&gt;&lt;span&gt;self_deduplicate&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;threshold&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;0.9&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Let’s dive into what you can do with the &lt;code dir=&quot;auto&quot;&gt;result&lt;/code&gt;. First off, you can just get all deduplicated records:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result.deduplicated&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;These records are exactly the records you put in, allowing you to use &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; within other ML pipelines. &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; doesn’t change your data; it just reduces it in size.&lt;/p&gt;
&lt;p&gt;You can easily see the proportion of records that were duplicates:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result.duplicate_ratio&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;or exact duplicates:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result.exact_duplicate_ratio&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;You can also see what got marked as a duplicate, and &lt;em&gt;why&lt;/em&gt;. Each duplicated document is stored together with the examples from the index that caused it to be marked as a duplicate, and exact matches are flagged explicitly. The following code example demonstrates basic usage:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;for&lt;/span&gt;&lt;span&gt; duplicated_record &lt;/span&gt;&lt;span&gt;in&lt;/span&gt;&lt;span&gt; result.duplicates:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;duplicated_record.record&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; duplicated_record.exact:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Exact match&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;continue&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;for&lt;/span&gt;&lt;span&gt; index_duplicate &lt;/span&gt;&lt;span&gt;in&lt;/span&gt;&lt;span&gt; duplicated_record.duplicates:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;index_duplicate&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;-&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;*&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;25&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;For ease of use, we also provide a helper function that shows you the &lt;em&gt;least&lt;/em&gt; similar deduplication record in your set of duplicates:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result.&lt;/span&gt;&lt;span&gt;get_least_similar_from_duplicates&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;1&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;If this record still plausibly counts as a duplicate of the record it matched, your deduplication strategy makes sense! If it doesn’t, you can choose to re-threshold your result set: applying a stricter (higher) threshold means fewer records count as duplicates. It works as follows:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;result.duplicate_ratio&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result.&lt;/span&gt;&lt;span&gt;rethreshold&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;0.95&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;result.duplicate_ratio&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;So, a general strategy could be to start with a relatively low threshold and re-threshold upwards until the results returned by &lt;code dir=&quot;auto&quot;&gt;result.get_least_similar_from_duplicates&lt;/code&gt; start making sense. In our experiments, a threshold of 0.9, which is the default, works fine, but be sure to check for your individual use cases.&lt;/p&gt;
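&lt;p&gt;A sketch of that strategy, using only the calls shown above (the specific threshold grid is just an example):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;# Start low, then re-threshold upwards and inspect the weakest duplicate each time.
result = semhash.self_deduplicate(threshold=0.7)
for threshold in (0.8, 0.85, 0.9, 0.95):
    result.rethreshold(threshold)
    print(threshold, result.duplicate_ratio)
    print(result.get_least_similar_from_duplicates(1))
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;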
&lt;h1 id=&quot;multi-column-data&quot;&gt;Multi-column data&lt;/h1&gt;
&lt;p&gt;&lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; also supports multi-column datasets, allowing you to deduplicate datasets that have text in multiple columns. For example, in QA datasets, you don’t just want to deduplicate on similar questions or similar contexts; you only want to count items as duplicates if both fields are sufficiently similar.&lt;/p&gt;
&lt;p&gt;This is a difficult problem to tackle, but &lt;code dir=&quot;auto&quot;&gt;semhash&lt;/code&gt; can also handle this.&lt;/p&gt;
&lt;p&gt;The following snippet demonstrates how this works:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; datasets &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; load_dataset&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; semhash &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; SemHash&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;dataset &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;load_dataset&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;rajpurkar/squad_v2&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;train &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; dataset[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;train&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# This creates an index over your train set. All records are stored in their entirety.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;semhash &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; SemHash.&lt;/span&gt;&lt;span&gt;from_records&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;records&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;train&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;columns&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;context&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;question&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; semhash.&lt;/span&gt;&lt;span&gt;self_deduplicate&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;threshold&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;0.9&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This computes similarities per column, and only marks records as duplicates for which both fields are sufficiently similar.&lt;/p&gt;
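&lt;p&gt;Conceptually, a record pair only counts as a duplicate when every column clears the threshold. In toy form (a simplified illustration, not the actual internals):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;# Simplified decision rule: both columns must clear the threshold.
sim_context, sim_question = 0.97, 0.85
threshold = 0.9
is_duplicate = min(sim_context, sim_question) &gt;= threshold  # False: the questions differ too much
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;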
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Semhash is great! &lt;a href=&quot;https://github.com/MinishLab/semhash&quot;&gt;Get semhash here&lt;/a&gt;!&lt;/p&gt;</content:encoded></item><item><title>POTION: bag of tricks leads to better models</title><link>https://minish.ai/blog/2024-10-29-tokenlearn-blogpost/</link><guid isPermaLink="true">https://minish.ai/blog/2024-10-29-tokenlearn-blogpost/</guid><description>This blogpost describes the Tokenlearn method, which is a method to pre‐train Model2Vec models.

</description><pubDate>Tue, 29 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This blogpost describes the &lt;a href=&quot;https://github.com/MinishLab/tokenlearn&quot;&gt;Tokenlearn&lt;/a&gt; method, which is a method to pre-train Model2Vec models.&lt;/p&gt;
&lt;p&gt;We’ve been brewing, concocting, distilling, and came up with a new distillation technique that leads to much better models, which we are now releasing under the name POTION. We open source all models, code, and data.&lt;/p&gt;
&lt;p&gt;We’re releasing three versions: a 64-dim (1.9M params), 128-dim (3.8M params), and 256-dim (7.6M params) model, all based on the same base model, which is, in turn, a bge-base distillation. All POTION models outperform all previous distillations in their size class, and should be considered drop-in replacements for our M2V_base_output model. potion-base-8M, in particular, even improves over our largest model, M2V_base_glove. potion-base-8M is better than any set of static embeddings we could find on any task, including GloVe, fastText, and specialized word embeddings.&lt;/p&gt;
&lt;p&gt;Get them here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/MinishLab/potion-base-8M&quot;&gt;potion-base-8M&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/MinishLab/potion-base-4M&quot;&gt;potion-base-4M&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/MinishLab/potion-base-2M&quot;&gt;potion-base-2M&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Tokenlearn code can be found &lt;a href=&quot;https://github.com/MinishLab/tokenlearn&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The rest of the post will detail how we made the models, how they perform, and further improvements we have in store.&lt;/p&gt;
&lt;h2 id=&quot;distillation&quot;&gt;Distillation&lt;/h2&gt;
&lt;p&gt;In our regular &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt; framework we distill sentence transformers down to really fast tiny models by doing a forward pass for all tokens separately. We then perform Principal Component Analysis (PCA) on the resulting embeddings, and weight the individual embeddings via Zipf’s law. See our previous blog post &lt;a href=&quot;https://huggingface.co/blog/Pringled/model2vec&quot;&gt;here&lt;/a&gt;. The new distillation framework is composed of 4 steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Model2Vec distillation&lt;/li&gt;
&lt;li&gt;Sentence transformer inference&lt;/li&gt;
&lt;li&gt;Training&lt;/li&gt;
&lt;li&gt;Post-training regularization&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These four steps take a bit longer than the previous distillation framework. If you are looking for a quick way to get a model2vec model, distillation is still your best bet. If you are looking for maximum performance, read on!&lt;/p&gt;
&lt;h3 id=&quot;1-distillation&quot;&gt;1. Distillation&lt;/h3&gt;
&lt;p&gt;We start from a distilled model. In our case, we are using the M2V_base_output model as our starting point.&lt;/p&gt;
&lt;h3 id=&quot;2-sentence-transformer-inference&quot;&gt;2. Sentence transformer inference&lt;/h3&gt;
&lt;p&gt;We then go back to the original big sentence transformer, and use that transformer to create ~1M embeddings on an in-domain corpus, which for us is &lt;a href=&quot;https://huggingface.co/datasets/allenai/c4&quot;&gt;C4&lt;/a&gt;. We then throw away the sentence transformer, never to see it again. Forget it existed.&lt;/p&gt;
&lt;h3 id=&quot;3-training&quot;&gt;3. Training&lt;/h3&gt;
&lt;p&gt;So, we now have a base model, 1M texts, and 1M vector representations of those texts. We then train the base model to minimize the cosine distance between the representations it produces and the representations we produced before. In doing so, our model learns to better mimic the representations made by a large model. We also add a super heavy regularization term to the produced embeddings.&lt;/p&gt;
&lt;p&gt;During training, we apply a few standard methods to improve performance, such as reducing the learning rate on plateau, and early stopping.&lt;/p&gt;
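&lt;p&gt;In pseudo-PyTorch, the objective looks roughly like this (a sketch: the regularization term below is a plain L2 penalty standing in for the actual regularizer):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def tokenlearn_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor, reg_weight: float = 1.0) -&gt; torch.Tensor:
    # Minimize the cosine distance between student and teacher sentence representations.
    cosine_loss = (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
    # Stand-in for the heavy regularization term on the produced embeddings.
    reg = student_emb.pow(2).mean()
    return cosine_loss + reg_weight * reg
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;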
&lt;h3 id=&quot;4-post-training-re-regularization&quot;&gt;4. Post-training re-regularization&lt;/h3&gt;
&lt;p&gt;Finally, after training, we &lt;em&gt;re-regularize&lt;/em&gt; our models by performing PCA, and by manually re-weighting individual tokens.&lt;/p&gt;
&lt;p&gt;Of note here is the manual re-weighting, which is very similar to the Zipf weighting we use, but now relies on external data. Before, we assumed that all tokens were in rank order, and simply weighted them as follows:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;w &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;log&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;1&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;/&lt;/span&gt;&lt;span&gt; rank&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This works really well, as shown in &lt;a href=&quot;https://huggingface.co/blog/Pringled/model2vec&quot;&gt;our original blog post&lt;/a&gt;. Using actual frequencies, however, works even better. We use the same 1M documents on which we trained, and collect token probabilities for all tokens in our vocabulary. We then reweight using the following formula from the &lt;a href=&quot;https://openreview.net/pdf?id=SyK00v5xx&quot;&gt;SIF paper&lt;/a&gt;:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;w &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;1e-3&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;/&lt;/span&gt;&lt;span&gt; (&lt;/span&gt;&lt;span&gt;1e-3&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;+&lt;/span&gt;&lt;span&gt; proba)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;where &lt;code dir=&quot;auto&quot;&gt;proba&lt;/code&gt; is the probability of the token in the corpus. While this does mean our new distillation method relies on some data, it is &lt;em&gt;worth it&lt;/em&gt;, as we will show below.&lt;/p&gt;
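&lt;p&gt;Applying this is a one-liner over the embedding matrix. A sketch (the random matrices are stand-ins for the real embeddings and corpus counts):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import numpy as np

# Stand-ins: `embeddings` is the (vocab_size x dim) matrix, `counts` the corpus
# frequency of each vocabulary token, collected from the 1M training documents.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(32_000, 256))
counts = rng.integers(1, 1_000_000, size=32_000)

proba = counts / counts.sum()               # token probabilities
weights = 1e-3 / (1e-3 + proba)             # the SIF formula from above
embeddings = embeddings * weights[:, None]  # scale each token vector by its weight
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;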
&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
&lt;p&gt;Just like in our original experiments, we again evaluate on MTEB, as well as our two additional tasks (PEARL and WordSim). The results are shown in the table below.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Model&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Avg (All)&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Avg (MTEB)&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Class&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Clust&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;PairClass&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Rank&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Ret&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;STS&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Sum&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Pearl&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;WordSim&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;all-MiniLM-L6-v2&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;56.08&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;56.09&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;62.62&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;41.94&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;82.37&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;58.04&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;41.95&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;78.90&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.81&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;60.83&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.91&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;potion-base-8M&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.54&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.03&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;64.44&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;32.93&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;76.62&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.73&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;31.71&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;73.24&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.28&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;53.54&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.75&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove_subword&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.06&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.69&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.27&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.03&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.71&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.15&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;27.16&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;69.09&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.08&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;56.82&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;57.99&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;potion-base-4M&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.87&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.23&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;62.19&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;31.47&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;75.37&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.75&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.11&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;72.19&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;28.89&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;52.55&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.21&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.6&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.35&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.52&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;75.34&lt;/td&gt;&lt;td 
align=&quot;right&quot;&gt;48.5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.26&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;70.31&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;31.5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.28&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.29&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.79&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.34&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.25&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.14&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.2&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.02&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;potion-base-2M&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.52&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;44.77&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;58.45&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;27.5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;73.72&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.82&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;24.13&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;70.14&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;31.51&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.82&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;44.72&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;GloVe_300d&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;42.84&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;42.36&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;57.31&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;27.66&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;72.48&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.3&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;22.78&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;28.81&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.65&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.05&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;BPEmb_50k_300d&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;39.34&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;37.78&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.76&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;23.35&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;57.86&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.21&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;17.5&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.1&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.74&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.56&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;41.28&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;As can be seen, potion-base-8M is the best model we have released so far (surpassing the 50% average MTEB score mark!), further pushing the limits of what is possible with static word embeddings. Furthermore, the 4M and 2M models still work quite well, with the 2M model outperforming GloVe while being ~55 times smaller.&lt;/p&gt;
&lt;p&gt;To show the relationship between speed and performance, we plot the average MTEB score against sentences per second. The circle sizes correspond to the number of parameters in the models (larger = more parameters).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://minish.ai/images/blog/post_tokenlearn/speed_vs_mteb_score_v2.png&quot; alt=&quot;SpeedvsAccuracy&quot;&gt;
&lt;em&gt;The average MTEB score plotted against sentences per second. The circle size indicates model size.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title>Model2Vec Introduction blogpost</title><link>https://minish.ai/blog/2024-10-14-hf-blogpost/</link><guid isPermaLink="true">https://minish.ai/blog/2024-10-14-hf-blogpost/</guid><description>This blog was first posted on the Hugging Face blog. We’re also posting it here for archival purposes.

</description><pubDate>Mon, 14 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This blog was first posted on the &lt;a href=&quot;https://huggingface.co/blog/Pringled/model2vec&quot;&gt;Hugging Face blog&lt;/a&gt;. We’re also posting it here for archival purposes.&lt;/p&gt;
&lt;h1 id=&quot;model2vec-distill-a-small-fast-model-from-any-sentence-transformer&quot;&gt;Model2Vec: Distill a Small Fast Model from any Sentence Transformer&lt;/h1&gt;
&lt;p&gt;(Large) language models have become the de facto standard for feature extraction. While these models have shown state-of-the-art performance on a &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;large number of tasks&lt;/a&gt; they also come with heavy resource requirements: large energy consumption, computational demands, and longer processing times. Although there are many ways in which you can make existing (Sentence) Transformers faster, e.g. quantization, or specialized kernels, they are still relatively slow, especially on CPU. What if you need to go faster and are working on a time-constrained product (e.g. a search engine), or have very little resources available?&lt;/p&gt;
&lt;p&gt;This is where &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;Model2Vec&lt;/a&gt; comes in — offering static embeddings that are hardware and eco-friendly while maintaining strong performance.&lt;/p&gt;
&lt;p&gt;In this blog, we will discuss what Model2Vec is, how it works, how you can use it, and its performance.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;https://minish.ai/images/blog/post_hf/ezlo_diagram_side.svg&quot; alt=&quot;Model2Vec&quot;&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;em&gt;Visualization of the Model2Vec architecture.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h3 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#what-is-model2vec&quot;&gt;What is model2vec?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#how-to-use-model2vec&quot;&gt;How to use model2vec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#results&quot;&gt;Results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#acknowledgements&quot;&gt;Acknowledgements&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;what-is-model2vec&quot;&gt;What is Model2Vec?&lt;/h3&gt;
&lt;p&gt;Model2Vec is a technique to distill a small, fast, high-performance static model from any Sentence Transformer. At a high level, it works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using Zipf weighting. No dataset is needed, just a model (and optionally, a vocabulary). During inference, we simply take the mean of all token embeddings occurring in a sentence. A Model2Vec model is therefore completely uncontextualized. While this may sound like a big downside, we’ll show that it still performs quite well considering how small and fast it is.&lt;/p&gt;
&lt;p&gt;The above might sound like a lot to you, so let’s unpack this a little.&lt;/p&gt;
&lt;h4 id=&quot;transformers-and-embeddings&quot;&gt;Transformers and embeddings&lt;/h4&gt;
&lt;p&gt;In a sentence transformer encoding step, a string is first chopped up into subword tokens. The embeddings of these tokens are then fed through the model, which contextualizes them to create high-quality sentence representations. At the output, you get as many embeddings as you put in, so if your input sentence consists of 10 tokens, you also get 10 output embeddings. These embeddings are then turned into a sentence representation by a pooling mechanism, which can either be a simple mean or a special pooler module.&lt;/p&gt;
&lt;p&gt;On to Model2Vec: the project first started as a kind of cache for sentence transformers. Because a transformer vocabulary typically only has about 32k tokens, a word like &lt;code dir=&quot;auto&quot;&gt;astoundingly&lt;/code&gt; gets chopped up into four unique tokens: &lt;code dir=&quot;auto&quot;&gt;&apos;as&apos;, &apos;##tou&apos;, &apos;##nding&apos;, &apos;##ly&apos;&lt;/code&gt;, which means that we re-compute the attention between those four tokens each time this word occurs. But the meaning of this word might not be ambiguous at all!&lt;/p&gt;
&lt;p&gt;However, as we started implementing this, we noticed that you actually do not need to cache any words at all, and you can just use the output representations of individual tokens to get good sentence representations. And this is exactly what the basic mode of operation of Model2Vec is: for each of the 32k input tokens in a sentence transformer vocabulary, we do a forward pass, and then store the resulting embedding. For a new sentence, we then just take the mean of the token embeddings we computed.&lt;/p&gt;
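&lt;p&gt;In code, this basic mode of operation is tiny. A sketch, assuming &lt;code dir=&quot;auto&quot;&gt;token_embeddings&lt;/code&gt; is the precomputed matrix and &lt;code dir=&quot;auto&quot;&gt;tokenizer&lt;/code&gt; a &lt;code dir=&quot;auto&quot;&gt;tokenizers&lt;/code&gt;-style tokenizer:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import numpy as np

def embed(sentence: str, tokenizer, token_embeddings: np.ndarray) -&gt; np.ndarray:
    # Look up the precomputed static vector for each subword token, then average.
    token_ids = tokenizer.encode(sentence, add_special_tokens=False).ids
    return token_embeddings[token_ids].mean(axis=0)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;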
&lt;p&gt;Note that the output token representations of a model2vec model are uncontextualized. Unlike with normal transformer models, there is no way for the model to give different meanings to the same token in different contexts. While this might seem like a huge downside, we think that the actual context provides models with enough disambiguation potential.&lt;/p&gt;
&lt;p&gt;In addition to this trick, we show that two additional tricks are necessary to get optimal performance.&lt;/p&gt;
&lt;h5 id=&quot;pca&quot;&gt;PCA&lt;/h5&gt;
&lt;p&gt;We reduce the dimensionality of the resulting token space by using Principal Component Analysis (PCA). Normally, using PCA is associated with a loss in performance, because you throw away information. However, in our case, reducing the dimensionality actually increased performance significantly. We think this is because PCA also normalizes the resulting space, in the sense of removing biases in the original vector space, thereby making it easier to learn from the vectors.&lt;/p&gt;
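&lt;p&gt;This step is ordinary PCA over the token embedding matrix; a sketch with scikit-learn (the random matrix is a stand-in for the real embeddings):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the 768-dimensional token embeddings from the forward passes.
token_embeddings = np.random.default_rng(0).normal(size=(32_000, 768))

pca = PCA(n_components=256)  # reduce to 256 dimensions (this also centers the space)
token_embeddings = pca.fit_transform(token_embeddings)
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;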
&lt;h5 id=&quot;zipf&quot;&gt;Zipf&lt;/h5&gt;
&lt;p&gt;As we take a simple mean over tokens in the space, it is important that the vectors are weighted correctly. Normally, a sentence transformer would be there to correctly weight all the tokens for us given the context, but we don’t have that luxury any more. Intuitively, we would like to use something like Inverse Document Frequency (IDF) to down-weight very frequent or uninteresting words. But we don’t have access to a corpus over which to compute document frequencies.&lt;/p&gt;
&lt;p&gt;To overcome this, we opt to use a well-known principle from language sciences, which is that, given a frequency-ranked list, the frequency of the items in that list follows a power-law distribution. This is called Zipf’s law. So, if we assume that a vocabulary is ranked by frequency, we can accurately down-weight really frequent items without needing access to actual frequencies. As tokenizer vocabularies are sorted by frequency, we already have a ranked list, so this optimization can be applied without any additional work.&lt;/p&gt;
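&lt;p&gt;As a sketch (the exact weighting function is an implementation detail; plain &lt;code dir=&quot;auto&quot;&gt;1 / rank&lt;/code&gt; below is one simple instance of Zipf-style down-weighting):&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;import numpy as np

# Stand-in embedding matrix; vocabularies are (roughly) frequency-sorted,
# so the rank of a token is just its row index + 1.
token_embeddings = np.random.default_rng(0).normal(size=(32_000, 256))
ranks = np.arange(1, len(token_embeddings) + 1)
token_embeddings = token_embeddings * (1.0 / ranks)[:, None]
&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;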
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;https://minish.ai/images/blog/post_hf/pca_zipf.svg&quot; alt=&quot;PCAZipf&quot;&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;em&gt;Visualization of the effects of applying PCA and Zipf weighting on the embeddings.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h3 id=&quot;usage&quot;&gt;Usage&lt;/h3&gt;
&lt;p&gt;The Model2Vec library has two broad modes of usage: &lt;strong&gt;distillation&lt;/strong&gt; and &lt;strong&gt;inference&lt;/strong&gt;. In distillation mode, you can distill your own model using any Sentence Transformer (and optionally your own vocabulary). In inference mode, you can use the distilled model (or use one of our pre-distilled models) to generate embeddings for your text data at extremely high speed.&lt;/p&gt;
&lt;p&gt;There are three ways to distill a model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Output&lt;/strong&gt;: behaves much like a real sentence transformer, i.e., it uses a subword tokenizer and simply encodes all wordpieces in its vocabulary. This is really quick to create (30 seconds on a CPU), very small (30 MB in float32), but might be less performant on some tasks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vocab (word)&lt;/strong&gt;: In this mode, you can pass your own vocabulary to create representations. This allows you to create good representations for whatever in-domain data you have, and is a drop-in replacement for GloVe or word2vec.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vocab (subword)&lt;/strong&gt;: In this mode, you can pass your own vocabulary, but it also uses the subword vocabulary to create representations. This allows you to create good representations for whatever in-domain data you have.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that, while vocabulary-based models are larger in terms of RAM, all models are equally fast, because our model is independent of vocabulary size.&lt;/p&gt;
&lt;p&gt;Model2Vec embeddings can be used in a wide variety of applications, such as text classification, clustering, building a search engine, or a RAG system. They are an especially good fit for applications that require fast, lightweight embeddings with low resource requirements.&lt;/p&gt;
&lt;p&gt;As we will show next, Model2Vec is very easy to use. It can either be used as a standalone package, or used directly in &lt;a href=&quot;https://github.com/UKPLab/sentence-transformers&quot;&gt;Sentence Transformers&lt;/a&gt;. This means you can easily integrate it into any pipeline that supports Sentence Transformers (e.g. LangChain and LlamaIndex). You can also train model2vec models directly using Sentence Transformers, keeping the fast inference speed, but optimizing them directly for your use case.&lt;/p&gt;
&lt;h3 id=&quot;how-to-use-model2vec&quot;&gt;How to use Model2Vec&lt;/h3&gt;
&lt;h4 id=&quot;installation&quot;&gt;Installation&lt;/h4&gt;
&lt;p&gt;Model2Vec can be installed using pip:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;pip&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;install&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;model2vec&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;h4 id=&quot;usage-1&quot;&gt;Usage&lt;/h4&gt;
&lt;h5 id=&quot;inference&quot;&gt;Inference&lt;/h5&gt;
&lt;p&gt;The easiest way to get started with Model2Vec is to download one of our flagship models from our &lt;a href=&quot;https://huggingface.co/minishlab&quot;&gt;HuggingFace hub&lt;/a&gt;. These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; model2vec &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; StaticModel&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Load a model from the HuggingFace hub (in this case the M2V_base_output model)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model_name &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;minishlab/M2V_base_output&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; StaticModel.&lt;/span&gt;&lt;span&gt;from_pretrained&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;model_name&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Make embeddings&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;embeddings &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; model.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s dangerous to go alone!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s a secret to everybody.&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Or distill your own models and directly use them:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; model2vec &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; distill&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Choose a Sentence Transformer model&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;base_model_name &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;BAAI/bge-base-en-v1.5&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Distill an output model with the chosen dimensions&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;distill&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;model_name&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;base_model_name&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;pca_dims&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;256&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Make embeddings&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;embeddings &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; model.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;supervillain Ganondorf has invaded Hyrule!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;model.tokenizer.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;supervillain Ganondorf has invaded Hyrule!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;add_special_tokens&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;False&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt;.tokens&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# [&apos;super&apos;, &apos;##vill&apos;, &apos;##ain&apos;, &apos;gan&apos;, &apos;##ond&apos;, &apos;##orf&apos;, &apos;has&apos;, &apos;invaded&apos;, &apos;h&apos;, &apos;##yr&apos;, &apos;##ule&apos;, &apos;!&apos;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# It looks like we split Ganondorf and Hyrule up into many subtokens&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# To solve this, we can add these words to our vocabulary.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;vocabulary &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;supervillain&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;ganondorf&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;hyrule&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Distill the model with the custom vocabulary.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;distill&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;model_name&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;base_model_name&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;vocabulary&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;vocabulary&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;pca_dims&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;256&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;model.tokenizer.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;supervillain Ganondorf has invaded Hyrule!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;add_special_tokens&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;False&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt;.tokens&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# [&apos;supervillain&apos;, &apos;ganondorf&apos;, &apos;has&apos;, &apos;invaded&apos;, &apos;hyrule&apos;, &apos;!&apos;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Much better.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Model2Vec is also directly supported in &lt;a href=&quot;https://github.com/UKPLab/sentence-transformers&quot;&gt;Sentence Transformers&lt;/a&gt;. To use Model2Vec in Sentence Transformers, you can initialize a &lt;code dir=&quot;auto&quot;&gt;StaticEmbedding&lt;/code&gt; class using &lt;code dir=&quot;auto&quot;&gt;from_model2vec&lt;/code&gt;. To directly distill in Sentence Transformers, the &lt;code dir=&quot;auto&quot;&gt;StaticEmbedding&lt;/code&gt; class can be initialized using &lt;code dir=&quot;auto&quot;&gt;from_distillation&lt;/code&gt;:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; sentence_transformers &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; SentenceTransformer&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; sentence_transformers.models &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; StaticEmbedding&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Initialize a StaticEmbedding module using a pre-trained model&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;static_embedding &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; StaticEmbedding.&lt;/span&gt;&lt;span&gt;from_model2vec&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;minishlab/M2V_base_output&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;SentenceTransformer&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;modules&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;static_embedding&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;embeddings &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; model.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s dangerous to go alone!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s a secret to everybody.&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Or distill your own directly without leaving sentence-transformers&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;static_embedding &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; StaticEmbedding.&lt;/span&gt;&lt;span&gt;from_distillation&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;BAAI/bge-base-en-v1.5&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;device&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;cpu&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;pca_dims&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;256&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;SentenceTransformer&lt;/span&gt;&lt;span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;modules&lt;/span&gt;&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;static_embedding&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;embeddings &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; model.&lt;/span&gt;&lt;span&gt;encode&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s dangerous to go alone!&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;It&apos;s a secret to everybody.&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;h3 id=&quot;results&quot;&gt;Results&lt;/h3&gt;
&lt;p&gt;We evaluated Model2Vec on a large number of tasks and datasets. Model2Vec is evaluated on MTEB, as well as two additional tasks: &lt;a href=&quot;https://arxiv.org/pdf/2401.10407&quot;&gt;PEARL&lt;/a&gt; (a phrase representation task) and WordSim (a collection of word similarity tasks). The results are shown in the table below.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Model&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Avg (All)&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Avg (MTEB)&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Class&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Clust&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;PairClass&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Rank&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Ret&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;STS&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Sum&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Pearl&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;WordSim&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;all-MiniLM-L6-v2&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;56.08&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;56.09&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;62.62&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;41.94&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;82.37&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;58.04&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;41.95&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;78.90&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;30.81&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;60.83&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;49.91&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove_subword&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;49.06&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;46.69&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;61.27&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;30.03&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;74.71&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;49.15&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;27.16&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;69.09&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;30.08&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;56.82&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;57.99&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;48.58&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;47.60&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;61.35&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;30.52&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;75.34&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;48.50&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;29.26&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;70.31&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;31.50&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;50.28&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;54.29&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;46.79&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;45.34&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;61.25&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;25.58&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;74.90&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;47.63&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;26.14&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;68.58&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;29.20&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;54.02&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;49.18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;GloVe_300d&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;42.84&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;42.36&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;57.31&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;27.66&lt;/td&gt;&lt;td 
align=&quot;center&quot;&gt;72.48&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;43.30&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;22.78&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;61.90&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;28.81&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;45.65&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;43.05&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;BPEmb_50k_300d&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;39.34&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;37.78&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;55.76&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;23.35&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;57.86&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;43.21&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;17.50&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;55.10&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;29.74&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;47.56&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;41.28&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;As can be seen, Model2Vec significantly outperforms GloVe and BPEmb on all tasks, and even outperforms MiniLM, which is a much slower model, on some tasks.&lt;/p&gt;
&lt;p&gt;In addition, we evaluated Model2Vec on a number of classification datasets that are not in MTEB, which we also used to benchmark the model’s speed. The results are shown in the table below.&lt;/p&gt;
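&lt;p&gt;For context, a throughput number like “sentences per second” can be measured with just a few lines of Python. The sketch below is illustrative rather than our exact benchmark harness: the model name is one of our published models, but the corpus and timing setup are stand-ins.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rough throughput check: encode a batch of sentences and report sentences/second.
# Illustrative sketch only; not the exact harness behind the table below.
import time

from model2vec import StaticModel

model = StaticModel.from_pretrained(&quot;minishlab/M2V_base_output&quot;)
sentences = [&quot;This movie was great!&quot;] * 10_000  # stand-in corpus

start = time.perf_counter()
model.encode(sentences)
elapsed = time.perf_counter() - start
print(f&quot;{len(sentences) / elapsed:.0f} sentences/second&quot;)
&lt;/code&gt;&lt;/pre&gt;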
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Model&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;Average&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;SST2&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;IMDB&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;TREC&lt;/th&gt;&lt;th align=&quot;center&quot;&gt;AG News&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;bge-base-en-v1.5&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;90.00&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;91.54&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;91.88&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;85.16&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;91.45&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;all-MiniLM-L6-v2&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;84.10&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;83.95&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;81.36&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;81.31&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;89.77&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;82.23&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;80.92&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;84.56&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;75.27&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;88.17&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove_subword&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;81.95&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;82.84&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;85.96&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;70.51&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;88.49&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;BPEmb_50k_300d&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;81.15&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;80.42&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;84.04&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;71.25&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;88.92&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_glove&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;80.76&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;83.07&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;85.24&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;66.12&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;88.61&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;GloVe_300d&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;77.77&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;81.68&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;84.00&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;55.67&lt;/td&gt;&lt;td align=&quot;center&quot;&gt;89.71&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Again, Model2Vec outperforms GloVe and BPEmb on all tasks, and even performs comparably to MiniLM.&lt;/p&gt;
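&lt;p&gt;Evaluations like these boil down to using the embeddings as features for a linear probe. As a minimal sketch (the scikit-learn classifier and toy data here are illustrative assumptions, not our exact evaluation setup):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Fit a linear classifier on top of static sentence embeddings.
# Toy data and classifier choice are illustrative, not our exact setup.
from model2vec import StaticModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = StaticModel.from_pretrained(&quot;minishlab/M2V_base_output&quot;)

train_texts = [&quot;a gripping, well-acted thriller&quot;, &quot;dull and lifeless&quot;]
train_labels = [1, 0]  # 1 = positive, 0 = negative
test_texts = [&quot;surprisingly fun to watch&quot;]
test_labels = [1]

clf = LogisticRegression().fit(model.encode(train_texts), train_labels)
predictions = clf.predict(model.encode(test_texts))
print(accuracy_score(test_labels, predictions))
&lt;/code&gt;&lt;/pre&gt;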
&lt;p&gt;The figure below shows the relationship between the number of sentences processed per second and the average classification score. The circle sizes correspond to the number of parameters in the models (larger = more parameters). The plot shows that the Model2Vec models are much faster than the other models while remaining competitive with all-MiniLM-L6-v2 in classification performance.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;https://minish.ai/images/blog/post_hf/speed_vs_accuracy.png&quot; alt=&quot;SpeedvsAccuracy&quot;&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;em&gt;The average accuracy over all classification datasets plotted against sentences per second. The circle size indicates model size.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h4 id=&quot;ablations&quot;&gt;Ablations&lt;/h4&gt;
&lt;p&gt;To better understand the factors contributing to Model2Vec’s performance, we conducted a set of ablation studies covering the model’s architecture and preprocessing. We examined the impact of PCA, Zipf weighting, and the use of Sentence Transformers versus regular transformer models. We also compared input embeddings against output embeddings, since it seems plausible that input embeddings should work well too. The results are shown in the table below; a short code sketch of the corresponding distillation options follows the findings.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Model&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Avg (All)&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Avg (MTEB)&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Class&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Clust&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;PairClass&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Rank&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Ret&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;STS&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Sum&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;Pearl&lt;/th&gt;&lt;th align=&quot;right&quot;&gt;WordSim&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.79&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.34&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.25&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;74.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;47.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.14&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.58&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.2&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.02&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output_nopca&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;44.04&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;42.31&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.42&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;20.15&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.21&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;44.67&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.25&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;61.87&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.85&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;51.02&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.96&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output_nozipf&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.61&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;41.52&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;60.44&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;21.62&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;72.15&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;45.57&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;20.35&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;62.71&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.66&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;52.28&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.17&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_input_nozipf_nopca&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;40.97&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;39.55&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;54.16&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;18.62&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;68.3&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.65&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;23.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;59.38&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;32.04&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.19&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;40.52&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_output_nozipf_nopca&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;40.8&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;38.44&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;59.78&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;19.31&lt;/td&gt;&lt;td 
align=&quot;right&quot;&gt;62.39&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;42.26&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;19.01&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.16&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;49.09&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;48.97&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_base_input&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;40.74&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;39.93&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;60.35&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;22.66&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;59.63&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;43.02&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;25.47&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.05&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29.35&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;50.61&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;34.47&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;M2V_bert_output_nozipf_nopca&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;35.54&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;34.82&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.69&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;15.42&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;58.68&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;39.87&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;12.92&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;55.24&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;30.15&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;46.9&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26.72&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;There are four main findings in these results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Non-Sentence Transformers do not work well. This can be seen by comparing &lt;code dir=&quot;auto&quot;&gt;M2V_bert_output_nozipf_nopca&lt;/code&gt; (which uses &lt;a href=&quot;https://huggingface.co/google-bert/bert-base-uncased&quot;&gt;BERT&lt;/a&gt;, a non-Sentence Transformer) and &lt;code dir=&quot;auto&quot;&gt;M2V_base_output_nozipf_nopca&lt;/code&gt; (which uses &lt;a href=&quot;https://huggingface.co/BAAI/bge-base-en-v1.5&quot;&gt;BGE-base&lt;/a&gt;, a Sentence Transformer). Using a Sentence Transformer gives a ~5.2% increase in performance.&lt;/li&gt;
&lt;li&gt;PCA is crucial for performance. This can be seen by comparing &lt;code dir=&quot;auto&quot;&gt;M2V_base_output_nozipf_nopca&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;M2V_base_output_nozipf&lt;/code&gt; which gives a ~2.8% increase in performance. Furthermore, PCA improves performance on &lt;em&gt;all&lt;/em&gt; tasks.&lt;/li&gt;
&lt;li&gt;Zipf weighting is crucial for performance. This can be seen by comparing &lt;code dir=&quot;auto&quot;&gt;M2V_base_output_nozipf_nopca&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;M2V_base_output_nopca&lt;/code&gt; which gives a ~3.1% increase in performance.&lt;/li&gt;
&lt;li&gt;Output embeddings outperform input embeddings. This can be seen by comparing &lt;code dir=&quot;auto&quot;&gt;M2V_base_input&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;M2V_base_output&lt;/code&gt; which gives a ~6.1% increase in performance. Note that input embeddings do work well for some tasks. We hypothesize that this is because input embeddings are inherently normalized.&lt;/li&gt;
&lt;/ol&gt;
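&lt;p&gt;For concreteness, the ablation axes above map onto distillation options. The sketch below assumes the &lt;code dir=&quot;auto&quot;&gt;distill&lt;/code&gt; API with &lt;code dir=&quot;auto&quot;&gt;pca_dims&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;apply_zipf&lt;/code&gt; parameters as of this writing; check the repository for the current signature. Swapping the teacher (e.g. BERT instead of BGE-base) covers the Sentence-Transformer comparison.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Reproducing the ablation axes via distillation options (a sketch; parameter
# names are assumed from the distill API at the time of writing).
from model2vec.distill import distill

teacher = &quot;BAAI/bge-base-en-v1.5&quot;

# Baseline: PCA to 256 dimensions plus Zipf weighting.
m2v_base = distill(model_name=teacher, pca_dims=256, apply_zipf=True)

# &quot;nopca&quot;: keep the teacher's full output dimensionality.
m2v_nopca = distill(model_name=teacher, pca_dims=None, apply_zipf=True)

# &quot;nozipf&quot;: skip the frequency-based re-weighting of token vectors.
m2v_nozipf = distill(model_name=teacher, pca_dims=256, apply_zipf=False)

m2v_base.save_pretrained(&quot;m2v_base_output&quot;)
&lt;/code&gt;&lt;/pre&gt;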
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Thanks for reading our blog post on Model2Vec! We hope you found it informative and useful. If you have any questions or comments, please feel free to reach out to us. We are still actively working on the project, and have a number of features already planned, so stay tuned.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/minishlab&quot;&gt;HuggingFace Org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://huggingface.co/collections/minishlab/model2vec-base-models-66fd9dd9b7c3b3c0f25ca90e&quot;&gt;HuggingFace Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.linkedin.com/company/minish-lab&quot;&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/MinishLab/model2vec/tree/main/tutorials&quot;&gt;Tutorials&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;citing&quot;&gt;Citing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;@software{minishlab2024model2vec,
  author = {Stephan Tulkens and Thomas van Dongen},
  title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec},
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h3&gt;
&lt;p&gt;We’d like to thank &lt;a href=&quot;https://huggingface.co/tomaarsen&quot;&gt;Tom Aarsen&lt;/a&gt; for integrating Model2Vec into &lt;a href=&quot;https://github.com/UKPLab/sentence-transformers&quot;&gt;Sentence Transformers&lt;/a&gt; and helping us with our &lt;a href=&quot;https://huggingface.co/minishlab&quot;&gt;HuggingFace&lt;/a&gt; integration, as well as his general feedback on the project.&lt;/p&gt;</content:encoded></item></channel></rss>