This section presents the results of Model2Vec on MTEB, along with our training results. The `potion` and `M2V` models are our static models; the remaining models are included for comparison.
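As context for the numbers below, this is roughly how a static Model2Vec model is used. A minimal sketch, assuming the `model2vec` package is installed and the `minishlab/potion-base-8M` checkpoint is available on the Hugging Face Hub:

```python
# Minimal usage sketch (assumes the model2vec package and the
# minishlab/potion-base-8M checkpoint from the Hugging Face Hub).
from model2vec import StaticModel

# Load a pre-distilled static embedding model.
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Encoding is an embedding lookup plus pooling, with no transformer
# forward pass, which is what makes these models fast.
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
print(embeddings.shape)  # (2, embedding_dim)
```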
Model | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank | Ret | STS | Sum | Pearl | WordSim |
---|---|---|---|---|---|---|---|---|---|---|---|
all-MiniLM-L6-v2 | 56.08 | 56.09 | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 60.83 | 49.91 |
potion-base-32M | 52.46 | 51.66 | 65.97 | 35.29 | 78.17 | 50.92 | 33.52 | 74.22 | 29.78 | 55.37 | 55.15 |
potion-base-8M | 50.54 | 50.03 | 64.44 | 32.93 | 76.62 | 49.73 | 31.71 | 73.24 | 29.28 | 53.54 | 50.75 |
potion-retrieval-32M | 49.73 | 49.76 | 59.56 | 30.55 | 76.38 | 50.05 | 36.35 | 73.22 | 28.85 | 49.31 | 50.02 |
potion-base-4M | 48.87 | 48.23 | 62.19 | 31.47 | 75.37 | 48.75 | 29.11 | 72.19 | 28.89 | 52.55 | 49.21 |
static-retrieval-mrl-en-v1 | 48.18 | 48.36 | 57.39 | 28.32 | 75.63 | 49.16 | 35.61 | 72.18 | 28.64 | 49.68 | 44.76 |
static-similarity-mrl-multilingual-v1 | 48.15 | 47.15 | 59.96 | 24.40 | 79.02 | 48.25 | 29.54 | 74.88 | 30.28 | 51.66 | 51.66 |
M2V_base_output | 46.79 | 45.34 | 61.25 | 25.58 | 74.90 | 47.63 | 26.14 | 68.58 | 29.20 | 54.02 | 49.18 |
potion-base-2M | 45.52 | 44.77 | 58.45 | 27.50 | 73.72 | 46.82 | 24.13 | 70.14 | 31.51 | 50.82 | 44.72 |
GloVe_300d | 42.84 | 42.36 | 57.31 | 27.66 | 72.48 | 43.30 | 22.78 | 61.90 | 28.81 | 45.65 | 43.05 |
BPEmb_50k_300d | 39.34 | 37.78 | 55.76 | 23.35 | 57.86 | 43.21 | 17.50 | 55.10 | 29.74 | 47.56 | 41.28 |
Inference speed was benchmarked using the models' `encode` method; the figure below plots the average MTEB score against the number of sentences encoded per second.
Figure: The average MTEB score plotted against sentences per second. The circle size indicates model size.
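The exact benchmark harness behind this figure is not reproduced here, but the sentences-per-second axis can be approximated with a simple timing loop. A sketch, assuming the same `model2vec` setup as above and a toy corpus (absolute numbers depend heavily on hardware):

```python
# Rough throughput sketch (toy corpus; results vary by machine).
import time

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-base-8M")
sentences = ["The quick brown fox jumps over the lazy dog."] * 10_000

start = time.perf_counter()
model.encode(sentences)
elapsed = time.perf_counter() - start
print(f"{len(sentences) / elapsed:,.0f} sentences / second")
```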
The table below shows the results for our multilingual model, potion-multilingual-128M, on MMTEB.

Model | Mean (Task) | Mean (TaskType) | BitMining | Class | Clust | InstRet | MultiClass | PairClass | Rank | Ret | STS |
---|---|---|---|---|---|---|---|---|---|---|---|
LaBSE | 52.07 | 45.65 | 76.35 | 54.60 | 38.08 | −3.00 | 20.12 | 75.97 | 50.20 | 33.17 | 65.35 |
potion-multilingual-128M | 47.31 | 40.40 | 40.72 | 52.36 | 38.80 | −2.08 | 15.95 | 71.39 | 47.39 | 37.86 | 61.23 |
static-similarity-mrl-multilingual-v1 | 47.24 | 41.38 | 50.62 | 48.60 | 30.67 | −1.24 | 14.74 | 74.34 | 49.45 | 41.21 | 64.02 |
M2V_multilingual_output | 42.13 | 35.89 | 36.88 | 49.75 | 30.09 | −0.07 | 14.34 | 69.74 | 41.51 | 25.42 | 55.33 |
The table below isolates retrieval performance (the Ret column from the first table) for the best-performing models.

Model | Retrieval Score |
---|---|
all-MiniLM-L6-v2 | 41.95 |
potion-retrieval-32M | 36.35 |
static-retrieval-mrl-en-v1 | 35.61 |
potion-base-32M | 33.52 |
potion-base-8M | 31.71 |
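Since potion-retrieval-32M is tuned for retrieval, the natural usage pattern is nearest-neighbor search over document embeddings. A minimal sketch with toy documents, assuming numpy and plain cosine similarity (this is an illustration, not the MTEB evaluation harness):

```python
# Toy retrieval sketch: rank documents by cosine similarity to a query.
import numpy as np

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-retrieval-32M")

docs = [
    "Paris is the capital of France.",
    "Mitochondria produce most of the cell's ATP.",
]
doc_emb = model.encode(docs)                                  # shape (n_docs, dim)
query_emb = model.encode(["What is the capital of France?"])[0]  # shape (dim,)

# Cosine similarity between the query and every document.
sims = doc_emb @ query_emb / (
    np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(query_emb)
)
print(docs[int(np.argmax(sims))])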
For the training results, we compare the following setups on a collection of classification datasets (a minimal sketch of the model2vec setup follows this list):

- `model2vec + logreg`: A model2vec model with a scikit-learn `LogisticRegressionCV` on top.
- `model2vec full finetune`: A model2vec classifier with the full model finetuned. This uses our `StaticModelForClassification`.
- `tfidf`: A TF-IDF model with a scikit-learn `LogisticRegressionCV` on top.
- `setfit`: A SetFit model trained using all-minilm-l6-v2 as a base model.
- `bge-base + logreg`: A BGE-base encoder model with a scikit-learn `LogisticRegressionCV` on top.
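As a rough illustration of the `model2vec + logreg` setup, here is a sketch on toy data (the `texts`/`labels` lists are hypothetical stand-ins; the actual experiments use the datasets listed below):

```python
# Sketch of the "model2vec + logreg" setup on toy data.
from model2vec import StaticModel
from sklearn.linear_model import LogisticRegressionCV

texts = ["loved it", "great plot", "terrible film", "so boring"]  # toy data
labels = [1, 1, 0, 0]

encoder = StaticModel.from_pretrained("minishlab/potion-base-32M")
X = encoder.encode(texts)  # static embeddings as classifier features

# cv=2 only because the toy set is tiny; use the default on real data.
clf = LogisticRegressionCV(cv=2, max_iter=1000).fit(X, labels)
print(clf.predict(encoder.encode(["what a great movie"])))
```

The `model2vec full finetune` setup instead goes through `StaticModelForClassification` (from `model2vec.train`), which trains the embeddings together with the classification head rather than keeping them frozen.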
dataset | tfidf | model2vec + logreg | model2vec full finetune | setfit | bge-base + logreg |
---|---|---|---|---|---|
20_newgroups | 50.71 | 56.24 | 57.94 | 61.29 | 67.39 |
ade | 71.46 | 79.20 | 79.68 | 83.05 | 86.12 |
ag_news | 81.68 | 86.70 | 87.20 | 88.01 | 88.95 |
amazon_counterfactual | 85.18 | 90.96 | 91.93 | 95.51 | 92.74 |
bbc | 95.09 | 95.80 | 97.21 | 96.60 | 97.50 |
emotion | 59.28 | 65.57 | 67.11 | 72.86 | 65.63 |
enron_spam | 96.00 | 96.40 | 96.85 | 97.45 | 97.30 |
hatespeech_offensive | 66.45 | 83.54 | 85.61 | 87.69 | 84.92 |
imdb | 80.44 | 85.34 | 85.59 | 86.00 | 92.25 |
massive_scenario | 77.26 | 82.86 | 84.42 | 83.54 | 87.07 |
senteval_cr | 65.61 | 77.03 | 79.47 | 86.15 | 90.53 |
sst5 | 18.52 | 32.34 | 37.95 | 42.31 | 38.49 |
student | 74.16 | 83.20 | 85.02 | 89.62 | 89.71 |
subj | 86.39 | 89.20 | 89.85 | 93.80 | 94.55 |
tweet_sentiment_extraction | 53.20 | 64.96 | 62.65 | 75.15 | 69.48 |
| tfidf | model2vec + logreg | model2vec full finetune | setfit | bge-base + logreg |
---|---|---|---|---|---|
average | 70.8 | 78.0 | 79.2 | 82.6 | 82.8 |
Based on these results, our recommendation is to use `model2vec + logreg` with potion-base-32m, and to use full fine-tuning if you are starting from another base model.
The speed difference between model2vec and the other models is immense: the full finetune is roughly 35x faster than a SetFit model based on all-minilm-l6-v2 on CPU, and roughly 200x faster than the bge-base transformer model.
| tfidf | model2vec + logreg | model2vec full finetune | setfit | bge-base + logreg |
---|---|---|---|---|---|
samples / second | 108434 | 17925 | 24744 | 716 | 118 |
Figure: The average training score plotted against sentences per second (log scale).
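The samples-per-second numbers above can be approximated the same way as the encoding benchmark earlier: time the full predict path of each trained classifier over a list of texts. A small sketch, reusing the hypothetical `clf` and `encoder` from the classification sketch above:

```python
# Rough sketch: inference throughput of a trained classifier pipeline.
import time

def samples_per_second(predict, texts):
    """Time a full predict path over `texts` and return throughput."""
    start = time.perf_counter()
    predict(texts)
    return len(texts) / (time.perf_counter() - start)

# e.g., for the model2vec + logreg pipeline from the earlier sketch:
# print(samples_per_second(lambda t: clf.predict(encoder.encode(t)), texts * 1000))
```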
To measure the effect of our design choices, we ran a set of ablations. The results are shown in the table below.

Model | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank | Ret | STS | Sum | Pearl | WordSim |
---|---|---|---|---|---|---|---|---|---|---|---|
M2V_base_output | 46.79 | 45.34 | 61.25 | 25.58 | 74.90 | 47.63 | 26.14 | 68.58 | 29.20 | 54.02 | 49.18 |
M2V_base_output_nopca | 44.04 | 42.31 | 61.42 | 20.15 | 68.21 | 44.67 | 25.25 | 61.87 | 29.85 | 51.02 | 48.96 |
M2V_base_output_nozipf | 43.61 | 41.52 | 60.44 | 21.62 | 72.15 | 45.57 | 20.35 | 62.71 | 30.66 | 52.28 | 49.17 |
M2V_base_input_nozipf_nopca | 40.97 | 39.55 | 54.16 | 18.62 | 68.30 | 43.65 | 23.63 | 59.38 | 32.04 | 50.19 | 40.52 |
M2V_base_output_nozipf_nopca | 40.80 | 38.44 | 59.78 | 19.31 | 62.39 | 42.26 | 19.01 | 55.16 | 30.00 | 49.09 | 48.97 |
M2V_base_input | 40.74 | 39.93 | 60.35 | 22.66 | 59.63 | 43.02 | 25.47 | 50.05 | 29.35 | 50.61 | 34.47 |
M2V_bert_output_nozipf_nopca | 35.54 | 34.82 | 55.69 | 15.42 | 58.68 | 39.87 | 12.92 | 55.24 | 30.15 | 46.90 | 26.72 |
Our main findings are as follows (a sketch of the corresponding distillation options follows this list):

- The biggest gain comes from using a Sentence Transformer as the base model, as can be seen by comparing `M2V_bert_output_nozipf_nopca` (which uses BERT, a non-Sentence Transformer) and `M2V_base_output_nozipf_nopca` (which uses BGE-base, a Sentence Transformer). Using a Sentence Transformer gives a ~5.2% increase in performance.
- PCA is important for performance, as can be seen by comparing `M2V_base_output_nozipf_nopca` and `M2V_base_output_nozipf`, which gives a ~2.8% increase in performance. Furthermore, PCA improves performance on all tasks.
- Zipf weighting is important for performance, as can be seen by comparing `M2V_base_output_nozipf_nopca` and `M2V_base_output_nopca`, which gives a ~3.1% increase in performance.
- Output embeddings outperform input embeddings, as can be seen by comparing `M2V_base_input` and `M2V_base_output`, which gives a ~6.1% increase in performance. Note that input embeddings do work well for some tasks. We hypothesize that this is because input embeddings are inherently normalized.