Visualization of the Model2Vec architecture.
For example, the word *astoundingly* gets chopped up into four unique tokens: 'as', '##tou', '##nding', '##ly', which means that we re-compute the attention between those four tokens each time this word occurs. But the meaning of this word might not be ambiguous at all!
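You can check this kind of split yourself with any WordPiece tokenizer. A quick sketch (using `bert-base-uncased` as a stand-in; the exact split varies by vocabulary):

```python
from transformers import AutoTokenizer

# Stand-in tokenizer; the exact subword split depends on the vocabulary used.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("astoundingly"))
# -> something like ['as', '##tou', '##nding', '##ly']
```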
However, as we started implementing this, we noticed that you actually do not need to cache any words at all: you can just use the output representations of individual tokens to get good sentence representations. This is exactly the basic mode of operation of Model2Vec: for each of the 32k input tokens in a sentence transformer's vocabulary, we do a forward pass and store the resulting embedding. For a new sentence, we then simply take the mean of the token embeddings we computed.
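As a rough illustration, a minimal version of this procedure could look as follows. This is a sketch, not the actual Model2Vec implementation: the model name and the choice to wrap each token in [CLS]/[SEP] are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "BAAI/bge-base-en-v1.5"  # example model, not prescriptive
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# One forward pass per vocabulary token; store its output representation.
token_embeddings = []
with torch.no_grad():
    for token_id in range(tokenizer.vocab_size):
        input_ids = torch.tensor(
            [[tokenizer.cls_token_id, token_id, tokenizer.sep_token_id]]
        )
        hidden = model(input_ids=input_ids).last_hidden_state
        token_embeddings.append(hidden[0, 1])  # position 1 = the token itself
token_embeddings = torch.stack(token_embeddings)  # (vocab_size, hidden_dim)

def encode(sentence: str) -> torch.Tensor:
    """A sentence embedding is the mean of its (static) token embeddings."""
    ids = tokenizer(sentence, add_special_tokens=False)["input_ids"]
    return token_embeddings[ids].mean(dim=0)
```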
Note that the output token representations of a Model2Vec model are uncontextualized. Unlike with normal transformer models, there is no way for the model to assign different meanings to the same token in different contexts. While this might seem like a huge downside, we think that in practice the actual context of a sentence provides enough disambiguation, since the embeddings of the surrounding tokens also contribute to the mean.
In addition to this trick, we show that two additional tricks are necessary to get optimal performance: PCA and Zipf weighting.
Visualization of the effects of applying PCA and Zipf weighting on the embeddings.
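Concretely, both steps are simple transformations of the distilled embedding matrix. The sketch below shows one plausible form, under two assumptions that are ours rather than the source's: 256 PCA dimensions, and token id as a proxy for frequency rank.

```python
import numpy as np
from sklearn.decomposition import PCA

# `token_embeddings` is the (vocab_size, hidden_dim) matrix from above.
vectors = token_embeddings.numpy()

# Trick 1: PCA reduces the dimensionality of the embedding space.
vectors = PCA(n_components=256).fit_transform(vectors)

# Trick 2: Zipf weighting. Assuming token id approximates frequency rank,
# log(1 + rank) is small for frequent tokens and large for rare ones,
# so frequent tokens are down-weighted.
ranks = np.arange(1, vectors.shape[0] + 1)
vectors *= np.log(1 + ranks)[:, None]
```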
A distilled Model2Vec model can be loaded directly in Sentence Transformers via the `StaticEmbedding` class, using `from_model2vec`. To distill directly within Sentence Transformers, the `StaticEmbedding` class can instead be initialized using `from_distillation`.
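A minimal sketch of both entry points (this assumes a recent sentence-transformers release that includes `StaticEmbedding`; the model names are illustrative):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Load a pre-distilled Model2Vec model from the Hugging Face Hub ...
static = StaticEmbedding.from_model2vec("minishlab/M2V_base_output")

# ... or distill a new static model from a Sentence Transformer on the fly.
static = StaticEmbedding.from_distillation(
    "BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256
)

model = SentenceTransformer(modules=[static])
embeddings = model.encode(["It's so sunny outside!"])
```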
Model | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank | Ret | STS | Sum | Pearl | WordSim |
---|---|---|---|---|---|---|---|---|---|---|---|
all-MiniLM-L6-v2 | 56.08 | 56.09 | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 60.83 | 49.91 |
M2V_base_glove_subword | 49.06 | 46.69 | 61.27 | 30.03 | 74.71 | 49.15 | 27.16 | 69.09 | 30.08 | 56.82 | 57.99 |
M2V_base_glove | 48.58 | 47.60 | 61.35 | 30.52 | 75.34 | 48.50 | 29.26 | 70.31 | 31.50 | 50.28 | 54.29 |
M2V_base_output | 46.79 | 45.34 | 61.25 | 25.58 | 74.90 | 47.63 | 26.14 | 68.58 | 29.20 | 54.02 | 49.18 |
GloVe_300d | 42.84 | 42.36 | 57.31 | 27.66 | 72.48 | 43.30 | 22.78 | 61.90 | 28.81 | 45.65 | 43.05 |
BPEmb_50k_300d | 39.34 | 37.78 | 55.76 | 23.35 | 57.86 | 43.21 | 17.50 | 55.10 | 29.74 | 47.56 | 41.28 |
Model | Average | SST2 | IMDB | TREC | AG News |
---|---|---|---|---|---|
bge-base-en-v1.5 | 90.00 | 91.54 | 91.88 | 85.16 | 91.45 |
all-MiniLM-L6-v2 | 84.10 | 83.95 | 81.36 | 81.31 | 89.77 |
M2V_base_output | 82.23 | 80.92 | 84.56 | 75.27 | 88.17 |
M2V_base_glove_subword | 81.95 | 82.84 | 85.96 | 70.51 | 88.49 |
BPEmb_50k_300d | 81.15 | 80.42 | 84.04 | 71.25 | 88.92 |
M2V_base_glove | 80.76 | 83.07 | 85.24 | 66.12 | 88.61 |
GloVe_300d | 77.77 | 81.68 | 84.00 | 55.67 | 89.71 |
The average accuracy over all classification datasets plotted against sentences per second. The circle size indicates model size.
Model | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank | Ret | STS | Sum | Pearl | WordSim |
---|---|---|---|---|---|---|---|---|---|---|---|
M2V_base_output | 46.79 | 45.34 | 61.25 | 25.58 | 74.90 | 47.63 | 26.14 | 68.58 | 29.20 | 54.02 | 49.18 |
M2V_base_output_nopca | 44.04 | 42.31 | 61.42 | 20.15 | 68.21 | 44.67 | 25.25 | 61.87 | 29.85 | 51.02 | 48.96 |
M2V_base_output_nozipf | 43.61 | 41.52 | 60.44 | 21.62 | 72.15 | 45.57 | 20.35 | 62.71 | 30.66 | 52.28 | 49.17 |
M2V_base_input_nozipf_nopca | 40.97 | 39.55 | 54.16 | 18.62 | 68.30 | 43.65 | 23.63 | 59.38 | 32.04 | 50.19 | 40.52 |
M2V_base_output_nozipf_nopca | 40.80 | 38.44 | 59.78 | 19.31 | 62.39 | 42.26 | 19.01 | 55.16 | 30.00 | 49.09 | 48.97 |
M2V_base_input | 40.74 | 39.93 | 60.35 | 22.66 | 59.63 | 43.02 | 25.47 | 50.05 | 29.35 | 50.61 | 34.47 |
M2V_bert_output_nozipf_nopca | 35.54 | 34.82 | 55.69 | 15.42 | 58.68 | 39.87 | 12.92 | 55.24 | 30.15 | 46.90 | 26.72 |
This ablation study shows a number of interesting results:

- **Distilling from a Sentence Transformer matters:** compare `M2V_bert_output_nozipf_nopca` (which uses BERT, a non-Sentence Transformer) and `M2V_base_output_nozipf_nopca` (which uses BGE-base, a Sentence Transformer). Using a Sentence Transformer gives a ~5.2% increase in performance.
- **PCA is crucial:** compare `M2V_base_output_nozipf_nopca` and `M2V_base_output_nozipf`, which shows a ~2.8% increase in performance. Furthermore, PCA improves performance on all tasks.
- **Zipf weighting is crucial:** compare `M2V_base_output_nozipf_nopca` and `M2V_base_output_nopca`, which shows a ~3.1% increase in performance.
- **Output embeddings outperform input embeddings:** compare `M2V_base_input` and `M2V_base_output`, which shows a ~6.1% increase in performance. Note that input embeddings do work well for some tasks; we hypothesize that this is because input embeddings are inherently normalized.