gte-modernbert-base and nomic-ai/modernbert-embed-base outperform bge-base-en-v1.5 on the MTEB leaderboard, so we expected a distilled model to also perform better.
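For reference, this is roughly how such a distill is produced (a minimal sketch using model2vec's distill function; the checkpoint name and output path are just illustrative):

```python
from model2vec.distill import distill

# Distill a static model2vec model from a sentence transformer checkpoint.
# The checkpoint name is illustrative; any Hugging Face model id works the same way.
m2v_model = distill(model_name="Alibaba-NLP/gte-modernbert-base", pca_dims=256)

# Save the distilled model so it can be evaluated later.
m2v_model.save_pretrained("gte-modernbert-base-distilled")
```

The results for the three distilled models: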
| Model | STS | WordSim | Classification |
|---|---|---|---|
| bge-base-en-v1.5 | 69.3 | 49.7 | 62.4 |
| gte-modernbert-base | 66.5 (-2.8) | 25.6 (-24.1) | 60.4 (-2.0) |
| modernbert-embed-base | 65.1 (-4.2) | 26.1 (-23.6) | 59.4 (-3.0) |
bge-base-en-v1.5 outperforms both ModernBERT-based distills on all tasks, and by a huuuuuge margin on WordSim. Luckily for us, the WordSim task gives us a good hint as to why this is the case.
In WordSim tasks, pairs of words are rated for similarity by human annotators: if apple and pear are judged to be similar by humans, your model must give them a high cosine similarity in order to score high on this task.
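As a concrete sketch of what such a check looks like with a model2vec model (assuming the model2vec package and the minishlab/potion-base-8M checkpoint; the word pair is just an example):

```python
import numpy as np
from model2vec import StaticModel

# Load a pre-distilled model2vec model (checkpoint name is an assumption).
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Embed a word pair that humans judge to be similar.
apple, pear = model.encode(["apple", "pear"])

# A good static model should give this pair a high cosine similarity.
cosine = np.dot(apple, pear) / (np.linalg.norm(apple) * np.linalg.norm(pear))
print(f"cosine(apple, pear) = {cosine:.3f}")
```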
This task is interesting to us because it gives us an estimate of how good a model2vec model is at modeling lexical similarity without having access to any context. We also see that, for model2vec models, performing well on WordSim correlates with performance on other tasks.
What is interesting about ModernBERT's performance on WordSim is that it is atrociously low, lower than for any model we've seen before, and that it does not seem to correlate at all with its performance on the other tasks, where it scores lower than bge-base-en-v1.5, but not atrociously so.
But why could this be the case, and why would it hold for both models? Because it seems to hurt both models equally, it looks like something in the base model is to blame.
In our view, the answer is likely the tokenizer used in ModernBERT. Unlike the traditional BERT tokenizer, which is used in a lot of embedders, ModernBERT's tokenizer is a byte-pair encoding (BPE) tokenizer. To see what this means, let's take a look at five random BPE tokens from ModernBERT's tokenizer:
ercul
You can't just look at tokens like these in isolation and expect something useful. In contrast, here are five tokens from the WordPiece-based BERT tokenizer:
The WordPiece tokenizer has tokens that are more easily interpreted as words. Because BPE tokens are less likely to be words or naturally occurring suffixes, the model likely has to perform more operations to contextualize words, making it a bad fit for uncontextualized embeddings, such as model2vec models.
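You can reproduce this kind of comparison yourself by sampling random tokens from both vocabularies. A minimal sketch, assuming the transformers library and the public answerdotai/ModernBERT-base and bert-base-uncased checkpoints:

```python
import random
from transformers import AutoTokenizer

# Checkpoint names are assumptions; any BPE vs. WordPiece pair shows the same effect.
bpe_tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")  # BPE
wp_tok = AutoTokenizer.from_pretrained("bert-base-uncased")             # WordPiece

for name, tok in [("ModernBERT (BPE)", bpe_tok), ("BERT (WordPiece)", wp_tok)]:
    # Print five random entries from each vocabulary.
    print(name, random.sample(sorted(tok.get_vocab()), 5))
```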
In addition, the tokenizer used in ModernBERT is a cased tokenizer, which means that it contains both upper- and lowercase versions of many tokens. But, again, without any contextual cues, there is very little difference in meaning between the upper- and lowercase variants.
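A quick way to see this is to tokenize the same word in both cases (again a sketch assuming the transformers library and the answerdotai/ModernBERT-base checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")  # cased BPE tokenizer

# The cased tokenizer maps the two variants to different token ids,
# even though, without context, they mean essentially the same thing.
print(tok.tokenize("Apple"), tok("Apple")["input_ids"])
print(tok.tokenize("apple"), tok("apple")["input_ids"])
```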
We think that both of these factors combined, but especially the BPE tokens, lead to the low performance of the distilled models. The fact that both of the ModernBERT-based models suffer from the same issue shows that the issue is likely caused by the base model, and not the specific fine-tuning strategy used.
We suspect that adding more word-like tokens to the vocabulary, especially for tasks like WordSim, would improve performance.
So one thing on our roadmap, but a very low priority one, is to add support for token addition and/or removal to model2vec. If you have an idea on how to do it, please let us know!