Improvements
Here are the improvements, in order of their appearance. In the last section, we’ll contrast all of them and show their impact on MTEB performance. For all experiments, we distill `baai/bge-base-en-v1.5` using the default parameters.
Basic
As a reference, the basic operations we apply when distilling are the following (a code sketch of this baseline follows below):

- Token selection: we propagate all individual tokens through the model together with an EOS and BOS token, and then select the middle token as the representation.
- PCA: we apply PCA with a specific number of dimensions (256 for all models).
- Zipf: we weight all individual tokens by estimating their frequency using Zipf’s law. The short of it is that we assume all tokens in the vocabulary are in rank order, and that they follow a power law distribution.
Besides these basics, we also experimented with the following alternatives, none of which made it into the final recipe:

- Replacing PCA: we tried ICA, UMAP, and t-SNE. All worked a lot worse.
- Using different propagation strategies: we tried not including BOS/EOS, either only BOS or only EOS, and pooling over the BOS token (i.e., `[CLS]` pooling).
- Using different weighting strategies, including TF-IDF.
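Putting the basic recipe together, here is a minimal sketch of what distillation looks like. This is not the actual Model2Vec code: the model loading, the single-token batching, and the exact Zipf-based weight (`log(1 + rank)`) are illustrative assumptions.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

model_name = "baai/bge-base-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

vocab_size = tokenizer.vocab_size
embeddings = []
with torch.no_grad():
    for token_id in range(vocab_size):  # batching omitted for clarity
        # Propagate [BOS, token, EOS] through the model (CLS/SEP for BERT-style models).
        input_ids = torch.tensor([[tokenizer.cls_token_id, token_id, tokenizer.sep_token_id]])
        hidden = model(input_ids).last_hidden_state  # shape: (1, 3, hidden_dim)
        # Token selection: take the middle token as the representation.
        embeddings.append(hidden[0, 1].numpy())
embeddings = np.stack(embeddings)

# PCA: reduce to 256 dimensions.
embeddings = PCA(n_components=256).fit_transform(embeddings)

# Zipf: assume tokens are in frequency rank order and follow a power law,
# and use that to downweight very frequent (low-rank) tokens. The
# log(1 + rank) weight is an illustrative choice, not necessarily the exact one.
ranks = np.arange(1, vocab_size + 1)
embeddings = embeddings * np.log(1 + ranks)[:, None]
```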
1. Pooling
As a first change, we switched from selecting the middle token to mean pooling: the representation of a token is now the mean of the BOS token, the token itself, and the EOS token that we pass forward through the network.
In code:
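Something along these lines, reusing the `[BOS, token, EOS]` forward pass from the baseline sketch above; this is an illustrative sketch rather than the actual Model2Vec code:

```python
import torch

def pool(hidden: torch.Tensor) -> torch.Tensor:
    # hidden: (1, 3, hidden_dim), the model output for a [BOS, token, EOS] sequence.
    # Before: token selection, i.e. take the middle token.
    #   return hidden[0, 1]
    # After: mean pool over all three positions (BOS, the token itself, EOS).
    return hidden[0].mean(dim=0)
```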
2. SIF weighting
Following this, we replaced the Zipf weighting with a strategy based on the well-known SIF algorithm. In short, this algorithm creates a probability distribution over all tokens in the vocabulary, and downweights very frequent tokens while upweighting very infrequent tokens. For weighting, it uses the formula `a / (a + proba)`, where `a` is a small smoothing constant and `proba` is a vector of token probabilities. As before, we use Zipf’s law to estimate the token probabilities, because we don’t actually have access to the real ones. Applying this on top of the mean pooling raises the score from 45.91 to 47.40.
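Continuing the sketch from above, the weighting step looks roughly like this. The smoothing constant `a` is an assumption on our part: `1e-3` is the value suggested in the SIF paper, not necessarily the one used here.

```python
import numpy as np

# Estimate token probabilities with Zipf's law: assuming tokens are in
# frequency rank order, p(rank) is proportional to 1 / rank.
ranks = np.arange(1, vocab_size + 1)
proba = (1.0 / ranks) / (1.0 / ranks).sum()

# SIF weighting: very frequent tokens (high proba) get small weights,
# infrequent tokens get weights close to 1.
a = 1e-3  # smoothing constant; value is an assumption (taken from the SIF paper)
sif_weights = a / (a + proba)

# Apply the weights to the mean pooled, PCA-reduced token embeddings.
embeddings = embeddings * sif_weights[:, None]
```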
3. Normalization
Normalization has been a part of model2vec from the very first version. This is a boolean flag that, when set to `True`, unit normalizes all output vectors. It is set to `False` by default, but this turns out to be a bad choice: setting it to `True` has a significant positive effect, especially on retrieval and clustering, and raises the average score from 47.40 to a whopping 47.79.
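In terms of the output vectors, turning the flag on amounts to something like the following (a minimal sketch, assuming the output vectors are rows of a NumPy array):

```python
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    # Unit normalize each output vector so that it has an L2 norm of 1.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms
```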
Taking stock
If you want more details, you can find the full table below. As you can see, the improvements we found are general, in the sense that they improve performance for all tasks except PEARL. Anecdotally, this also seems to hold for other models we tried.

| | m2v_base_output | +mean pooling | +sif | +norm |
|---|---|---|---|---|
| Average (All) | 46.79 | 47.32 | 48.42 | 48.59 |
| Average (MTEB) | 45.34 | 45.91 | 47.40 | 47.79 |
| Classification | 61.25 | 61.43 | 63.76 | 63.22 |
| Clustering | 25.58 | 26.13 | 27.19 | 29.71 |
| PairClassification | 74.90 | 75.23 | 74.90 | 75.22 |
| Reranking | 47.63 | 47.73 | 48.29 | 48.29 |
| Retrieval | 26.14 | 27.17 | 28.93 | 28.93 |
| STS | 68.58 | 69.31 | 70.89 | 70.89 |
| Summarization | 29.20 | 29.45 | 29.32 | 29.35 |
| PEARL | 54.02 | 54.22 | 53.88 | 52.73 |
| WordSim | 49.18 | 49.70 | 49.63 | 49.63 |