Installation

To distill, make sure you install the distill extra:
pip install model2vec[distill]

Distilling a Model from a Sentence Transformer

To distill a model from a Sentence Transformer, use the distill function, which creates a lightweight static model from any Sentence Transformer. Distillation runs on a CPU in a few minutes.
from model2vec.distill import distill

# Distill a Sentence Transformer model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5")

# Save the model
m2v_model.save_pretrained("m2v_model")
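
Once saved, the distilled model can be loaded back and used for encoding. A minimal usage sketch, assuming the StaticModel API from model2vec and the save path used above:

from model2vec import StaticModel

# Load the distilled model from disk and embed a few sentences
model = StaticModel.from_pretrained("m2v_model")
embeddings = model.encode(["It is dangerous to go alone!", "Take this."])
print(embeddings.shape)  # (2, 256), assuming the default pca_dims of 256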

Parameters

model_name (str, required)
The model name to use. Any SentenceTransformer-compatible model works.

vocabulary (list[str] | None, default: None)
The vocabulary to use. If None, the model's built-in vocabulary is used.

device (str | None, default: None)
The device on which to run distillation (e.g. "cpu", "cuda"). If None, the library's device selection logic is used.

pca_dims (PCADimType, default: 256)
The number of PCA components to retain. If None, PCA is skipped; if "auto", PCA is applied without reducing dimensionality.

sif_coefficient (float | None, default: 1e-4)
The SIF coefficient to use for token weighting. Must be ≥ 0 and < 1. If None, no weighting is applied.

token_remove_pattern (str | None, default: r"\[unused\d+\]")
A regex pattern; tokens matching this pattern are removed from the vocabulary before distillation.

trust_remote_code (bool, default: False)
Whether to trust remote code when loading components. If False, only components from transformers are loaded; if True, all remote code is trusted.

quantize_to (DType | str, default: DType.Float16)
The data type to quantize the distilled model to (e.g. DType.Float16 or its string equivalent "float16"). Defaults to float16 quantization.

vocabulary_quantization (int | None, default: None)
The number of clusters to use for vocabulary quantization. If None, no quantization is performed.

pooling (PoolingMode | str, default: PoolingMode.MEAN)
The pooling mode to use for creating embeddings:
  • mean (default): mean over all tokens. Robust and works well in most cases.
  • last: the last token's hidden state (often the [EOS] token). Common for decoder-style models.
  • first: the first token's hidden state (the [CLS] token in BERT-style models).
  • pooler: the pooler output, if available. Often a non-linear projection of the [CLS] token.
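
For example, token_remove_pattern can be broadened beyond the default to strip other placeholder tokens. A sketch under stated assumptions: the combined pattern below is illustrative, not a library default, and whether it matches anything depends on the base model's tokenizer.

from model2vec.distill import distill

# Remove both BERT-style "[unused0]" placeholders and T5-style
# "<extra_id_0>" sentinels from the vocabulary before distillation.
# This combined pattern is an example, not the library default.
m2v_model = distill(
    model_name="BAAI/bge-base-en-v1.5",
    token_remove_pattern=r"\[unused\d+\]|<extra_id_\d+>",
)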

Advanced Distillation Options

There are many ways to customize the distillation process. Here’s an example that uses a custom vocabulary, different PCA dimensions, a different SIF coefficient, int8 quantization, vocabulary quantization, and last-token pooling:
from model2vec.distill import distill
m2v_model = distill(
    model_name="Qwen/Qwen3-Embedding-0.6B",
    vocabulary=["star wars", "lightsaber", "jedi", "sith"], # Add a custom vocabulary
    pca_dims=128, # Reduce to 128 dimensions
    sif_coefficient=1e-5, # Use a different SIF coefficient for weighting
    quantize_to="int8", # Quantize to int8
    vocabulary_quantization=10000, # Use vocabulary quantization with 10,000 clusters to reduce vocabulary size
    pooling="last" # Use last token pooling
)
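
The result is a regular static model that can encode text directly. A short usage sketch, continuing from the example above:

# The distilled model can be used for encoding right away
embeddings = m2v_model.encode(["the jedi ignited his lightsaber"])
print(embeddings.shape)  # expected (1, 128), since pca_dims=128 above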

Distillation Best Practices

There are a number of best practices, based on our own extensive experiments, that can help you get the most out of your distilled models:
  • Choose the right base model: The choice of base model is crucial. In our experiments, we found that models that perform better on MTEB do not always lead to better distilled models. We recommend trying out a few different base models to see which one works best for your specific use case. A good starting point is bge-base-en-v1.5 for English data, and bge-m3 for multilingual data, which we also used for distilling the Potion models.
  • Use a relevant vocabulary: The vocabulary used for distillation can have a significant impact on performance. If you have a specific domain or use case, consider using a vocabulary that is relevant to that domain (see the sketch after this list).
  • Set relatively low dimensionality: When using PCA for dimensionality reduction, it’s almost always safe to choose a lower dimensionality than that of the base model. Our default is 256, which tends to work just as well as the original dimensionality across many tasks.
  • Quantize the model: Aggressive quantization usually leads to little to no performance drop (as shown in our size reduction blogpost). float16 is a good default, but in most cases even int8 will work well and reduce model size substantially.
  • Choose the right pooling mode: The default pooling mode is mean pooling, which works well in most cases. However, some models (e.g. decoder-based embedding models such as Qwen3-Embedding) use last-token ([EOS]) pooling. Since automatically detecting the correct pooling mode is nearly impossible, carefully check which pooling mode your base model expects if performance is lower than expected.
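
To illustrate the vocabulary advice above, here is a minimal sketch of distilling with a domain vocabulary loaded from a plain-text word list. The file name and its one-term-per-line format are assumptions for this example:

from model2vec.distill import distill

# Hypothetical file: one domain term per line, e.g. "myocardial infarction"
with open("domain_vocab.txt") as f:
    vocabulary = [line.strip() for line in f if line.strip()]

# Distill with the domain vocabulary and save the result
m2v_model = distill(
    model_name="BAAI/bge-base-en-v1.5",
    vocabulary=vocabulary,
)
m2v_model.save_pretrained("m2v_domain_model")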