Installation
To distill, make sure you install the distill extra:
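```bash
# assuming the library is the model2vec package on PyPI
pip install model2vec[distill]
```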
Distilling a Model from a Sentence Transformer

To distill a model from a Sentence Transformer, you can use the `distill` function. This function allows you to create a lightweight static model from any Sentence Transformer. This can be done on a CPU in a few minutes.
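A minimal call might look like the following sketch (the import path and the `BAAI/bge-base-en-v1.5` base model are assumptions, not requirements):

```python
from model2vec.distill import distill

# Distill a Sentence Transformer into a lightweight static model (runs on CPU in minutes)
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

# Save the distilled model to disk
m2v_model.save_pretrained("m2v_model")
```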
Parameters
- `model_name`: The model name to use. Any SentenceTransformer-compatible model works.
- `vocabulary`: The vocabulary to use. If `None`, uses the model's built-in vocabulary.
- `device`: The device on which to run distillation (e.g., `"cpu"`, `"cuda"`). If `None`, defaults to the library's device selection logic.
- `pca_dims`: The number of PCA components to retain. If `None`, PCA is skipped; if `"auto"`, we still apply PCA without reducing dimensionality.
- `sif_coefficient`: The SIF coefficient to use for weighting. Must be ≥ 0 and < 1. If `None`, no weighting is applied.
- `token_remove_pattern`: A regex pattern. Tokens matching this pattern will be removed from the vocabulary before distillation.
- `trust_remote_code`: Whether to trust remote code when loading components. If `False`, only components from `transformers` are loaded; if `True`, all remote code is trusted.
- `quantize_to`: The data type to quantize the distilled model to (e.g., `DType.Float16` or its string equivalent). Defaults to float16 quantization.
- `vocabulary_quantization`: The number of clusters to use for vocabulary quantization. If this is `None`, no quantization is performed.
- `pooling`: The pooling mode to use for creating embeddings. Can be one of:
  - `mean` (default): mean over all tokens. Robust and works well in most cases.
  - `last`: use the last token's hidden state (often the `[EOS]` token). Common for decoder-style models.
  - `first`: use the first token's hidden state (the `[CLS]` token in BERT-style models).
  - `pooler`: use the pooler output (if available). This is often a non-linear projection of the `[CLS]` token.
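As a hedged illustration of how `pca_dims` and `sif_coefficient` interact (parameter names as described above; the base model is just an example):

```python
from model2vec.distill import distill

# Skip PCA entirely: keep the base model's original dimensionality
model_no_pca = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=None)

# Apply PCA without reducing dimensionality, and disable SIF weighting
model_auto = distill(
    model_name="BAAI/bge-base-en-v1.5",
    pca_dims="auto",
    sif_coefficient=None,
)
```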
Advanced Distillation Options

There are many ways to customize the distillation process. Here's an example that uses a custom vocabulary, different PCA dimensions, a different SIF coefficient, int8 quantization, vocabulary quantization, and last-token pooling:
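```python
from model2vec.distill import distill

# Hedged sketch: argument names follow the parameter list above, and the
# vocabulary and base model below are placeholders rather than recommended values.
custom_vocabulary = ["protein", "genome", "enzyme", "nucleotide"]  # hypothetical domain vocabulary

m2v_model = distill(
    model_name="Qwen/Qwen3-Embedding-0.6B",  # a decoder-style model that uses last-token pooling
    vocabulary=custom_vocabulary,
    pca_dims=128,                    # different PCA dimensionality
    sif_coefficient=1e-3,            # different SIF coefficient
    quantize_to="int8",              # int8 quantization
    vocabulary_quantization=256,     # quantize the vocabulary to 256 clusters
    pooling="last",                  # last-token (EOS) pooling
)
m2v_model.save_pretrained("m2v_custom_model")
```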
Distillation Best Practices

There are a number of best practices, based on our own extensive experiments, that can help you get the most out of your distilled models:

- Choose the right base model: The choice of base model is crucial. In our experiments, we found that models that perform better on MTEB do not always lead to better distilled models. We recommend trying out a few different base models to see which one works best for your specific use case. A good starting point is bge-base-en-v1.5 for English data and bge-m3 for multilingual data, which we also used for distilling the Potion models.
- Use a relevant vocabulary: The vocabulary used for distillation can have a significant impact on performance. If you have a specific domain or use case, consider using a vocabulary that is relevant to that domain.
- Set relatively low dimensionality: When using PCA for dimensionality reduction, it's almost always safe to choose a lower dimensionality than that of the base model. The default we use is 256, and this tends to work just as well as the original dimensionality across many tasks.
- Quantize the model: Aggressive quantization usually leads to little to no performance drop (as shown in our size reduction blogpost). `float16` is a good default, but in most cases even `int8` will work well and reduce model size substantially.
- Choose the right pooling mode: The default pooling mode is mean pooling, which works well in most cases. However, some models (e.g., decoder-style embedding models such as Qwen3-Embedding) use last-token (EOS) pooling. Since automatically detecting the correct pooling mode is (almost) impossible, carefully check which pooling mode your base model expects if performance is lower than expected; the sketch after this list shows one way to sanity-check it.
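As a rough way to compare pooling modes, you could distill the same base model twice and check whether paraphrases end up close together. This is a hedged sketch: the base model is just an example, and the `pooling` argument follows the parameter list above.

```python
import numpy as np

from model2vec.distill import distill

sentences = ["The cat sat on the mat.", "A feline rested on the rug."]

for mode in ("mean", "last"):
    model = distill(model_name="Qwen/Qwen3-Embedding-0.6B", pooling=mode)
    a, b = model.encode(sentences)
    # Cosine similarity between two paraphrases; the correct pooling mode
    # should yield a noticeably higher similarity.
    sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(f"{mode}: {sim:.3f}")
```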
