
Tokenlearn

Tokenlearn is a method for pre-training Model2Vec models. It starts from an already distilled Model2Vec model and trains it further on mean embeddings that a teacher model produces over a large corpus.

Quick Start

Installation

Install Tokenlearn with the following command:

pip install tokenlearn

Creating Features

Create features (the teacher model's mean embeddings for the texts in a dataset) with the following command:

python -m tokenlearn.featurize \
--model-name "baai/bge-base-en-v1.5" \
--output-dir "data/c4_features" \
--dataset-path "allenai/c4" \
--dataset-name "en" \
--dataset-split "train"
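
To make "mean embeddings" concrete, the following minimal sketch averages the token vectors of one sequence while skipping padded positions. It is an illustration in plain Python, not Tokenlearn's implementation: `mean_pool`, the toy vectors, and the mask are hypothetical, and the actual featurization step may pool or truncate differently.

```python
def mean_pool(token_vectors, mask):
    """Average the vectors whose mask entry is 1; padded positions are skipped."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("sequence has no unmasked tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (hypothetical): three token slots, the last one is padding.
token_vectors = [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

mean_vector = mean_pool(token_vectors, mask)
print(mean_vector)  # [3.0, 1.0]
```

Each text in the corpus would contribute one such vector, which then serves as a training target.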

Training a Model

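Train a model on the created features with the following command. The --data-path argument should point to the directory passed as --output-dir during featurization: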

python -m tokenlearn.train \
--model-name "baai/bge-base-en-v1.5" \
--data-path "data/c4_features" \
--save-path "<path-to-save-model>"
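
As a rough illustration of what training on mean embeddings could look like, the toy sketch below nudges a small table of static token vectors so that their average moves toward a teacher mean. Everything here is an assumption made for illustration: the mean squared error loss, the plain gradient step, and the names `mean_of`, `mse`, and `sgd_step` are hypothetical and are not taken from Tokenlearn's code.

```python
def mean_of(table, token_ids):
    """Average the static vectors of the given token ids."""
    dim = len(table[0])
    return [sum(table[t][i] for t in token_ids) / len(token_ids) for i in range(dim)]


def mse(pred, target):
    """Mean squared error between two vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def sgd_step(table, token_ids, target, lr):
    """One gradient step on the MSE between the pooled vector and the target."""
    pred = mean_of(table, token_ids)
    dim, n = len(pred), len(token_ids)
    for t in token_ids:
        for i in range(dim):
            # d(mse)/d(pred_i) = 2 * (pred_i - target_i) / dim,
            # and each token occurrence contributes 1/n to pred_i.
            table[t][i] -= lr * (2.0 * (pred[i] - target[i]) / dim) / n


# Toy setup (hypothetical): 3 tokens, 2 dimensions, one training text.
table = [[0.0, 0.0], [0.5, 0.5], [0.0, 0.0]]
token_ids = [0, 2]            # tokens appearing in the text
teacher_mean = [1.0, -1.0]    # target produced by the teacher model

loss_before = mse(mean_of(table, token_ids), teacher_mean)
sgd_step(table, token_ids, teacher_mean, lr=0.5)
loss_after = mse(mean_of(table, token_ids), teacher_mean)
print(loss_before, loss_after)
```

Only the vectors of tokens that occur in the text are updated, so the loss on this example drops while the unused token's vector stays unchanged.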