Tokenlearn
Tokenlearn is a method for pre-training Model2Vec models. It starts from a pre-distilled Model2Vec model and pre-trains it on mean embeddings that a teacher model produces over a large corpus.
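To make "mean embeddings" concrete, here is a minimal sketch of one plausible reading: the teacher yields one vector per token, and the training target for a text is the average of those vectors over non-padding tokens. The function name `mean_embedding`, the mask convention, and the toy values are assumptions for illustration, not Tokenlearn's actual code.

```python
def mean_embedding(token_vectors, mask):
    """Average the token vectors whose mask entry is truthy.

    token_vectors: list of equal-length lists of floats (one per token).
    mask: list of 0/1 flags, 0 marking padding tokens to skip.
    Hypothetical helper for illustration only.
    """
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to average")
    dim = len(kept[0])
    # Average each dimension across the kept tokens.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: two real tokens and one padding token, 2-dimensional vectors.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
target = mean_embedding(token_vectors, mask)
print(target)  # [2.0, 3.0]
```

In this reading, each text in the corpus contributes one such target vector, and the Model2Vec model is trained to reproduce it.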
Quick Start
Installation
Install Tokenlearn with the following command:
pip install tokenlearn
Creating Features
Create features with the following command:
python -m tokenlearn.featurize \
    --model-name "baai/bge-base-en-v1.5" \
    --output-dir "data/c4_features" \
    --dataset-path "allenai/c4" \
    --dataset-name "en" \
    --dataset-split "train"
Training a Model
Train a model with the following command:
python -m tokenlearn.train \
    --model-name "baai/bge-base-en-v1.5" \
    --data-path "data/c4_features" \
    --save-path "<path-to-save-model>"
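The two commands share a model name and a features directory, so it can help to drive them from one script. The following is a minimal sketch that only uses the flags shown above; the helper names (`featurize_cmd`, `train_cmd`, `run_pipeline`) and the `dry_run` switch are assumptions introduced here, not part of Tokenlearn. With `dry_run=True` it prints the commands without executing them.

```python
import shlex
import subprocess
import sys

MODEL_NAME = "baai/bge-base-en-v1.5"
FEATURES_DIR = "data/c4_features"
SAVE_PATH = "<path-to-save-model>"  # placeholder from the command above; replace before running


def featurize_cmd(model_name, output_dir, dataset_path, dataset_name, dataset_split):
    """Build the featurize command (hypothetical helper)."""
    return [
        sys.executable, "-m", "tokenlearn.featurize",
        "--model-name", model_name,
        "--output-dir", output_dir,
        "--dataset-path", dataset_path,
        "--dataset-name", dataset_name,
        "--dataset-split", dataset_split,
    ]


def train_cmd(model_name, data_path, save_path):
    """Build the train command (hypothetical helper)."""
    return [
        sys.executable, "-m", "tokenlearn.train",
        "--model-name", model_name,
        "--data-path", data_path,
        "--save-path", save_path,
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step fails,
            # so training never starts on missing features.
            subprocess.run(cmd, check=True)


commands = [
    featurize_cmd(MODEL_NAME, FEATURES_DIR, "allenai/c4", "en", "train"),
    train_cmd(MODEL_NAME, FEATURES_DIR, SAVE_PATH),
]
run_pipeline(commands, dry_run=True)
```

Keeping `FEATURES_DIR` in one variable guarantees that the directory written by the featurize step is the one the train step reads. Set `dry_run=False` once `SAVE_PATH` points at a real location.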