- Model2Vec models are distilled versions of sentence transformers
- Model2Vec models are drop-in replacements for sentence transformers
Classification
To test the classification efficacy of Model2Vec in comparison to fasttext, we ran experiments on 15 datasets from the setfit organization on Hugging Face. We initialized the fasttext classifier using the `wiki-news-300d-1M-subword.vec` vectors available from the fastText website, and used the nltk `word_tokenize` function as the tokenizer. The Model2Vec model was initialized from `minishlab/potion-base-32m`. All models were trained with sensible defaults, and we optimized the tokenization for the fasttext model.
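For concreteness, here is a minimal sketch of what this setup could look like. The specific dataset, the lowercasing, and the logistic-regression head on top of the Model2Vec embeddings are our own illustrative choices, not necessarily the exact pipeline behind the numbers below:

```python
import fasttext
import nltk
from datasets import load_dataset
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression
from model2vec import StaticModel

nltk.download("punkt", quiet=True)  # word_tokenize needs the punkt model

# One of the 15 setfit datasets; columns are assumed to be "text" and "label".
ds = load_dataset("SetFit/emotion", split="train")
train_texts, train_labels = ds["text"], ds["label"]

# --- fasttext: pre-tokenize with nltk, write fasttext's __label__ format ---
with open("train.txt", "w", encoding="utf-8") as f:
    for text, label in zip(train_texts, train_labels):
        # lowercasing is one possible tokenization tweak, assumed here
        tokens = " ".join(word_tokenize(text.lower()))
        f.write(f"__label__{label} {tokens}\n")

# dim must match the pretrained wiki-news vectors (300)
ft_clf = fasttext.train_supervised(
    input="train.txt",
    pretrainedVectors="wiki-news-300d-1M-subword.vec",
    dim=300,
)

# --- Model2Vec: encode with the static model, fit a simple linear head ---
# (logistic regression is a stand-in for whatever classifier head was used)
m2v = StaticModel.from_pretrained("minishlab/potion-base-32m")
m2v_clf = LogisticRegression(max_iter=1000).fit(m2v.encode(train_texts), train_labels)
```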
Performance
Here’s the full table for both approaches:

| dataset_name | Model2Vec | fasttext |
|---|---|---|
| 20_newsgroups | 66.18 | 57.11 |
| ade | 88.05 | 86.14 |
| ag_news | 91.61 | 92.18 |
| amazon_counterfactual | 80.64 | 82.58 |
| bbc | 96.93 | 95.86 |
| emotion | 79.70 | 79.35 |
| enron_spam | 98.85 | 98.85 |
| hatespeech_offensive | 70.94 | 69.48 |
| imdb | 87.79 | 89.51 |
| massive_scenario | 88.78 | 87.26 |
| senteval_cr | 79.12 | 76.20 |
| sst5 | 41.49 | 19.59 |
| student | 93.76 | 93.12 |
| subj | 91.94 | 92.60 |
| tweet_sentiment_extraction | 74.14 | 68.98 |
| Average | 81.99 | 79.25 |
Training time
Fasttext models train faster than Model2Vec models. Note that this only concerns the actual training of the supervised classifier; we don’t include any pretraining time. One observation is that Model2Vec models tend to train for a bit too long. Also, training takes less than a minute for either approach, so the difference hardly matters in practice.
Inference time
Model2Vec processes about 14.6k samples per second, while fasttext processes about 3.6k. This comes with the caveat that the inference time of the two approaches is difficult to compare, since almost all of the time for both models is actually spent in the tokenizer. Disabling any preprocessing for fasttext makes it faster than Model2Vec (3.6k -> 25k (!) samples/second), albeit with a hit to performance (79.5 -> 78.5 average score). This underscores one of the painful issues of older NLP approaches: preprocessing/tokenization matters a lot, and it is difficult to align your own tokenization with models found online.
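For reference, throughput figures like these can be obtained with a simple wall-clock measurement. Continuing the sketch from above; the per-sentence fasttext loop, with nltk pre-tokenization inside it, is our assumption about where the time goes:

```python
import time

def samples_per_second(encode_fn, texts):
    """Return throughput for a single pass of encode_fn over texts."""
    start = time.perf_counter()
    encode_fn(texts)
    return len(texts) / (time.perf_counter() - start)

test_texts = load_dataset("SetFit/emotion", split="test")["text"]

# Model2Vec encodes the whole batch in one call
m2v_speed = samples_per_second(m2v.encode, test_texts)

# fasttext predicts sentence by sentence; tokenization dominates the cost
ft_speed = samples_per_second(
    lambda batch: [ft_clf.predict(" ".join(word_tokenize(s.lower()))) for s in batch],
    test_texts,
)
print(f"Model2Vec: {m2v_speed:,.0f} samples/s, fasttext: {ft_speed:,.0f} samples/s")
```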
Model size
The trained Model2Vec model is only 130 MB on disk, while the fasttext model is substantially larger at 2.1 GB. Note, however, that both Model2Vec and fasttext can be compressed through quantization.
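For fasttext, the built-in product quantization of supervised models is the obvious route; for Model2Vec, the embedding matrix can likewise be stored at lower float precision. A sketch of the fasttext side, with illustrative (not tuned) settings:

```python
# Product-quantize the trained supervised model; cutoff prunes the
# vocabulary to the most important entries before quantizing.
ft_clf.quantize(input="train.txt", retrain=True, cutoff=100_000)
ft_clf.save_model("fasttext_classifier.ftz")  # typically much smaller on disk

# For Model2Vec, casting the embedding matrix to float16 before saving is
# one option; the exact mechanism depends on the model2vec version.
```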
Zero shot (MTEB)
fasttext vectors can also be used as word2vec-style word embeddings. As such, we can test how well they work on the Massive Text Embedding Benchmark (MTEB) as a zero-shot embedding approach, comparing directly with a Model2Vec model. Following the setup above, we use the nltk `word_tokenize` function to tokenize text going into fasttext, and normalize all output vectors to unit length. We perform no additional preprocessing for Model2Vec. Because running MTEB can take a long time, we don’t run all subsets. We use the original MTEB benchmark.
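To make the fasttext side of this concrete, a wrapper along these lines gives it the `encode` interface that MTEB expects. Loading the `.vec` file through gensim and zeroing out fully out-of-vocabulary sentences are our own choices here:

```python
import numpy as np
from gensim.models import KeyedVectors
from nltk.tokenize import word_tokenize

class FasttextZeroShot:
    """Mean-pooled, unit-normalized fasttext word vectors."""

    def __init__(self, path: str = "wiki-news-300d-1M-subword.vec"):
        self.kv = KeyedVectors.load_word2vec_format(path)

    def encode(self, sentences, **kwargs) -> np.ndarray:
        out = np.zeros((len(sentences), self.kv.vector_size), dtype=np.float32)
        for i, sentence in enumerate(sentences):
            vecs = [self.kv[w] for w in word_tokenize(sentence) if w in self.kv]
            if vecs:  # sentences with no in-vocabulary tokens stay all-zero
                mean = np.mean(vecs, axis=0)
                out[i] = mean / np.linalg.norm(mean)
        return out
```

An instance of this class can then be passed to MTEB's `evaluation.run`, while the Model2Vec model needs no wrapper, since `StaticModel.encode` already returns a matrix of sentence embeddings.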
Results
It’s honestly not looking too good for fasttext. Model2Vec blows it out of the water at all tasks, excluding WordSim, which is a set of word similarity tasks.

| Task | fasttext | Model2Vec |
|---|---|---|
| Classification | 51.97 | 65.97 |
| Clustering | 22.25 | 35.29 |
| PairClassification | 47.89 | 78.17 |
| Reranking | 40.70 | 50.92 |
| STS | 48.20 | 74.22 |
| Summarization | 29.41 | 29.78 |
| WordSim | 59.29 | 55.15 |
The WordSim exception is interesting, because these kinds of lexical similarity datasets were popular around the time fasttext, GloVe, word2vec, and other static methods were initially created. So this could be one of the reasons these vectors work well: methods are developed with reference to the evaluation data that is available at the time of development.