Model2Vec is typically viewed as a fast alternative to a sentence transformer. There’s a good reason for that:
  1. Model2Vec models are distilled versions of sentence transformers
  2. Model2Vec models are drop-in replacements for sentence transformers
Having said that, a better point of comparison is actually Meta’s fastText. Like Model2Vec, fastText can be used to create static vectors, and it can also be used to train classifiers on top of a set of static vectors. In this short blog post, we’ll show that off-the-shelf Model2Vec models are much better than fastText, both as classifiers and as word vectors. So, if you’re currently using fastText somewhere, consider comparing it to Model2Vec!

Classification

To test the classification performance of Model2Vec against fastText, we ran experiments on 15 datasets from the setfit organization on Hugging Face. We initialized the fastText classifier with the official wiki-news-300d-1M-subword.vec vectors and used nltk’s word_tokenize function as the tokenizer; this tokenization was chosen to favor the fastText model. The Model2Vec model was initialized from minishlab/potion-base-32m. All models were trained with sensible defaults.
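
For concreteness, here’s roughly what this setup looks like in code. This is a minimal sketch under our own assumptions (toy data, a train.txt scratch file, default hyperparameters), not the exact experimental script:

```python
# Minimal sketch of the two classifier setups (toy data; not the exact
# experimental code). Requires `pip install model2vec[train] fasttext nltk`
# and nltk.download("punkt") for the tokenizer.
import fasttext
from nltk.tokenize import word_tokenize
from model2vec.train import StaticModelForClassification

train_texts = ["great movie", "terrible plot", "loved it", "waste of time"]
train_labels = ["pos", "neg", "pos", "neg"]

# Model2Vec: load the pretrained static model and fit a classifier head.
m2v = StaticModelForClassification.from_pretrained(
    model_name="minishlab/potion-base-32m"
)
m2v.fit(train_texts, train_labels)
print(m2v.predict(["what a fantastic film"]))

# fastText expects one "__label__<y> <tokenized text>" line per example;
# we pre-tokenize with nltk's word_tokenize, as in the experiments.
with open("train.txt", "w") as f:
    for text, label in zip(train_texts, train_labels):
        f.write(f"__label__{label} {' '.join(word_tokenize(text))}\n")

ft = fasttext.train_supervised(
    input="train.txt",
    pretrainedVectors="wiki-news-300d-1M-subword.vec",  # downloaded separately
    dim=300,  # must match the pretrained vector dimensionality
)
print(ft.predict(" ".join(word_tokenize("what a fantastic film"))))
```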

Performance

Here’s the full table for both approaches:
| dataset_name | Model2Vec | fastText |
|---|---:|---:|
| 20_newsgroups | 66.18 | 57.11 |
| ade | 88.05 | 86.14 |
| ag_news | 91.61 | 92.18 |
| amazon_counterfactual | 80.64 | 82.58 |
| bbc | 96.93 | 95.86 |
| emotion | 79.70 | 79.35 |
| enron_spam | 98.85 | 98.85 |
| hatespeech_offensive | 70.94 | 69.48 |
| imdb | 87.79 | 89.51 |
| massive_scenario | 88.78 | 87.26 |
| senteval_cr | 79.12 | 76.20 |
| sst5 | 41.49 | 19.59 |
| student | 93.76 | 93.12 |
| subj | 91.94 | 92.60 |
| tweet_sentiment_extraction | 74.14 | 68.98 |
| Average | 81.99 | 79.25 |
Model2Vec outperforms fastText on average, although the gains are smaller than we anticipated.

Training time

fastText models train faster than Model2Vec models. Note that this concerns only the actual training of the supervised classifier; we don’t include any pretraining time. One observation is that Model2Vec models tend to train for a bit longer than necessary. In any case, training takes less than a minute for either approach, so the difference is unlikely to matter in practice.

Inference time

Model2Vec processes about 14.6k samples per second, while fastText processes about 3.6k. This comes with the caveat that the inference times of the two approaches are hard to compare directly, since almost all of the time for both models is actually spent in the tokenizer. Disabling all preprocessing for fastText makes it faster than Model2Vec (3.6k -> 25k (!) samples/second), albeit with a hit to performance (79.5 -> 78.5 average score). This underscores one of the painful issues with older NLP approaches: preprocessing/tokenization matters a lot, and it is difficult to align your own tokenization with that of models found online.
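
To see where the time goes, a quick microbenchmark along these lines makes the point (a sketch; the texts and counts are placeholders, and real throughput depends on your hardware):

```python
# Rough throughput sketch: Model2Vec end-to-end encoding versus nltk
# tokenization alone, which dominates the fastText pipeline.
import time

from nltk.tokenize import word_tokenize
from model2vec import StaticModel

texts = ["this is a fairly typical input sentence for benchmarking"] * 10_000

# Model2Vec: tokenization and pooling both happen inside encode().
model = StaticModel.from_pretrained("minishlab/potion-base-32m")
start = time.perf_counter()
model.encode(texts)
print(f"Model2Vec: {len(texts) / (time.perf_counter() - start):,.0f} samples/s")

# nltk tokenization by itself, i.e. the preprocessing that fastText pays
# for before it ever looks up a vector.
start = time.perf_counter()
for text in texts:
    word_tokenize(text)
print(f"word_tokenize alone: {len(texts) / (time.perf_counter() - start):,.0f} samples/s")
```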

Model size

The trained Model2Vec model is only 130 MB on disk, while the fastText model is substantially larger at 2.1 GB. Note, however, that both Model2Vec and fastText models can be compressed through quantization.
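
As a rough illustration of what quantization buys (a generic sketch with a stand-in matrix, not a demonstration of either library’s built-in tooling): casting a float32 embedding matrix to float16 halves its footprint, usually with negligible quality loss.

```python
# Generic post-hoc quantization sketch with a stand-in embedding matrix.
import numpy as np

embeddings = np.random.rand(100_000, 300).astype(np.float32)  # stand-in
print(f"float32: {embeddings.nbytes / 1e6:.0f} MB")

# Casting to float16 halves the size on disk and in memory.
quantized = embeddings.astype(np.float16)
print(f"float16: {quantized.nbytes / 1e6:.0f} MB")
```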

Zero shot (MTEB)

fastText vectors can also be used as general-purpose word embeddings, much like word2vec. As such, we can test how well they work on the Massive Text Embedding Benchmark (MTEB) as a zero-shot embedding approach, comparing them directly with a Model2Vec model. As above, we use nltk’s word_tokenize function to tokenize the text going into fastText, and we normalize all output vectors to unit length. We perform no additional preprocessing for Model2Vec. Because running MTEB can take a long time, we don’t run all subsets. We use the original MTEB benchmark.
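
A sketch of that fastText embedding pipeline, assuming gensim’s KeyedVectors as the loader for the .vec file (our choice for illustration, not necessarily what the experiments used):

```python
# Zero-shot fastText sentence embeddings: mean-pool the word vectors of
# nltk-tokenized text, then normalize the result to unit length.
import numpy as np
from gensim.models import KeyedVectors
from nltk.tokenize import word_tokenize

vectors = KeyedVectors.load_word2vec_format("wiki-news-300d-1M-subword.vec")

def embed(text: str) -> np.ndarray:
    # Keep only in-vocabulary tokens; fall back to a zero vector if none.
    tokens = [t for t in word_tokenize(text) if t in vectors]
    if not tokens:
        return np.zeros(vectors.vector_size, dtype=np.float32)
    mean = np.mean([vectors[t] for t in tokens], axis=0)
    return mean / np.linalg.norm(mean)  # unit length, as in the experiments

print(embed("a short test sentence").shape)  # (300,)
```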

Results

It’s honestly not looking too good for fastText. Model2Vec blows it out of the water on every task except WordSim.
| Task | fastText | Model2Vec |
|---|---:|---:|
| Classification | 51.97 | 65.97 |
| Clustering | 22.25 | 35.29 |
| PairClassification | 47.89 | 78.17 |
| Reranking | 40.70 | 50.92 |
| STS | 48.20 | 74.22 |
| Summarization | 29.41 | 29.78 |
| WordSim | 59.29 | 55.15 |
The only task on which fastText performs well is WordSim, a collection of lexical similarity tasks. This is interesting, because these kinds of datasets were popular around the time fastText, GloVe, word2vec, and other static methods were first created. That could be one of the reasons these vectors do well here: methods are developed with reference to the evaluation data available at the time of their development.

Conclusion

If you’re still using fastText, be it for classification or for word embeddings, it’s probably time for an upgrade. Model2Vec offers smaller models, faster inference, and better downstream results in most cases. Give it a try, and benchmark for yourself!