For fasttext, we used the `wiki-news-300d-1M-subword.vec` vectors we got here, with the nltk `word_tokenize` function as a tokenizer. The Model2Vec model was initialized from `minishlab/potion-base-32m`. All models were trained with sensible defaults, and we optimized the tokenization for the fasttext model.
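As a concrete illustration, here is a minimal sketch of how this setup could look in Python. It assumes gensim for reading the `.vec` file, nltk's punkt data for `word_tokenize`, and scikit-learn's `LogisticRegression` standing in for the "sensible defaults"; `train_texts` and `train_labels` are hypothetical placeholders, not names from our actual code.

```python
import numpy as np
from gensim.models import KeyedVectors
from model2vec import StaticModel
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression

# Load the pretrained static vectors for both models.
# word_tokenize assumes nltk's punkt data has been downloaded.
fasttext_vectors = KeyedVectors.load_word2vec_format("wiki-news-300d-1M-subword.vec")
m2v = StaticModel.from_pretrained("minishlab/potion-base-32m")

def embed_fasttext(text: str) -> np.ndarray:
    """Mean-pool the fasttext vectors of all in-vocabulary tokens."""
    tokens = [t for t in word_tokenize(text) if t in fasttext_vectors]
    if not tokens:
        return np.zeros(fasttext_vectors.vector_size, dtype=np.float32)
    return np.mean([fasttext_vectors[t] for t in tokens], axis=0)

# `train_texts` and `train_labels` are a hypothetical training split.
X_fasttext = np.stack([embed_fasttext(t) for t in train_texts])
X_m2v = m2v.encode(train_texts)  # no extra preprocessing for Model2Vec

clf_fasttext = LogisticRegression(max_iter=1000).fit(X_fasttext, train_labels)
clf_m2v = LogisticRegression(max_iter=1000).fit(X_m2v, train_labels)
```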
| Dataset | Model2Vec | fasttext |
|---|---|---|
| 20_newsgroups | 66.18 | 57.11 |
| ade | 88.05 | 86.14 |
| ag_news | 91.61 | 92.18 |
| amazon_counterfactual | 80.64 | 82.58 |
| bbc | 96.93 | 95.86 |
| emotion | 79.70 | 79.35 |
| enron_spam | 98.85 | 98.85 |
| hatespeech_offensive | 70.94 | 69.48 |
| imdb | 87.79 | 89.51 |
| massive_scenario | 88.78 | 87.26 |
| senteval_cr | 79.12 | 76.20 |
| sst5 | 41.49 | 19.59 |
| student | 93.76 | 93.12 |
| subj | 91.94 | 92.60 |
| tweet_sentiment_extraction | 74.14 | 68.98 |
| Average | 81.99 | 79.25 |
As before, we use the nltk `word_tokenize` function to tokenize text going into fasttext, and also normalize all output vectors to unit length. We perform no additional preprocessing for Model2Vec. Because running MTEB can take a long time, we don't run all subsets. We use the original MTEB benchmark.
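For reference, here is a sketch of how the fasttext side might be wrapped for MTEB, reusing `embed_fasttext` from the snippet above. It assumes a recent version of the `mteb` package, which only needs an object exposing an `encode` method; the single task shown is illustrative, not our exact subset.

```python
import mteb
import numpy as np

class FasttextEncoder:
    """Wraps the fasttext pipeline in the `encode` interface MTEB expects."""

    def encode(self, sentences, **kwargs):
        # Reuses embed_fasttext() from the previous snippet.
        embeddings = np.stack([embed_fasttext(s) for s in sentences])
        # Normalize every output vector to unit length.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        return embeddings / np.clip(norms, 1e-12, None)

# Run a single illustrative task rather than the full benchmark.
tasks = mteb.get_tasks(tasks=["STSBenchmark"])
results = mteb.MTEB(tasks=tasks).run(FasttextEncoder(), output_folder="results")
```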
| Task category | fasttext | Model2Vec |
|---|---|---|
| Classification | 51.97 | 65.97 |
| Clustering | 22.25 | 35.29 |
| PairClassification | 47.89 | 78.17 |
| Reranking | 40.70 | 50.92 |
| STS | 48.20 | 74.22 |
| Summarization | 29.41 | 29.78 |
| WordSim | 59.29 | 55.15 |
The only category in which fasttext outperforms Model2Vec is WordSim, which is a collection of lexical similarity tasks. This is interesting, because these kinds of datasets were popular around the time fasttext, GloVe, word2vec, and other static methods were initially created. So this could be one of the reasons these vectors score well here: methods are developed with reference to the evaluation data available at the time of development.