We’ve benchmarked SemHash on a variety of datasets to measure deduplication performance and speed. The benchmarks were run with the following setup (a minimal reproduction sketch follows the list):

  • All benchmarks were run on CPU.
  • All benchmarks were run with `use_ann=True`.
  • The encoder is the default encoder (potion-base-8M).
  • The timings include the encoding time, index building time, and deduplication time.

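For reference, the train benchmark boils down to a single `self_deduplicate` call per dataset. The sketch below is a minimal version of that loop for one dataset; the timing scaffolding and the use of Hugging Face `datasets` for loading are our own assumptions, while `SemHash.from_records` and `self_deduplicate` are SemHash's documented entry points (we assume `use_ann` is passed via `from_records`).

```python
import time

from datasets import load_dataset
from semhash import SemHash

# Example dataset; the benchmark runs the same steps for each dataset.
train_texts = load_dataset("ag_news", split="train")["text"]

start = time.perf_counter()

# Encode the records and build the (approximate) index with the
# default potion-base-8M encoder.
semhash = SemHash.from_records(records=train_texts, use_ann=True)

# Deduplicate the training set against itself.
deduplicated = semhash.self_deduplicate().deduplicated

elapsed = time.perf_counter() - start
removed = 100 * (1 - len(deduplicated) / len(train_texts))
print(f"{len(train_texts)} -> {len(deduplicated)} records "
      f"({removed:.2f}% removed) in {elapsed:.2f}s")
```
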
Train Deduplication Benchmark

| Dataset | Original Train Size | Deduplicated Train Size | % Removed | Deduplication Time (s) |
| --- | --- | --- | --- | --- |
| bbc | 1225 | 1144 | 6.61 | 0.57 |
| senteval_cr | 3012 | 2990 | 0.73 | 0.14 |
| tweet_sentiment_extraction | 27481 | 26695 | 2.86 | 1.77 |
| emotion | 16000 | 15695 | 1.91 | 0.77 |
| amazon_counterfactual | 5000 | 4992 | 0.16 | 0.33 |
| ag_news | 120000 | 106921 | 10.90 | 5.20 |
| enron_spam | 31716 | 20540 | 35.24 | 2.03 |
| subj | 8000 | 7990 | 0.12 | 0.63 |
| sst5 | 8544 | 8526 | 0.21 | 0.58 |
| 20_newgroups | 11314 | 10684 | 5.57 | 0.73 |
| hatespeech_offensive | 22783 | 22090 | 3.04 | 0.92 |
| ade | 17637 | 15718 | 10.88 | 0.73 |
| imdb | 25000 | 24830 | 0.68 | 1.76 |
| massive_scenario | 11514 | 9366 | 18.66 | 0.47 |
| student | 117519 | 63856 | 45.66 | 8.80 |
| squad_v2 | 130319 | 109698 | 15.82 | 8.81 |
| wikitext | 1801350 | 884645 | 50.89 | 83.53 |

Train/Test Deduplication Benchmark

| Dataset | Train Size | Test Size | Deduplicated Test Size | % Removed | Deduplication Time (s) |
| --- | --- | --- | --- | --- | --- |
| bbc | 1225 | 1000 | 870 | 13.00 | 0.71 |
| senteval_cr | 3012 | 753 | 750 | 0.40 | 0.13 |
| tweet_sentiment_extraction | 27481 | 3534 | 3412 | 3.45 | 1.53 |
| emotion | 16000 | 2000 | 1926 | 3.70 | 0.65 |
| amazon_counterfactual | 5000 | 5000 | 4990 | 0.20 | 0.51 |
| ag_news | 120000 | 7600 | 6198 | 18.45 | 3.74 |
| enron_spam | 31716 | 2000 | 1060 | 47.00 | 1.94 |
| subj | 8000 | 2000 | 1999 | 0.05 | 0.62 |
| sst5 | 8544 | 2210 | 2205 | 0.23 | 0.59 |
| 20_newgroups | 11314 | 7532 | 7098 | 5.76 | 2.25 |
| hatespeech_offensive | 22783 | 2000 | 1925 | 3.75 | 0.77 |
| ade | 17637 | 5879 | 4952 | 15.77 | 0.81 |
| imdb | 25000 | 25000 | 24795 | 0.82 | 2.81 |
| massive_scenario | 11514 | 2974 | 2190 | 26.36 | 0.46 |
| student | 117519 | 5000 | 2393 | 52.14 | 3.78 |
| squad_v2 | 130319 | 11873 | 11863 | 0.08 | 7.13 |
| wikitext | 1801350 | 4358 | 2139 | 50.92 | 40.32 |

As these results show, SemHash is extremely fast and scales to datasets with millions of records. The train/test benchmark also surfaces notable cases of train/test leakage, such as enron_spam and student, where 47.00% and 52.14% of the test records, respectively, are semantic duplicates of training records.
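
To run this kind of leakage check yourself, the train/test benchmark corresponds to fitting the index on the train split and then deduplicating the test split against it. Below is a minimal sketch, assuming the same `from_records` entry point plus SemHash's documented `deduplicate` method; the enron_spam loading path is illustrative.

```python
from datasets import load_dataset
from semhash import SemHash

# Illustrative dataset: enron_spam showed ~47% test/train overlap above.
dataset = load_dataset("SetFit/enron_spam")
train_texts = dataset["train"]["text"]
test_texts = dataset["test"]["text"]

# Fit the index on the training split.
semhash = SemHash.from_records(records=train_texts, use_ann=True)

# Drop test records that are semantic duplicates of training records.
clean_test = semhash.deduplicate(records=test_texts).deduplicated

print(f"Test size: {len(test_texts)} -> {len(clean_test)}")
```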