We’ve benchmarked SemHash on a variety of text and image datasets to measure deduplication performance and speed.

Text Benchmarks

All text benchmarks were run with the following setup:
  • CPU-only (no GPU acceleration)
  • ANN backend: Default backend (USearch)
  • Encoder: Default encoder (potion-base-8M)
  • Timing includes encoding time, index building time, and deduplication time
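
For reference, a single train-deduplication run like the ones below can be reproduced in a few lines. This is a minimal sketch, assuming SemHash's `from_records` / `self_deduplicate` API and the Hugging Face `datasets` loader; leaving the model and backend unset falls back to the defaults listed above (potion-base-8M, USearch).

```python
from datasets import load_dataset
from semhash import SemHash

# Load one of the benchmark datasets (ag_news as an example)
train_texts = load_dataset("ag_news", split="train")["text"]

# Build the index; with no model/backend arguments, SemHash uses the
# defaults listed above (potion-base-8M encoder, USearch ANN backend)
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the training set against itself
result = semhash.self_deduplicate()

print(len(result.deduplicated))  # records kept after deduplication
print(result.duplicate_ratio)    # fraction of records removed
```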

Train Deduplication Benchmark

| Dataset | Original Train Size | Deduplicated Train Size | % Removed | Deduplication Time (s) |
|---|---|---|---|---|
| bbc | 1225 | 1144 | 6.61 | 0.57 |
| senteval_cr | 3012 | 2990 | 0.73 | 0.14 |
| tweet_sentiment_extraction | 27481 | 26695 | 2.86 | 1.77 |
| emotion | 16000 | 15695 | 1.91 | 0.77 |
| amazon_counterfactual | 5000 | 4992 | 0.16 | 0.33 |
| ag_news | 120000 | 106921 | 10.90 | 5.20 |
| enron_spam | 31716 | 20540 | 35.24 | 2.03 |
| subj | 8000 | 7990 | 0.12 | 0.63 |
| sst5 | 8544 | 8526 | 0.21 | 0.58 |
| 20_newgroups | 11314 | 10684 | 5.57 | 0.73 |
| hatespeech_offensive | 22783 | 22090 | 3.04 | 0.92 |
| ade | 17637 | 15718 | 10.88 | 0.73 |
| imdb | 25000 | 24830 | 0.68 | 1.76 |
| massive_scenario | 11514 | 9366 | 18.66 | 0.47 |
| student | 117519 | 63856 | 45.66 | 8.80 |
| squad_v2 | 130319 | 109698 | 15.82 | 8.81 |
| wikitext | 1801350 | 884645 | 50.89 | 83.53 |

Train/Test Deduplication Benchmark

| Dataset | Train Size | Test Size | Deduplicated Test Size | % Removed | Deduplication Time (s) |
|---|---|---|---|---|---|
| bbc | 1225 | 1000 | 870 | 13.00 | 0.71 |
| senteval_cr | 3012 | 753 | 750 | 0.40 | 0.13 |
| tweet_sentiment_extraction | 27481 | 3534 | 3412 | 3.45 | 1.53 |
| emotion | 16000 | 2000 | 1926 | 3.70 | 0.65 |
| amazon_counterfactual | 5000 | 5000 | 4990 | 0.20 | 0.51 |
| ag_news | 120000 | 7600 | 6198 | 18.45 | 3.74 |
| enron_spam | 31716 | 2000 | 1060 | 47.00 | 1.94 |
| subj | 8000 | 2000 | 1999 | 0.05 | 0.62 |
| sst5 | 8544 | 2210 | 2205 | 0.23 | 0.59 |
| 20_newgroups | 11314 | 7532 | 7098 | 5.76 | 2.25 |
| hatespeech_offensive | 22783 | 2000 | 1925 | 3.75 | 0.77 |
| ade | 17637 | 5879 | 4952 | 15.77 | 0.81 |
| imdb | 25000 | 25000 | 24795 | 0.82 | 2.81 |
| massive_scenario | 11514 | 2974 | 2190 | 26.36 | 0.46 |
| student | 117519 | 5000 | 2393 | 52.14 | 3.78 |
| squad_v2 | 130319 | 11873 | 11863 | 0.08 | 7.13 |
| wikitext | 1801350 | 4358 | 2139 | 50.92 | 40.32 |

Key Findings

SemHash is extremely fast and scales to large datasets with millions of records:
  • Speed: Deduplication is fast even for large datasets (e.g., 1.8M records in ~83 seconds)
  • Train/Test Leakage: Several datasets show significant train/test overlap (a sketch for measuring this follows the list):
    • enron_spam: 47% of test data overlaps with training data
    • student: 52% of test data overlaps with training data
    • wikitext: 51% of test data overlaps with training data
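
Measuring this kind of leakage follows the same pattern as above: fit the index on the training split, then filter the test split against it. A minimal sketch, assuming SemHash's `deduplicate` method; the `SetFit/enron_spam` Hub path is an assumption used here for illustration.

```python
from datasets import load_dataset
from semhash import SemHash

# The SetFit/enron_spam Hub path is assumed here for illustration
dataset = load_dataset("SetFit/enron_spam")
train_texts = dataset["train"]["text"]
test_texts = dataset["test"]["text"]

# Fit the index on the training split
semhash = SemHash.from_records(records=train_texts)

# Drop test records that (near-)duplicate a training record
result = semhash.deduplicate(records=test_texts)

# Fraction of the test set overlapping with train (~0.47 for enron_spam)
print(result.duplicate_ratio)
```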

Image Benchmarks

All image benchmarks were run with the following setup:
  • Device: Apple Silicon GPU (MPS)
  • ANN backend: Default backend (USearch)
  • Encoder: MobileNetV3-Small (mobilenetv3_small_100.lamb_in1k)
  • Batch size: 128 images per batch
  • Timing includes encoding time, index building time, and deduplication time
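
The image runs differ from the text runs mainly in the encoder. Below is a sketch of how the timm encoder side might be wired up; the `TimmEncoder` wrapper is hypothetical and assumes SemHash accepts any model exposing an `encode` method, with PIL images as records.

```python
import numpy as np
import timm
import torch


class TimmEncoder:
    """Hypothetical wrapper exposing the encode() method SemHash expects."""

    def __init__(self, model_name: str, device: str = "mps", batch_size: int = 128):
        # num_classes=0 makes timm return pooled features instead of logits
        self.model = timm.create_model(model_name, pretrained=True, num_classes=0).to(device).eval()
        config = timm.data.resolve_data_config({}, model=self.model)
        self.transform = timm.data.create_transform(**config)
        self.device = device
        self.batch_size = batch_size

    @torch.no_grad()
    def encode(self, images) -> np.ndarray:
        # Encode PIL images in batches of `batch_size`, matching the setup above
        batches = []
        for i in range(0, len(images), self.batch_size):
            batch = torch.stack(
                [self.transform(img.convert("RGB")) for img in images[i : i + self.batch_size]]
            )
            batches.append(self.model(batch.to(self.device)).cpu().numpy())
        return np.concatenate(batches)


encoder = TimmEncoder("mobilenetv3_small_100.lamb_in1k", device="mps", batch_size=128)
# semhash = SemHash.from_records(records=images, model=encoder)  # assumed usage
```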

Train Deduplication Benchmark

| Dataset | Original Train Size | Deduplicated Train Size | % Removed | Deduplication Time (s) |
|---|---|---|---|---|
| cifar10 | 50000 | 48274 | 3.45 | 61.20 |
| fashion_mnist | 60000 | 16714 | 72.14 | 86.61 |

Train/Test Deduplication Benchmark

| Dataset | Train Size | Test Size | Deduplicated Test Size | % Removed | Deduplication Time (s) |
|---|---|---|---|---|---|
| cifar10 | 50000 | 10000 | 9397 | 6.03 | 67.43 |
| fashion_mnist | 60000 | 10000 | 2052 | 79.48 | 72.14 |

Key Findings

  • Fashion-MNIST: shows very high duplication rates (72% train, 79% test) due to the simple nature of the dataset (10 clothing categories with many visually similar items)
  • CIFAR-10: shows much lower duplication (3.45% train, 6.03% test), as it contains more diverse natural images
  • Speed: Image deduplication is fast even for large datasets (60k images in ~87 seconds on MPS); the deduplication step itself is quick, with most of the time spent encoding images