Benchmarks

We benchmark quality and speed across all methods on ~1,250 queries over 63 repositories in 19 languages.

Main Results

Method | NDCG@10 | Index time | Query p50
CodeRankEmbed Hybrid | 0.862 | 57 s | 16 ms
semble | 0.854 | 263 ms | 1.5 ms
CodeRankEmbed | 0.765 | 57 s | 16 ms
ColGREP | 0.693 | 5.8 s | 124 ms
BM25 | 0.673 | 263 ms | 0.02 ms
grepai | 0.561 | 35 s | 48 ms
probe | 0.387 | - | 207 ms
ripgrep | 0.126 | - | 12 ms

Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid, while indexing 218× faster and answering queries 11× faster, entirely on CPU.
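
The headline ratios can be rechecked from the rounded values in the Main Results table. The short sketch below does that arithmetic. Note that the rounded figures give an index speedup of about 217×; the quoted 218× presumably comes from unrounded timings, which is an assumption here.

```python
# Rounded values copied from the Main Results table (times in milliseconds).
semble = {"ndcg": 0.854, "index_ms": 263.0, "query_ms": 1.5}
cre_hybrid = {"ndcg": 0.862, "index_ms": 57_000.0, "query_ms": 16.0}

quality_ratio = semble["ndcg"] / cre_hybrid["ndcg"]
index_speedup = cre_hybrid["index_ms"] / semble["index_ms"]
query_speedup = cre_hybrid["query_ms"] / semble["query_ms"]

print(f"quality retained: {quality_ratio:.1%}")   # quality retained: 99.1%
print(f"index speedup: {index_speedup:.0f}x")     # index speedup: 217x
print(f"query speedup: {query_speedup:.1f}x")     # query speedup: 10.7x
```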

The charts below plot latency against NDCG@10. Marker size reflects model parameter count.

Figure: Speed vs quality (cold start). Time to first result (index + query) vs NDCG@10.

Figure: Speed vs quality (warm). Query latency on a warm index vs NDCG@10.

Token Efficiency

Coding agents (Claude Code, OpenCode, etc.) typically find code by running grep on keywords and reading the matched files. We model that workflow and compare it against semble’s chunk retrieval across our full benchmark of 1,251 queries.

Figure: Token efficiency (recall vs. retrieved tokens).

Expected tokens per query

For each query, we record the tokens consumed up to the first relevant hit, or 32k if the method never finds a relevant result. The reported figure is the average across all 1,251 queries.
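
As a rough illustration of this metric, the sketch below averages per-query costs and charges 32k tokens for a miss. The function name and the use of None to mark a miss are hypothetical. The sketch does not cap the cost of a hit at 32k; that is an assumption, though it is at least consistent with the baseline's average exceeding 32k.

```python
from typing import Optional, Sequence

MISS_PENALTY = 32_000  # tokens charged when no relevant result is found


def expected_tokens(first_hit_tokens: Sequence[Optional[int]]) -> float:
    """Mean tokens consumed up to the first relevant hit.

    Each entry is the token count at the first relevant hit for one query,
    or None if the method never returned a relevant result.
    """
    if not first_hit_tokens:
        raise ValueError("need at least one query")
    costs = [MISS_PENALTY if t is None else t for t in first_hit_tokens]
    return sum(costs) / len(costs)


# Two hits and one miss: (400 + 1200 + 32000) / 3
print(expected_tokens([400, 1_200, None]))  # 11200.0
```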

MethodExpected tokensSavings
ripgrep + read file45,692baseline
semble56698% fewer

Recall at fixed token budgets

A relevant file is “covered” once any retrieved unit comes from it.
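
The coverage rule can be sketched for a single query as follows. The function and its inputs are hypothetical. Two details are assumptions: a retrieved unit that would push the total past the budget is not counted, and per-query values are presumably averaged across queries to produce the reported numbers.

```python
from typing import Iterable, List, Set, Tuple


def recall_at_budgets(
    retrieved: Iterable[Tuple[str, int]],  # (file path, token count), in rank order
    relevant_files: Set[str],
    budgets: List[int],
) -> List[float]:
    """Fraction of relevant files covered within each token budget."""
    units = list(retrieved)
    recalls = []
    for budget in budgets:
        spent = 0
        covered: Set[str] = set()
        for path, tokens in units:
            if spent + tokens > budget:
                break  # assumption: a unit that does not fit is not counted
            spent += tokens
            if path in relevant_files:
                covered.add(path)  # any unit from the file covers it
        recalls.append(len(covered) / len(relevant_files) if relevant_files else 0.0)
    return recalls


units = [("a.py", 300), ("b.py", 400), ("c.py", 1500)]
print(recall_at_budgets(units, {"a.py", "c.py"}, [500, 1000, 4000]))  # [0.5, 0.5, 1.0]
```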

Method | 500 | 1k | 2k | 4k | 8k | 16k | 32k
semble | 0.685 | 0.849 | 0.938 | 0.976 | 0.991 | 0.996 | 0.996
ripgrep + read file | 0.001 | 0.008 | 0.037 | 0.088 | 0.212 | 0.379 | 0.583

Methodology

Semble returns the top-50 ranked chunks. The ripgrep + read file baseline splits the query into keywords (dropping stopwords and short words), runs rg --fixed-strings --ignore-case for each keyword, then reads the matched files in full, ranked by how many distinct keywords each contains. Both methods search the same set of file types and skip the same ignored directories. Tokens are counted with cl100k_base via tiktoken. A relevant file is “covered” once any retrieved unit overlaps its annotated span.
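
The baseline described above might look roughly like the sketch below. The stopword list, the minimum word length, the tie-break by path, and the --files-with-matches flag are all assumptions, since the source specifies none of them. The helper names are hypothetical, and the rg invocation is only built, not executed.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# Hypothetical stopword list and length cutoff; the source gives neither.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "how", "where", "does"}
MIN_LENGTH = 3


def extract_keywords(query: str) -> List[str]:
    """Split a query into distinct keywords, dropping stopwords and short words."""
    keywords: List[str] = []
    for word in re.findall(r"[A-Za-z0-9_]+", query.lower()):
        if len(word) >= MIN_LENGTH and word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return keywords


def rg_command(keyword: str, root: str) -> List[str]:
    """Build the ripgrep call for one keyword (listing files is an assumption)."""
    return ["rg", "--fixed-strings", "--ignore-case", "--files-with-matches", "--", keyword, root]


def rank_files(matches: Dict[str, Set[str]]) -> List[str]:
    """Order files by how many distinct keywords matched them, then by path."""
    counts = Counter(path for files in matches.values() for path in files)
    return sorted(counts, key=lambda path: (-counts[path], path))


print(extract_keywords("How does the parser handle nested comments?"))
# ['parser', 'handle', 'nested', 'comments']
print(rank_files({"parser": {"a.py", "b.py"}, "nested": {"b.py"}}))
# ['b.py', 'a.py']
```

Each command from rg_command could be run with subprocess.run to fill the matches mapping, and the ranked files then read in full until the token budget is spent.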

By Language

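NDCG@10 per language. The best score in each row is marked with an asterisk.

Language | semble | CRE Hybrid | CRE | ColGREP | grepai | probe | ripgrep
scala | 0.909 | 0.922* | 0.845 | 0.765 | 0.330 | 0.392 | 0.180
cpp | 0.915* | 0.913 | 0.846 | 0.626 | 0.731 | 0.375 | 0.126
ruby | 0.909* | 0.909* | 0.769 | 0.708 | 0.643 | 0.382 | 0.230
elixir | 0.894 | 0.905* | 0.869 | 0.808 | 0.669 | 0.412 | 0.134
javascript | 0.917 | 0.903 | 0.920* | 0.823 | 0.675 | 0.588 | 0.176
zig | 0.913* | 0.901 | 0.807 | 0.474 | 0.755 | 0.369 | 0.000
csharp | 0.885 | 0.889* | 0.743 | 0.614 | 0.277 | 0.392 | 0.117
go | 0.895* | 0.884 | 0.676 | 0.785 | 0.722 | 0.410 | 0.133
python | 0.867 | 0.880* | 0.794 | 0.777 | 0.634 | 0.488 | 0.202
php | 0.858 | 0.874* | 0.758 | 0.663 | 0.402 | 0.340 | 0.123
swift | 0.860 | 0.873* | 0.721 | 0.710 | 0.429 | 0.280 | 0.160
bash | 0.825 | 0.852 | 0.892* | 0.706 | 0.723 | 0.226 | 0.000
lua | 0.823 | 0.847* | 0.803 | 0.798 | 0.699 | 0.336 | 0.000
java | 0.849* | 0.841 | 0.706 | 0.641 | 0.386 | 0.536 | 0.198
kotlin | 0.821 | 0.830* | 0.670 | 0.637 | 0.478 | 0.335 | 0.166
rust | 0.856* | 0.827 | 0.627 | 0.662 | 0.519 | 0.242 | 0.162
c | 0.741 | 0.806* | 0.706 | 0.676 | 0.555 | 0.384 | 0.000
haskell | 0.765 | 0.771 | 0.776* | 0.683 | 0.483 | 0.313 | 0.000
typescript | 0.706 | 0.708* | 0.545 | 0.430 | 0.394 | 0.354 | 0.128
overall | 0.854 | 0.862* | 0.765 | 0.693 | 0.561 | 0.387 | 0.126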
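
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for one query. It assumes binary relevance (a result is either relevant or not) and a ranked list without duplicates. The source does not state how relevance is graded, so the benchmark's own scoring may differ.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """NDCG@k with binary gains: 1 for a relevant item, 0 otherwise."""
    # Discounted cumulative gain over the top-k results (rank 0 is the top hit).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal DCG: every relevant item ranked first, truncated at k.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg_at_k(["a.py", "b.py"], {"a.py"}))            # 1.0
print(round(ndcg_at_k(["x.py", "a.py"], {"a.py"}), 3))  # 0.631
```

A per-language or overall score would then be some average of these per-query values; whether the overall row is weighted by query count is not stated in the source.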

Ablations

“Raw” uses the retrieval scores directly; “+ ranking” feeds them through semble’s hybrid reranker.

Retrieval | Raw | + ranking
BM25 | 0.675 | 0.834
potion-code-16M | 0.650 | 0.821
BM25 + potion-code-16M | - | 0.854

By query category:

Mode | Architecture | Semantic | Symbol
BM25 raw | 0.628 | 0.676 | 0.719
potion-code-16M raw | 0.626 | 0.666 | 0.629
semble BM25 (+ ranking) | 0.770 | 0.819 | 0.957
semble potion-code-16M (+ ranking) | 0.757 | 0.808 | 0.943
semble hybrid | 0.802 | 0.846 | 0.958

Dataset

~1,250 queries over 63 repositories in 19 languages, grouped into three categories:

Category | Queries | What it tests
semantic | 711 | Code that implements a specific behavior or concept
architecture | 343 | Design decisions, module boundaries, structural patterns
symbol | 204 | Named entity lookup (function, class, type, variable)

Languages covered: Bash, C, C++, C#, Elixir, Go, Haskell, Java, JavaScript, Kotlin, Lua, PHP, Python, Ruby, Rust, Scala, Swift, TypeScript, Zig.

Methods

  • ripgrep: fast regex search, included as a raw keyword-match baseline.
  • probe: BM25 keyword ranking backed by tree-sitter parse trees. No persistent index; scans on the fly.
  • ColGREP: late-interaction code retrieval with the LateOn-Code-edge model.
  • grepai: semantic search using nomic-embed-text (137M params) via a local Ollama daemon.
  • CodeRankEmbed: 137M-param transformer embedding model. CRE Hybrid fuses its dense scores with BM25.
  • semble: potion-code-16M static embeddings + BM25 + the semble reranking stack.
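
Several of the methods above combine a BM25 ranking with a dense ranking. The source does not say how CRE Hybrid or semble perform that fusion, so the sketch below uses reciprocal rank fusion (RRF), a common rank-based technique, purely as an illustration. It is not a description of either system, and the constant k = 60 is the conventional default rather than a value from the source.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists; each item scores the sum of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]   # hypothetical keyword ranking
dense_ranking = ["b", "c", "a"]  # hypothetical embedding ranking
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))  # ['b', 'a', 'c']
```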