Usage

Indexing

Create an index from a local directory or a remote git repository:

from semble import SembleIndex
# Index a local directory
index = SembleIndex.from_path("./my-project")
# Index a remote git repository (cloned and cached locally)
index = SembleIndex.from_git("https://github.com/MinishLab/model2vec")

Indexing a full repo typically takes under 300 ms. Remote repos are cloned on first use and cached for the lifetime of the process.

Advanced options

Both from_path and from_git accept optional parameters to control what gets indexed:

index = SembleIndex.from_path(
    "./my-project",
    extensions=frozenset({".py", ".ts"}),  # only index these file types
    ignore=frozenset({"dist", "node_modules"}),  # skip these directories
    include_text_files=True,  # also index .md, .yaml, .json, etc.
)
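To make the effect of extensions and ignore concrete, here is a minimal standalone sketch of that kind of file selection. It does not use semble, and the source does not describe how semble walks the directory; select_files is a hypothetical helper, and matching ignored names against every parent directory is an assumption.

```python
from pathlib import Path


def select_files(root, extensions, ignore):
    """Return relative paths under root whose suffix is allowed.

    Hypothetical helper for illustration only; not part of semble.
    """
    root = Path(root)
    selected = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        # Assumption: a file is skipped if any of its parent directories is ignored.
        if any(part in ignore for part in relative.parts[:-1]):
            continue
        if path.suffix in extensions:
            selected.append(relative.as_posix())
    return selected
```

Calling select_files with the same frozensets as above would keep .py and .ts files and drop anything under dist or node_modules.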

from_git additionally accepts a ref parameter to check out a specific branch or tag:

index = SembleIndex.from_git(
    "https://github.com/MinishLab/model2vec",
    ref="v2.0.0",  # branch or tag; defaults to the remote HEAD
)

Searching

Search the index with a natural-language description or a code snippet:

results = index.search("save model to disk", top_k=5)
for result in results:
    print(result.chunk.file_path, result.chunk.start_line)
    print(result.chunk.content)
    print()

Filtering

Restrict results to specific languages or files using filter_languages and filter_paths:

# Only return results from Python files
results = index.search("parse config", filter_languages=["python"])
# Only return results from specific files
results = index.search("parse config", filter_paths=["src/config.py", "src/settings.py"])

Finding related chunks

Given any search result, find other chunks that are semantically similar to it:

results = index.search("tokenizer encode", top_k=1)
related = index.find_related(results[0], top_k=5)
for r in related:
    print(r.chunk.file_path, r.chunk.start_line)

This is useful for exploring an implementation: start from one function and surface the code that uses or resembles it.
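As a rough intuition for "semantically similar", the sketch below ranks chunks by cosine similarity between embedding vectors. This is an assumption for illustration: the source does not say how find_related measures similarity, and cosine_similarity, most_similar, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, vectors, top_k=5):
    """Return the ids of the top_k vectors closest to vectors[query_id], excluding itself."""
    query = vectors[query_id]
    scored = [
        (cosine_similarity(query, vec), chunk_id)
        for chunk_id, vec in vectors.items()
        if chunk_id != query_id
    ]
    # Highest similarity first; ties are broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for _, chunk_id in scored[:top_k]]


# Toy 2-dimensional "embeddings" keyed by hypothetical chunk ids.
toy_vectors = {"encode": [1.0, 0.0], "encode_batch": [0.9, 0.1], "save": [0.0, 1.0]}
neighbours = most_similar("encode", toy_vectors, top_k=2)
```

Real embeddings have hundreds of dimensions, but the ranking idea is the same: chunks whose vectors point in a similar direction come back first.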

Search Modes

The mode parameter controls the retrieval strategy:

# Default: hybrid (BM25 + semantic, recommended)
results = index.search("parse config", mode="hybrid")
# Semantic only (best for natural-language queries)
results = index.search("parse config", mode="semantic")
# Lexical only (best for exact identifier lookups)
results = index.search("parse_config", mode="bm25")
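To illustrate what a hybrid mode can mean in practice, the sketch below merges a lexical ranking and a semantic ranking with reciprocal rank fusion (RRF). RRF is one common fusion technique, chosen here for illustration; the source does not state how semble combines BM25 and semantic results. The function name, the constant k=60, and the chunk ids are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of ids into a single best-first list."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); items ranked well in several lists accumulate more.
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical chunk ids as returned by two separate retrievers.
bm25_ranking = ["config.py:10", "cli.py:5", "utils.py:88"]
semantic_ranking = ["settings.py:3", "config.py:10", "cli.py:5"]
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
```

In this toy input, a chunk that appears near the top of both lists outranks one that tops only a single list, which is the behaviour that makes a hybrid strategy a reasonable default.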

Result Fields

Each result object exposes:

result = results[0]
result.score # float, relevance score
result.chunk.file_path # "src/config.py"
result.chunk.start_line # 42
result.chunk.end_line # 67
result.chunk.content # raw source code of the chunk
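The fields above are enough to post-process results without further library calls. As a sketch, the helper below groups results by file and orders each group by start line. Chunk and Result are stand-in dataclasses that only mirror the documented attribute names so the example runs without semble installed; group_by_file is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in mirroring the documented chunk fields; not semble's own class."""
    file_path: str
    start_line: int
    end_line: int
    content: str


@dataclass
class Result:
    """Stand-in mirroring the documented result fields; not semble's own class."""
    score: float
    chunk: Chunk


def group_by_file(results):
    """Map each file path to its results, sorted by the chunk's start line."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result.chunk.file_path].append(result)
    for file_results in grouped.values():
        file_results.sort(key=lambda r: r.chunk.start_line)
    return dict(grouped)


sample = [
    Result(0.91, Chunk("src/config.py", 42, 67, "def parse_config(): ...")),
    Result(0.80, Chunk("src/settings.py", 5, 20, "class Settings: ...")),
    Result(0.75, Chunk("src/config.py", 10, 30, "def load_file(): ...")),
]
by_file = group_by_file(sample)
```

Because the helper only reads attributes, it should work equally well on any objects that expose the same fields.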

Index Stats

Inspect the state of an index with the stats property:

stats = index.stats
stats.indexed_files # number of files indexed
stats.total_chunks # total number of chunks
stats.languages # dict mapping language name to chunk count
# e.g. {"python": 412, "typescript": 88}
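Since languages is described as a plain mapping from language name to chunk count, it can be summarized with ordinary dictionary code. The sketch below converts counts into fractions of the whole; language_shares is a hypothetical helper, and the input reuses the example values shown above.

```python
def language_shares(languages):
    """Convert a {language: chunk_count} mapping into {language: fraction_of_total}."""
    total = sum(languages.values())
    if total == 0:
        return {}  # an empty index has no shares to report
    return {name: count / total for name, count in languages.items()}


shares = language_shares({"python": 412, "typescript": 88})
```

With those example counts, Python accounts for 412 of 500 chunks and TypeScript for the remaining 88.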