Usage
Indexing
Create an index from a local directory or a remote git repository:
from semble import SembleIndex
# Index a local directory
index = SembleIndex.from_path("./my-project")

# Index a remote git repository (cloned and cached locally)
index = SembleIndex.from_git("https://github.com/MinishLab/model2vec")

Indexing a full repo typically takes under 300 ms. Remote repos are cloned on first use and cached for the lifetime of the process.
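To check the indexing time on your own project, a small timing wrapper is usually enough. The sketch below is a hypothetical helper, not part of semble: the name `timed` is an assumption, and it uses a stand-in workload so it runs without semble installed.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


# Stand-in workload (hypothetical), used only to show the call shape.
total, ms = timed(sum, range(1_000_000))
print(f"sum={total} took {ms:.1f} ms")

# With semble installed, the same helper can wrap the constructors shown above:
#   index, ms = timed(SembleIndex.from_path, "./my-project")
```

For a remote repository, timing the same `from_git` call twice in one process should show the effect of the clone cache, since only the first call needs to clone.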
Advanced options
Both from_path and from_git accept optional parameters to control what gets indexed:
index = SembleIndex.from_path(
    "./my-project",
    extensions=frozenset({".py", ".ts"}),  # only index these file types
    ignore=frozenset({"dist", "node_modules"}),  # skip these directories
    include_text_files=True,  # also index .md, .yaml, .json, etc.
)

from_git additionally accepts a ref parameter to check out a specific branch or tag:

index = SembleIndex.from_git(
    "https://github.com/MinishLab/model2vec",
    ref="v2.0.0",  # branch or tag; defaults to the remote HEAD
)

Searching
Search the index with a natural-language description or a code snippet:
results = index.search("save model to disk", top_k=5)
for result in results:
    print(result.chunk.file_path, result.chunk.start_line)
    print(result.chunk.content)
    print()

Filtering
Restrict results to specific languages or files using filter_languages and filter_paths:
# Only return results from Python files
results = index.search("parse config", filter_languages=["python"])

# Only return results from specific files
results = index.search("parse config", filter_paths=["src/config.py", "src/settings.py"])

Finding Related Code
Given any search result, find other chunks that are semantically similar to it:
results = index.search("tokenizer encode", top_k=1)
related = index.find_related(results[0], top_k=5)

for r in related:
    print(r.chunk.file_path, r.chunk.start_line)

This is useful for exploring implementations. Start from one function and surface the code that uses or resembles it.
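When exploring related code, it can help to see which files the related chunks cluster in. The sketch below is a hypothetical helper (`group_by_file` is an assumed name, not a semble function) that relies only on the `chunk.file_path` and `chunk.start_line` fields used above. It builds stand-in result objects so it runs without semble installed.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def group_by_file(results: Iterable[Any]) -> Dict[str, List[int]]:
    """Map each file path to the sorted start lines of its chunks."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for r in results:
        grouped[r.chunk.file_path].append(r.chunk.start_line)
    return {path: sorted(lines) for path, lines in grouped.items()}


def _fake_result(path: str, line: int) -> SimpleNamespace:
    # Stand-in for a search result (hypothetical); real results come from
    # index.search(...) or index.find_related(...).
    return SimpleNamespace(chunk=SimpleNamespace(file_path=path, start_line=line))


demo = [
    _fake_result("src/tokenizer.py", 88),
    _fake_result("src/model.py", 10),
    _fake_result("src/tokenizer.py", 12),
]
for path, lines in group_by_file(demo).items():
    print(path, lines)
```

In real use, the list returned by `index.find_related(...)` would be passed in place of `demo`.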
Search Modes
The mode parameter controls the retrieval strategy:
# Default: hybrid (BM25 + semantic, recommended)
results = index.search("parse config", mode="hybrid")

# Semantic only (best for natural-language queries)
results = index.search("parse config", mode="semantic")

# Lexical only (best for exact identifier lookups)
results = index.search("parse_config", mode="bm25")

Result Fields
Each result object exposes:
result = results[0]
result.score             # float, relevance score
result.chunk.file_path   # "src/config.py"
result.chunk.start_line  # 42
result.chunk.end_line    # 67
result.chunk.content     # raw source code of the chunk

Index Stats
Inspect the state of an index with the stats property:
stats = index.stats
stats.indexed_files  # number of files indexed
stats.total_chunks   # total number of chunks
stats.languages      # dict mapping language name to chunk count
                     # e.g. {"python": 412, "typescript": 88}
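Since languages maps each language name to a chunk count, a per-language share of the index is easy to derive. The sketch below is a hypothetical helper (`language_shares` is an assumed name, not part of semble) and works on a plain dict, so it runs without an index.

```python
from typing import Dict


def language_shares(languages: Dict[str, int]) -> Dict[str, float]:
    """Return each language's share of all chunks, as a percentage."""
    total = sum(languages.values())
    if total == 0:
        # An empty index has no meaningful shares; report zeros.
        return {name: 0.0 for name in languages}
    return {name: round(100.0 * count / total, 1) for name, count in languages.items()}


# Example counts taken from the comment above; with a real index,
# pass stats.languages instead.
shares = language_shares({"python": 412, "typescript": 88})
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```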