Representative Sampling

SemHash provides a flexible interface for representative sampling: selecting a subset of records that best represents the entire dataset. This is particularly useful when you want to reduce the size of a dataset while retaining its diversity.

This works in two stages. First, the samples with the highest average similarity to the other samples in the dataset (the most “central” samples) are selected as candidates. Then, a diversification strategy picks a diverse set of samples from within that candidate set.

SemHash uses Pyversity for diversification, supporting multiple strategies.
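The two-stage procedure described above can be sketched in plain Python. This is an illustrative sketch only, not SemHash's or Pyversity's actual implementation: the function names `cosine` and `select_representatives` are hypothetical, the embeddings are assumed to be plain lists of floats, and the diversification step is a hand-rolled greedy MMR in its common textbook form.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_representatives(vectors, selection_size=10, candidate_limit=100, diversity=0.5):
    """Return indices of representatives (hypothetical helper, for illustration)."""
    n = len(vectors)
    sims = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

    # Stage 1: centrality is the average similarity to every other sample.
    centrality = [
        (sum(sims[i]) - sims[i][i]) / (n - 1) if n > 1 else 0.0 for i in range(n)
    ]
    # Keep only the most central samples as candidates.
    candidates = sorted(range(n), key=lambda i: centrality[i], reverse=True)[:candidate_limit]

    # Stage 2: greedy MMR over the candidate set.
    selected = []
    while candidates and len(selected) < selection_size:
        def mmr(i):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((sims[i][j] for j in selected), default=0.0)
            return (1 - diversity) * centrality[i] - diversity * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `diversity=0.0` the sketch simply returns the most central samples; with `diversity=1.0` each pick after the first is the candidate least similar to those already chosen. A small `candidate_limit` restricts how far from the center the diversification step is allowed to reach.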

Representative Sampling from a Single Dataset

To perform representative sampling from a single dataset, you can use the self_find_representative method. This method will select a subset of samples that best represent the entire dataset based on their semantic similarity.

from datasets import load_dataset
from semhash import SemHash

# Load a dataset to sample from
texts = load_dataset("ag_news", split="train")["text"]

# Initialize a SemHash instance
semhash = SemHash.from_records(records=texts)

# Find representative samples from the texts
representative_texts = semhash.self_find_representative().selected

Parameters

Parameter	Type	Default	Description
`selection_size`	`int`	`10`	Number of representatives to select.
`candidate_limit`	`int | Literal['auto']`	`"auto"`	Number of top candidates to consider for diversification. Defaults to `"auto"`, which calculates the limit based on the total number of records (typically 10% of the dataset, with a min of 100 and max of 1000).
`diversity`	`float`	`0.5`	Trade-off between diversity (1.0) and relevance (0.0). Must be between 0 and 1. Higher values prioritize diversity, lower values prioritize relevance.
`strategy`	`Strategy | str`	`Strategy.MMR`	Diversification strategy to use. Options: `"MMR"`, `"MSD"`, `"DPP"`, `"COVER"`, `"SSD"`. Default is MMR (Maximal Marginal Relevance).
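As a rough illustration of the `"auto"` rule for `candidate_limit` described in the table, the clamp can be written as below. The function name `auto_candidate_limit` and its keyword arguments are hypothetical, and the exact rounding SemHash uses is an assumption; only the 10% fraction and the bounds of 100 and 1000 come from the description above.

```python
def auto_candidate_limit(total_records, minimum=100, maximum=1000):
    """Hypothetical helper: 10% of the dataset, clamped to [minimum, maximum]."""
    ten_percent = total_records // 10  # integer arithmetic; rounding is an assumption
    return max(minimum, min(maximum, ten_percent))
```

Under these assumptions, a dataset of 500 records yields a limit of 100, 5,000 records yield 500, and 120,000 records yield 1000.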

Representative Sampling Across Multiple Datasets

To perform representative sampling across multiple datasets, you can use the find_representative method. This method allows you to select a subset of samples from one dataset that best represents another dataset.

from datasets import load_dataset
from semhash import SemHash

# Load the training and test splits
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance with the training data
semhash = SemHash.from_records(records=train_texts)

# Find representative samples from the test data against the training data
representative_test_texts = semhash.find_representative(records=test_texts).selected

Parameters

Parameter	Type	Default	Description
`records`	`Sequence[Record]`		The new set of records (e.g., a test set) from which to select representative samples, compared against the fitted dataset.
`selection_size`	`int`	`10`	Number of representatives to select.
`candidate_limit`	`int | Literal['auto']`	`"auto"`	Number of top candidates to consider for diversification. Defaults to `"auto"`, which calculates the limit based on the total number of records (typically 10% of the dataset, with a min of 100 and max of 1000).
`diversity`	`float`	`0.5`	Trade-off between diversity (1.0) and relevance (0.0). Must be between 0 and 1. Higher values prioritize diversity, lower values prioritize relevance.
`strategy`	`Strategy | str`	`Strategy.MMR`	Diversification strategy to use. Options: `"MMR"`, `"MSD"`, `"DPP"`, `"COVER"`, `"SSD"`. Default is MMR (Maximal Marginal Relevance).

Representative Sampling from a Multi-Column Dataset

If you have a multi-column dataset, pass the columns to use for representative sampling via the columns argument when initializing SemHash.

from datasets import load_dataset
from semhash import SemHash

# Load the dataset
dataset = load_dataset("squad_v2", split="train")

# Convert the dataset to a list of dictionaries
records = [dict(row) for row in dataset]

# Initialize SemHash with the columns to use for representative sampling
semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Find representative samples from the records
representative_records = semhash.self_find_representative().selected

Customizing Diversification

You can customize the diversification strategy and trade-off between relevance and diversity:

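from datasets import load_dataset
from pyversity import Strategy
from semhash import SemHash

# Load a dataset and initialize a SemHash instance
texts = load_dataset("ag_news", split="train")["text"]
semhash = SemHash.from_records(records=texts)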

# Use different strategies
mmr_samples = semhash.self_find_representative(
    selection_size=20,
    diversity=0.5,
    strategy=Strategy.MMR  # Default
).selected

dpp_samples = semhash.self_find_representative(
    selection_size=20,
    diversity=0.8,  # Higher diversity
    strategy=Strategy.DPP  # Different strategy
).selected