Representative Sampling
SemHash provides a flexible interface for representative sampling: selecting a subset of data that best represents the entire dataset. This is particularly useful when you want to reduce the size of your dataset while retaining its diversity.
This works by first selecting samples that have the highest average similarity to other samples in the dataset (the most “central” samples), and then applying a diversification strategy to select samples that are diverse within that candidate set.
SemHash uses Pyversity for diversification, supporting multiple strategies.
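The two-stage idea described above can be sketched in plain Python. This is an illustrative toy, not SemHash's or Pyversity's actual implementation: the function name `representative_sample`, the toy vectors, and the use of textbook greedy MMR (with `1 - diversity` weighting relevance) are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def representative_sample(vectors, selection_size, candidate_limit, diversity=0.5):
    """Hypothetical sketch: rank by centrality, then diversify with greedy MMR."""
    n = len(vectors)
    sims = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

    # Stage 1: centrality = average similarity to every *other* sample.
    centrality = [
        sum(sims[i][j] for j in range(n) if j != i) / max(n - 1, 1)
        for i in range(n)
    ]
    candidates = sorted(range(n), key=lambda i: centrality[i], reverse=True)
    candidates = candidates[:candidate_limit]

    # Stage 2: greedy MMR over the candidate set.
    selected = []

    def mmr(i):
        # Penalize similarity to anything that was already selected.
        redundancy = max((sims[i][j] for j in selected), default=0.0)
        return (1 - diversity) * centrality[i] - diversity * redundancy

    while candidates and len(selected) < selection_size:
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy "embeddings": indices 0 and 1 are exact duplicates.
vectors = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

print(representative_sample(vectors, 3, 4, diversity=0.0))  # [2, 0, 1]
print(representative_sample(vectors, 3, 4, diversity=0.8))  # [2, 0, 3]
```

With `diversity=0.0` the sketch simply returns the most central samples, including the duplicate at index 1. With `diversity=0.8` the duplicate is skipped in favor of the less central but dissimilar sample at index 3.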
Representative Sampling from a Single Dataset
To perform representative sampling from a single dataset, you can use the self_find_representative method. This method will select a subset of samples that best represent the entire dataset based on their semantic similarity.

    from datasets import load_dataset
    from semhash import SemHash

    # Load a dataset to filter
    texts = load_dataset("ag_news", split="train")["text"]

    # Initialize a SemHash instance
    semhash = SemHash.from_records(records=texts)

    # Find representative samples from the texts
    representative_texts = semhash.self_find_representative().selected

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| selection_size | int | 10 | Number of representatives to select. |
| candidate_limit | int \| Literal['auto'] | auto | Number of top candidates to consider for diversification. Defaults to "auto", which calculates the limit based on the total number of records (typically 10% of the dataset, with a min of 100 and max of 1000). |
| diversity | float | 0.5 | Trade-off between diversity (1.0) and relevance (0.0). Must be between 0 and 1. Higher values prioritize diversity, lower values prioritize relevance. |
| strategy | Strategy \| str | Strategy.MMR | Diversification strategy to use. Options: "MMR", "MSD", "DPP", "COVER", "SSD". Default is MMR (Maximal Marginal Relevance). |
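The "auto" rule for `candidate_limit` can be read as a simple clamp. The helper below is a hypothetical sketch of that reading (the name `resolve_candidate_limit` and the integer rounding are assumptions); the library's real calculation may differ in details, which is why the table says "typically".

```python
def resolve_candidate_limit(n_records, candidate_limit="auto"):
    """Hypothetical helper: turn candidate_limit into a concrete integer."""
    if candidate_limit != "auto":
        # An explicit integer is used as given.
        return int(candidate_limit)
    # Roughly 10% of the dataset, clamped to the range [100, 1000].
    return max(100, min(1000, n_records // 10))


for n in (500, 5_000, 120_000):
    print(n, resolve_candidate_limit(n))
# 500 100
# 5000 500
# 120000 1000
```

Note that this sketch does not handle datasets with fewer than 100 records, where the lower bound would exceed the dataset size.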
Representative Sampling Across Multiple Datasets
To perform representative sampling across multiple datasets, you can use the find_representative method. This method allows you to select a subset of samples from one dataset that best represents another dataset.

    from datasets import load_dataset
    from semhash import SemHash

    # Load two datasets to filter
    train_texts = load_dataset("ag_news", split="train")["text"]
    test_texts = load_dataset("ag_news", split="test")["text"]

    # Initialize a SemHash instance with the training data
    semhash = SemHash.from_records(records=train_texts)

    # Find representative samples from the test data against the training data
    representative_test_texts = semhash.find_representative(records=test_texts).selected

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| records | Sequence[Record] | | The new set of records (e.g., a test set) from which to select representative samples, compared against the fitted dataset. |
| selection_size | int | 10 | Number of representatives to select. |
| candidate_limit | int \| Literal['auto'] | auto | Number of top candidates to consider for diversification. Defaults to "auto", which calculates the limit based on the total number of records (typically 10% of the dataset, with a min of 100 and max of 1000). |
| diversity | float | 0.5 | Trade-off between diversity (1.0) and relevance (0.0). Must be between 0 and 1. Higher values prioritize diversity, lower values prioritize relevance. |
| strategy | Strategy \| str | Strategy.MMR | Diversification strategy to use. Options: "MMR", "MSD", "DPP", "COVER", "SSD". Default is MMR (Maximal Marginal Relevance). |
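One plausible way to think about the cross-dataset case is that each new record is scored by its average similarity to the fitted records, and the highest-scoring ones become candidates. The sketch below illustrates only that scoring step. It is an assumption about the general approach, not a description of SemHash's confirmed internals, and `cross_dataset_relevance` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cross_dataset_relevance(new_vectors, fitted_vectors):
    """Average similarity of each new record to all fitted records."""
    return [
        sum(cosine(v, f) for f in fitted_vectors) / len(fitted_vectors)
        for v in new_vectors
    ]


# Toy embeddings: the fitted set points mostly along the first axis.
fitted = [[1.0, 0.0], [0.8, 0.6]]
new = [[0.0, 1.0], [1.0, 0.0]]

scores = cross_dataset_relevance(new, fitted)
# Rank new records from most to least similar to the fitted set.
ranked = sorted(range(len(new)), key=lambda i: scores[i], reverse=True)
print(ranked)  # [1, 0]
```

The ranked records would then go through the same diversification step as in the single-dataset case.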
Representative Sampling from a Multi-Column Dataset
If you have a multi-column dataset, you can sample representatives from it by specifying the columns to use for representative sampling.

    from datasets import load_dataset
    from semhash import SemHash

    # Load the dataset
    dataset = load_dataset("squad_v2", split="train")

    # Convert the dataset to a list of dictionaries
    records = [dict(row) for row in dataset]

    # Initialize SemHash with the columns to use for representative sampling
    semhash = SemHash.from_records(records=records, columns=["question", "context"])

    # Find representative samples from the records
    representative_records = semhash.self_find_representative().selected

Customizing Diversification
You can customize the diversification strategy and trade-off between relevance and diversity:

    from semhash import SemHash
    from pyversity import Strategy

    # `texts` is a list of strings, e.g. loaded as in the single-dataset example
    semhash = SemHash.from_records(records=texts)

    # Use different strategies
    mmr_samples = semhash.self_find_representative(
        selection_size=20,
        diversity=0.5,
        strategy=Strategy.MMR,  # Default
    ).selected

    dpp_samples = semhash.self_find_representative(
        selection_size=20,
        diversity=0.8,  # Higher diversity
        strategy=Strategy.DPP,  # Different strategy
    ).selected
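To get a feel for what the `diversity` value does, here is a small worked example using the textbook MMR score. It assumes relevance is weighted by `1 - diversity` and redundancy by `diversity`. Pyversity's exact formulas may differ, and other strategies such as DPP use different formulations. The candidate numbers are made up for illustration.

```python
def mmr_score(relevance, max_sim_to_selected, diversity):
    """Textbook MMR: reward relevance, penalize similarity to selected items."""
    return (1 - diversity) * relevance - diversity * max_sim_to_selected


# Hypothetical candidates: (relevance, max similarity to already-selected items)
candidates = {
    "near_duplicate": (0.90, 0.95),  # very central, but redundant
    "fresh": (0.60, 0.20),           # less central, but adds something new
}


def pick(diversity):
    """Return the name of the candidate with the highest MMR score."""
    return max(
        candidates,
        key=lambda name: mmr_score(*candidates[name], diversity),
    )


for diversity in (0.2, 0.5, 0.8):
    print(diversity, pick(diversity))
# 0.2 near_duplicate
# 0.5 fresh
# 0.8 fresh
```

At low diversity the highly relevant near-duplicate still wins. From the default of 0.5 upward, the redundancy penalty outweighs its relevance advantage in this example.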