Representative Sampling from a Single Dataset
To perform representative sampling from a single dataset, use the self_find_representative method. It selects the subset of samples that best represents the entire dataset, based on semantic similarity.
Parameters
Number of representatives to select.
Number of top candidates to consider for diversification. Defaults to "auto", which calculates the limit from the total number of records (typically 10% of the dataset, with a minimum of 100 and a maximum of 1000).
Trade-off between diversity (1.0) and relevance (0.0). Must be between 0 and 1. Higher values prioritize diversity, lower values prioritize relevance.
Diversification strategy to use. Options:
"MMR", "MSD", "DPP", "COVER", "SSD". Default is MMR (Maximal Marginal Relevance).
Representative Sampling Across Multiple Datasets
To perform representative sampling across multiple datasets, use the find_representative method. It selects a subset of samples from one dataset that best represents another dataset.
Parameters
The new set of records (e.g., a test set) to match against the fitted dataset when finding representative samples.
Number of representatives to select.
Number of top candidates to consider for diversification. Defaults to "auto", which calculates the limit from the total number of records (typically 10% of the dataset, with a minimum of 100 and a maximum of 1000).
Trade-off between diversity (1.0) and relevance (0.0). Must be between 0 and 1. Higher values prioritize diversity, lower values prioritize relevance.
Diversification strategy to use. Options:
"MMR", "MSD", "DPP", "COVER", "SSD". Default is MMR (Maximal Marginal Relevance).