Outlier Filtering
Filter outliers with SemHash
SemHash provides a simple way to filter outliers from a dataset. This works by selecting the samples that have the lowest average similarity to other samples in the dataset. This is particularly useful when you want to remove samples that are significantly different from the rest of the dataset, which can help improve the quality of your data.
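The idea of "lowest average similarity" can be illustrated with a small standard-library sketch. This is not SemHash's implementation: it uses exact brute-force cosine similarity over hand-written toy vectors, and the helper names (cosine, average_similarities, lowest_scoring_indices) and the outlier_fraction parameter are hypothetical, chosen only for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_similarities(vectors):
    """Average similarity of each vector to every other vector."""
    scores = []
    for i, v in enumerate(vectors):
        others = [cosine(v, w) for j, w in enumerate(vectors) if j != i]
        scores.append(sum(others) / len(others))
    return scores


def lowest_scoring_indices(scores, outlier_fraction):
    """Indices of the lowest-scoring fraction of samples (at least one)."""
    n_outliers = max(1, int(len(scores) * outlier_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_outliers])


# Toy 2-D "embeddings": four point in a similar direction, one does not.
vectors = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0), (0.8, 0.1), (-0.1, 1.0)]
scores = average_similarities(vectors)
outliers = lowest_scoring_indices(scores, outlier_fraction=0.2)
print(outliers)
```

The last vector points away from the others, so it has the lowest average similarity and is the one flagged.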
Filter Outliers from a Single Dataset
To filter outliers from a single dataset, you can use the self_filter_outliers method. This method will remove samples that are considered outliers based on their semantic similarity to the rest of the dataset.
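A minimal usage sketch follows. It assumes the semhash package is installed and that a SemHash instance is built with SemHash.from_records, as in the project's README. The wrapper function remove_self_outliers is a hypothetical helper added for illustration, and the import is placed inside it so the snippet still loads where semhash is missing.

```python
def remove_self_outliers(texts):
    """Return the FilterResult for outliers within a single list of texts."""
    # Imported lazily: semhash is a third-party package (pip install semhash).
    from semhash import SemHash

    # Assumption: from_records indexes the records with the default encoder.
    semhash = SemHash.from_records(records=texts)

    # Flag the records that are least similar to the rest of the dataset.
    return semhash.self_filter_outliers()
```

Calling remove_self_outliers(texts).selected then gives the records that were kept, and .filtered gives the records flagged as outliers.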
Filter Outliers Across Multiple Datasets
To filter outliers across multiple datasets, you can use the filter_outliers method. This method allows you to remove outliers from one dataset based on another dataset, which is useful for ensuring that your test set does not contain samples that are significantly different from your training set.
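The train/test case can be sketched as below, again assuming semhash is installed and that the instance is created with SemHash.from_records as in the project's README. The wrapper remove_test_outliers and its argument names are hypothetical.

```python
def remove_test_outliers(train_texts, test_texts):
    """Return the FilterResult for test records scored against the train set."""
    # Imported lazily: semhash is a third-party package (pip install semhash).
    from semhash import SemHash

    # Build the index on the reference (training) data.
    semhash = SemHash.from_records(records=train_texts)

    # Flag test records that are dissimilar to the indexed training data.
    return semhash.filter_outliers(records=test_texts)
```

Here the training set only serves as the reference; the returned selected and filtered records both come from the test set.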
Filter Outliers from a Multi-Column Dataset
If you have a multi-column dataset, you can filter outliers from it by specifying the columns to use for outlier detection and filtering.
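For multi-column data, a sketch might look like the following. It assumes that records are passed as a list of dictionaries and that the columns to use are given through a columns argument to SemHash.from_records, as the project's README shows for deduplication. The wrapper remove_multi_column_outliers and the example column names in the comment are hypothetical.

```python
def remove_multi_column_outliers(records, columns):
    """Return the FilterResult for dict records, using only the given columns."""
    # Imported lazily: semhash is a third-party package (pip install semhash).
    from semhash import SemHash

    # records: list of dicts, e.g. {"question": "...", "context": "..."}
    # columns: the keys whose text should drive outlier detection.
    semhash = SemHash.from_records(records=records, columns=columns)

    return semhash.self_filter_outliers()
```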
FilterResult Functionality
The FilterResult returned by the outlier filtering methods provides several useful attributes:
selected: The records that were not considered outliers.
filtered: The records that were considered outliers.
scores_selected: The similarity scores for the selected records.
scores_filtered: The similarity scores for the filtered records.
filter_ratio: The ratio of records that were filtered out as outliers.
selected_ratio: The ratio of records that were selected as non-outliers.
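To make the relationship between these attributes concrete, here is a small stand-in class. ToyFilterResult is hypothetical and is not SemHash's FilterResult; it only reuses the attribute names, and it assumes both ratios are computed relative to the combined number of selected and filtered records.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFilterResult:
    """Hypothetical stand-in that reuses the FilterResult attribute names."""

    selected: list
    filtered: list
    scores_selected: list = field(default_factory=list)
    scores_filtered: list = field(default_factory=list)

    @property
    def filter_ratio(self):
        # Share of all records that were flagged as outliers.
        total = len(self.selected) + len(self.filtered)
        return len(self.filtered) / total if total else 0.0

    @property
    def selected_ratio(self):
        # Share of all records that were kept.
        total = len(self.selected) + len(self.filtered)
        return len(self.selected) / total if total else 0.0


# Made-up records and scores, for illustration only.
toy = ToyFilterResult(
    selected=["a", "b", "c"],
    filtered=["d"],
    scores_selected=[0.8, 0.7, 0.9],
    scores_filtered=[0.1],
)
print(toy.filter_ratio, toy.selected_ratio)
```

With three records kept and one flagged, the two ratios are 0.25 and 0.75 and sum to 1.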