Filter outliers with SemHash
To filter outliers within a single dataset, use the self_filter_outliers method. It removes samples that are considered outliers based on their semantic similarity to the rest of the dataset.
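A minimal sketch of how this call might look in practice. It assumes the SemHash.from_records constructor shown in the SemHash README; the helper name drop_self_outliers is hypothetical and only wraps the two library calls. No optional parameters are passed, so the library defaults apply.

```python
from typing import Sequence


def drop_self_outliers(texts: Sequence[str]):
    """Return the result of outlier filtering within a single dataset.

    Hypothetical helper: only `SemHash.from_records` and
    `self_filter_outliers` come from the library itself.
    """
    # Imported lazily so this module still loads where semhash is not installed.
    from semhash import SemHash

    # Build the index from the dataset itself ...
    semhash = SemHash.from_records(records=list(texts))
    # ... then filter outliers relative to the rest of that same dataset.
    return semhash.self_filter_outliers()
```

Calling drop_self_outliers(texts).selected would then give the records that were kept.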
Parameters
To filter outliers across datasets, use the filter_outliers method. It removes outliers from one dataset based on another dataset, which is useful for ensuring that your test set does not contain samples that differ significantly from your training set.
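The train/test use case described above could be sketched as follows. This assumes the index is built from the training set with SemHash.from_records and that filter_outliers accepts the second dataset through a records argument, as in the SemHash README; the helper name drop_test_outliers is hypothetical.

```python
from typing import Sequence


def drop_test_outliers(train_texts: Sequence[str], test_texts: Sequence[str]):
    """Return the result of filtering test-set outliers against a training set.

    Hypothetical helper: only `SemHash.from_records` and `filter_outliers`
    come from the library itself.
    """
    # Imported lazily so this module still loads where semhash is not installed.
    from semhash import SemHash

    # The training set acts as the reference dataset.
    semhash = SemHash.from_records(records=list(train_texts))
    # Test records that differ too much from the reference are filtered out.
    return semhash.filter_outliers(records=list(test_texts))
```

The returned object's selected attribute would hold the test records that were not flagged as outliers.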
Parameters
The FilterResult returned by the outlier filtering methods provides several useful attributes:
selected: The records that were not considered outliers.
filtered: The records that were considered outliers.
scores_selected: The similarity scores for the selected records.
scores_filtered: The similarity scores for the filtered records.
filter_ratio: The ratio of records that were filtered out as outliers.
selected_ratio: The ratio of records that were selected as non-outliers.
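As a rough illustration of reading these attributes, the sketch below collects a few of them into a dictionary. The function name summarize_filter_result is hypothetical; it relies only on the attribute names listed above and works with any object that exposes them.

```python
from typing import Any, Dict


def summarize_filter_result(result: Any) -> Dict[str, Any]:
    """Collect headline numbers from an outlier filtering result.

    Hypothetical helper: it reads only the attributes named in this section.
    """
    return {
        # How many records were kept and how many were flagged as outliers.
        "n_selected": len(result.selected),
        "n_filtered": len(result.filtered),
        # Ratios as reported by the result object itself.
        "filter_ratio": result.filter_ratio,
        "selected_ratio": result.selected_ratio,
    }
```

A summary like this can be logged after each filtering run to check that the share of removed records stays within the range you expect.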