[MRG] Fix LocalOutlierFactor's output for data with duplicated samples #28773
Open
HenriqueProj wants to merge 6 commits into scikit-learn:main from HenriqueProj:fix_lof_duplicate_samples
…cated samples
Previously, when a value in the dataset was repeated more times than the algorithm's number of neighbors, LocalOutlierFactor miscalculated the outliers. Because the distance between the duplicated samples is 0, their local reachability density is equal to 1e10. As a result, values close to the duplicated values get a very low negative outlier factor (under -1e7) and are labeled as outliers.
This fix checks whether the minimum negative outlier factor is under -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1 and raises a warning.
Notes: Added a handle_duplicates variable, which lets developers handle the duplicate values manually if desired. Also added a memory_limit variable to avoid memory errors on very large datasets; developers can change it manually as well.
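The mechanism described in this commit message can be reproduced with a small pure-Python sketch. This is not scikit-learn's code: the helper names (`k_nearest`, `negative_outlier_factors`), the 1-D toy data, and the `eps=1e-10` term in the density denominator are assumptions, chosen so that all-zero reachability distances give a density of 1e10 as stated above.

```python
def k_nearest(points, i, k):
    """Return (distance, index) pairs of the k nearest neighbors of points[i], excluding itself."""
    dists = sorted((abs(points[i] - p), j) for j, p in enumerate(points) if j != i)
    return dists[:k]


def negative_outlier_factors(points, k, eps=1e-10):
    """Simplified negative LOF scores for 1-D points (illustrative, not scikit-learn's code)."""
    neighbors = [k_nearest(points, i, k) for i in range(len(points))]
    # k-distance: distance from each point to its k-th nearest neighbor
    k_dist = [nb[-1][0] for nb in neighbors]
    lrd = []
    for nb in neighbors:
        # reachability distance to neighbor j: max(actual distance, k-distance of j)
        reach = [max(d, k_dist[j]) for d, j in nb]
        # eps is an assumed regularizer; with all-zero distances it yields 1 / 1e-10 = 1e10
        lrd.append(1.0 / (sum(reach) / k + eps))
    # LOF = mean density of the neighbors divided by the point's own density, negated
    return [-(sum(lrd[j] for _, j in nb) / k) / lrd[i] for i, nb in enumerate(neighbors)]


# Ten copies of 1.0 (more than k=5 neighbors), followed by distinct values.
points = [1.0] * 10 + [1.1, 2.0, 2.5, 3.1, 3.8, 4.6]

scores_k5 = negative_outlier_factors(points, k=5)
# Number of occurrences of the most frequent value + 1, as in the original fix.
scores_k11 = negative_outlier_factors(points, k=points.count(1.0) + 1)

print("k=5:  min score =", min(scores_k5))
print("k=11: min score =", min(scores_k11))
```

With `k=5` the duplicates keep a score of about -1 while non-duplicated points near them fall below -1e7; with `k=11` no score falls below that threshold on this toy data.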
I think I don't like the recursive automatic change to neighbors. Maybe we should instead just warn the user when we detect the problem with very negative outlier factor values and let the user re-fit the model with a larger value of |
Removed the automatic change to the number of neighbors and changed the warning.
Also changed the associated test to catch the warning.
@ogrisel Changed the code. Now it only raises a warning, as suggested.
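A warning-only check of the kind discussed here might look roughly like the sketch below. The function name `check_negative_outlier_factors`, the `threshold` parameter, and the warning text are hypothetical and are not taken from the actual diff; only the -1e7 cutoff comes from the PR description.

```python
import warnings


def check_negative_outlier_factors(scores, threshold=-1e7):
    """Warn when the lowest score is below threshold; return True if a warning was issued.

    Hypothetical helper for illustration only, not the code added by this PR.
    """
    if min(scores) < threshold:
        warnings.warn(
            "Very low negative outlier factor detected. Duplicated samples may "
            "outnumber the neighbors; consider refitting with more neighbors.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Record warnings instead of printing them, so both cases can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = check_negative_outlier_factors([-1.0, -1.2, -1e9])
    clean = check_negative_outlier_factors([-1.0, -1.2, -3.5])

print(flagged, clean, len(caught))
```

The check leaves the number of neighbors untouched, so the decision to refit stays with the user, which is the behavior requested in the review.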
betatim reviewed Apr 22, 2024
Changed comment according to review
Co-authored-by: Tim Head <betatim@gmail.com>
betatim approved these changes Apr 22, 2024
Reference Issues/PRs
Fixes #27839
What does this implement/fix? Explain your changes.
Previously, when the dataset had values repeated more times than the algorithm's number of neighbors, the outliers were miscalculated.
Because the distance between the duplicated samples is 0, the local reachability density is equal to 1e10. This leads to values that are close to the duplicated values having a really low `negative_outlier_factor_` (under -1e7), labeling them as outliers.
This fix checks if the minimum `negative_outlier_factor_` is under -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, also raising a warning.
Notes: Added a `handle_duplicates` variable, which allows developers to manually handle the duplicate values, if desired. Also added a `memory_limit` variable to avoid creating memory errors for really large datasets, which can also be changed manually by developers.
Any other comments?