
[dbscan] [enhancement] Raise warning when dealing with numbers susceptible to precision errors #23584

Closed
YajJackson opened this issue Jun 10, 2022 · 1 comment
Labels: Needs Triage, New Feature

YajJackson commented Jun 10, 2022

Describe the workflow you want to enable

You have some time-series data that you want to run through DBSCAN, only to find that the eps value you chose is much lower than the one that actually produces the expected results.

Consider

(Pdb) df['ts']
0     1.654272e+09
1     1.654272e+09
2     1.654272e+09
3     1.654272e+09
4     1.654272e+09
5     1.654272e+09
6     1.654272e+09
7     1.654272e+09
8     1.654272e+09
9     1.654272e+09
10    1.654272e+09
11    1.654272e+09
12    1.654272e+09
13    1.654272e+09
14    1.654272e+09
15    1.654272e+09
16    1.654272e+09
17    1.654272e+09
18    1.654272e+09
19    1.654272e+09
20    1.654272e+09
21    1.654272e+09
22    1.654272e+09
23    1.654272e+09
24    1.654273e+09
25    1.654273e+09
Name: ts, dtype: float64

(Pdb) max(np.diff(df['ts']))
8.0

It seems reasonable to take 8.0 as a fair eps value.

However

(Pdb) DBSCAN(eps=8, min_samples=10).fit_predict(df['ts'].to_numpy().reshape(-1,1))
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1])

Maybe you run some optimization that finds a value that works...

(Pdb) DBSCAN(eps=37, min_samples=10).fit_predict(df['ts'].to_numpy().reshape(-1,1))
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0])

But 37 isn't a good value.

One workaround is to scale your data and eps proportionally (which is totally fine; there are probably other ways a user could deal with this).

(Pdb) DBSCAN(eps=8, min_samples=10).fit_predict(df['ts'].to_numpy().reshape(-1,1))
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1])
(Pdb) df['ts'] /= 1000
(Pdb) DBSCAN(eps=8 / 1000, min_samples=10).fit_predict(df['ts'].to_numpy().reshape(-1,1))
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0])
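The scaling workaround works because dividing both the data and eps by the same factor preserves the neighborhood structure, up to floating-point rounding. A minimal self-contained sketch with synthetic values (the array below is an illustrative assumption, not the original df['ts']):

```python
import numpy as np

# Hypothetical timestamps near 1.654272e9, spaced 4 seconds apart
ts = 1.654272e9 + np.cumsum(np.full(10, 4.0))

scale = 1000.0
scaled = ts / scale

# Pairwise gaps shrink by the scale factor, so a neighborhood test
# "gap <= eps" stays equivalent to "gap/scale <= eps/scale".
gaps = np.diff(ts)
scaled_gaps = np.diff(scaled)
print(np.allclose(scaled_gaps, gaps / scale))  # prints True
```

At this magnitude the equivalence holds to well below any sensible eps: the float64 spacing at ~1.65e9 is about 2.4e-7, several orders of magnitude smaller than gaps of a few seconds.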

Describe your proposed solution

Warn users when they are performing potentially imprecise clustering.
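One way such a check might look (purely a sketch; `warn_if_imprecise` and its threshold are hypothetical helpers, not scikit-learn API): compare eps against the float64 spacing at the largest coordinate magnitude in the data.

```python
import warnings
import numpy as np

def warn_if_imprecise(X, eps, factor=1e6):
    """Hypothetical helper: warn when eps is within `factor` ulps of the
    float64 resolution at the data's magnitude, where distance
    comparisons start to become unreliable."""
    resolution = np.spacing(np.abs(X).max())  # ulp at the largest value
    if eps < factor * resolution:
        warnings.warn(
            f"eps={eps} is within {factor:g}x of the float64 resolution "
            f"({resolution:.3g}) at this data's magnitude; consider "
            "rescaling or centering the data."
        )

X = np.array([[1.654272e9]])
# The spacing at ~1.65e9 is 2**-22 (about 2.4e-7), so eps=8 is
# comfortably above the precision floor and no warning fires here.
warn_if_imprecise(X, eps=8.0)
```

The choice of `factor` is the hard part of any such heuristic; the value above is arbitrary and would need tuning before it could avoid both false alarms and missed cases.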

Describe alternatives you've considered, if relevant

I recognize this as a fair user error that may be out of scope for the project, but some kind of raise_precision_warnings option might be very helpful to many users, perhaps for other algorithms as well.

Additional context

@YajJackson added the Needs Triage and New Feature labels on Jun 10, 2022
@YajJackson (Author)
After revisiting this, I found a preprocessing error on my end that was causing unexpected results. I'm going to close this because I am no longer able to reproduce the precision error I thought I was dealing with.

@YajJackson closed this as not planned on Jun 21, 2022