
[dbscan] [enhancement] Raise warning when dealing with numbers susceptible to precision errors #23584

Closed
YajJackson opened this issue Jun 10, 2022 · 1 comment
Labels: Needs Triage, New Feature

YajJackson commented Jun 10, 2022

Describe the workflow you want to enable

You have some time-series data that you want to run through DBSCAN, only to find that the eps value you chose is much lower than the one that actually produces the expected results.

Consider

(Pdb) df['ts']
0     1.654272e+09
1     1.654272e+09
2     1.654272e+09
3     1.654272e+09
4     1.654272e+09
5     1.654272e+09
6     1.654272e+09
7     1.654272e+09
8     1.654272e+09
9     1.654272e+09
10    1.654272e+09
11    1.654272e+09
12    1.654272e+09
13    1.654272e+09
14    1.654272e+09
15    1.654272e+09
16    1.654272e+09
17    1.654272e+09
18    1.654272e+09
19    1.654272e+09
20    1.654272e+09
21    1.654272e+09
22    1.654272e+09
23    1.654272e+09
24    1.654273e+09
25    1.654273e+09
Name: ts, dtype: float64

(Pdb) max(np.diff(df['ts']))
8.0

It seems reasonable to take 8.0 as a fair eps value.

However

(Pdb) DBSCAN(eps=8, min_samples=10).fit_predict(df['ts'].to_numpy().reshape(-1,1))
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1])

Maybe you run some optimization that finds a value that works...

(Pdb) DBSCAN(eps=37, min_samples=10).fit_predict(df['ts'].to_numpy().reshape(-1,1))
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0])

But 37 isn't a good value.

One workaround is to scale your data and eps proportionally (which is totally fine; there are probably other ways a user could deal with this).

(Pdb) DBSCAN(eps=8, min_samples=10).fit_predict(df['ts'].to_numpy().reshape(-1,1))
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1])
(Pdb) df['ts'] /= 1000
(Pdb) DBSCAN(eps=8 / 1000, min_samples=10).fit_predict(df['ts'].to_numpy().reshape(-1,1))
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0])
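The scaling workaround works because dividing both the data and eps by the same factor preserves the neighborhood structure, up to floating-point rounding. A minimal self-contained sketch with synthetic values (the array below is an illustrative assumption, not the original df['ts']):

```python
import numpy as np

# Hypothetical timestamps near 1.654272e9, spaced 4 seconds apart
ts = 1.654272e9 + np.cumsum(np.full(10, 4.0))

scale = 1000.0
scaled = ts / scale

# Pairwise gaps shrink by the scale factor, so a neighborhood test
# "gap <= eps" stays equivalent to "gap/scale <= eps/scale".
gaps = np.diff(ts)
scaled_gaps = np.diff(scaled)
print(np.allclose(scaled_gaps, gaps / scale))  # prints True
```

At this magnitude the equivalence holds to well below any sensible eps: the float64 spacing at ~1.65e9 is about 2.4e-7, several orders of magnitude smaller than gaps of a few seconds.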

Describe your proposed solution

Warn users when they are performing potentially imprecise clustering.
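One way such a check might look (purely a sketch; `warn_if_imprecise` and its threshold are hypothetical helpers, not scikit-learn API): compare eps against the float64 spacing at the largest coordinate magnitude in the data.

```python
import warnings
import numpy as np

def warn_if_imprecise(X, eps, factor=1e6):
    """Hypothetical helper: warn when eps is within `factor` ulps of the
    float64 resolution at the data's magnitude, where distance
    comparisons start to become unreliable."""
    resolution = np.spacing(np.abs(X).max())  # ulp at the largest value
    if eps < factor * resolution:
        warnings.warn(
            f"eps={eps} is within {factor:g}x of the float64 resolution "
            f"({resolution:.3g}) at this data's magnitude; consider "
            "rescaling or centering the data."
        )

X = np.array([[1.654272e9]])
# The spacing at ~1.65e9 is 2**-22 (about 2.4e-7), so eps=8 is
# comfortably above the precision floor and no warning fires here.
warn_if_imprecise(X, eps=8.0)
```

The choice of `factor` is the hard part of any such heuristic; the value above is arbitrary and would need tuning before it could avoid both false alarms and missed cases.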

Describe alternatives you've considered, if relevant

I recognize this as a fair user error that may be out of scope for the project, but some kind of raise_precision_warnings option might be very helpful to many users, perhaps for other algorithms as well.

Additional context

@YajJackson added the Needs Triage and New Feature labels on Jun 10, 2022
@YajJackson (Author)
After revisiting this, I found a preprocessing error on my end that was causing unexpected results. I'm going to close this because I am no longer able to reproduce the precision error I thought I was dealing with.

@YajJackson closed this as not planned on Jun 21, 2022