
locally failing kmeans convergence test (WSL) #17428

Closed
amueller opened this issue Jun 2, 2020 · 18 comments · Fixed by #17959

Comments

@amueller
Member

amueller commented Jun 2, 2020

> pytest -sv sklearn/cluster/tests/test_k_means.py -k test_kmeans_convergence

fails for me for both algorithms:

>       assert km.n_iter_ < 300
E       AssertionError: assert 300 < 300
E        +  where 300 = KMeans(algorithm='full', n_clusters=5, n_init=1, random_state=0, tol=0).n_iter_

show_versions:

System:
    python: 3.8.3 (default, May 19 2020, 18:47:26)  [GCC 7.3.0]
executable: /home/andy/anaconda3/envs/sklearndev/bin/python
   machine: Linux-4.4.0-18362-Microsoft-x86_64-with-glibc2.10

Python dependencies:
          pip: 20.0.2
   setuptools: 46.4.0.post20200518
      sklearn: 0.24.dev0
        numpy: 1.18.1
        scipy: 1.4.1
       Cython: 0.29.17
       pandas: 1.0.3
   matplotlib: 3.1.3
       joblib: 0.15.1
threadpoolctl: 2.1.0

Built with OpenMP: True
@amueller
Member Author

amueller commented Jun 2, 2020

cc @jeremiedbb

@amueller
Member Author

amueller commented Jun 2, 2020

Inertia basically stays the same after iteration 50, but there seem to be some floating point issues, maybe?

Iteration 48, inertia 40826.39061567664
Iteration 49, inertia 40826.37236625354
Iteration 50, inertia 40826.37236625354
Iteration 51, inertia 40826.37236625354
Iteration 52, inertia 40826.372366253534
Iteration 53, inertia 40826.372366253534
Iteration 54, inertia 40826.372366253534
Iteration 55, inertia 40826.372366253534
Iteration 56, inertia 40826.372366253534
Iteration 57, inertia 40826.372366253534
Iteration 58, inertia 40826.37236625354

Showing the center shift, we can see:

Iteration 49, inertia 40826.37236625354, center shift 2.000920841596814e-31
Iteration 50, inertia 40826.37236625354, center shift 2.272958446241511e-31
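For context, the flicker between 40826.37236625354 and 40826.372366253534 in the log above is a difference of a single ulp: at that magnitude (between 2^15 and 2^16), adjacent float64 values are 2^-37 apart, which NumPy can confirm:

```python
import numpy as np

# The two inertia values in the log differ only in the last decimal digit.
# For doubles in [2**15, 2**16), the spacing between adjacent values is
# 2**-37 ~ 7.3e-12 -- exactly the size of the observed flicker.
inertia = 40826.37236625354
print(np.spacing(inertia))  # 7.275957614183426e-12 == 2.0**-37
```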

@amueller
Member Author

amueller commented Jun 2, 2020

Somewhat related to #17428.
I wonder if we should also check that no cluster centers are relabeled? This tolerance is meant to be a weaker convergence criterion, in my understanding. If nothing is relabeled, we certainly don't want to keep going.
cc @kno10

@amueller
Member Author

amueller commented Jun 2, 2020

I see two solutions: remove the test because it's unstable, or add an extra condition into the convergence check for tol=0 to also check whether the labels changed.

@jeremiedbb
Member

It does not come from relabelling indefinitely. When tol=0 we are checking that norm(old_centers - new_centers) == 0. Floating point errors can prevent the norm from ever being exactly 0; it stays around machine precision instead.

I already added a comment in the docstring that using tol=0 is not advised.

@amueller
Member Author

amueller commented Jun 2, 2020

Not sure if you saw my comment just before yours:
#17428 (comment)

What I mean is that either we should make sure the test that's actually performed when using tol=0 is more stable than a floating point comparison, or we shouldn't have that setting in the tests.

@jeremiedbb
Member

Initially this test was added in a PR whose goal was to make sure that when tol=0, the iteration loop doesn't run until max_iter (due to the strict inequality in the convergence check).

Since we can't guarantee that due to floating point errors, I think we should just remove the test.

@amueller
Member Author

amueller commented Jun 2, 2020

I'm fine with that, but then maybe we should also raise a warning if someone provides tol=0? Asking for the classical behavior of no changes in assignments seems like a reasonable use case, but it would require carrying around the old labels, which is not too bad.

@jeremiedbb
Member

jeremiedbb commented Jun 2, 2020

I don't have a preference. I'm fine with both. I'd also be fine with silently switching to tol=eps when the user provides tol=0 :)
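The tol=eps rewrite could look like the following sketch (`effective_tol` is a hypothetical helper for illustration, not scikit-learn code):

```python
import numpy as np

def effective_tol(tol, dtype=np.float64):
    """Hypothetical guard: treat tol=0 as 'one machine epsilon' so the
    center-shift check can still succeed despite rounding noise."""
    if tol == 0:
        return np.finfo(dtype).eps
    return tol

print(effective_tol(0.0))   # 2.220446049250313e-16 for float64
print(effective_tol(1e-4))  # user-supplied tolerances pass through unchanged
```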

@kno10
Contributor

kno10 commented Jun 3, 2020

The standard approach is simply to set a flag whenever any point is reassigned while computing the new assignment, and to declare convergence only when no point is relabeled.
I understand that providing a "tol" parameter is more in line with other machine learning algorithms that cannot converge otherwise. But since k-means finishes with "no more changes", this definitely should be an option to have. It is a feature of k-means: quite a few people will expect the algorithm to have reached such a stable state, where every point really is assigned to the closest center. With tol>0, there probably still is some point that is not assigned to the closest center, which may confuse users if the algorithm stops too early.
If a user wants floating-point tolerance convergence, they can pass tol=2e-308 or similar. tol=0 should really mean "no more reassignments happening", to be compatible with the common definition of the standard k-means algorithm: stop when no point is reassigned.
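The "no reassignment" criterion described above can be sketched as a toy Lloyd's iteration (for illustration only; this is not scikit-learn's implementation):

```python
import numpy as np

def lloyd_until_stable(X, centers, max_iter=300):
    """Toy Lloyd's algorithm that declares convergence when no point
    changes its label. Illustrative sketch only."""
    centers = np.asarray(centers, dtype=float).copy()
    labels = None
    for n_iter in range(1, max_iter + 1):
        # Assignment step: nearest center for every point.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Strict convergence: stop as soon as no point was reassigned.
        if labels is not None and np.array_equal(new_labels, labels):
            return labels, centers, n_iter
        labels = new_labels
        # Update step: each center becomes the mean of its points.
        for k in range(len(centers)):
            pts = X[labels == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return labels, centers, max_iter
```

Because the stopping condition is a boolean over integer labels rather than a floating-point norm, it cannot spin forever on center shifts of size ~1e-31 like the ones logged above.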

@amueller
Member Author

amueller commented Jun 3, 2020

@kno10 I agree that having the textbook definition seems useful, though it is mostly academic. With a dataset of any common size, waiting for full convergence is probably very slow and not what you want. The 'finishing' also just means 'found a local optimum with a particular set of local moves'. But indeed the stable state is a nice property.

I'm not sure what you mean by the "If the user wants a floating-point tolerance" statement. I think the question is whether it's acceptable to rewrite tol=0 as tol=eps and then use the current criterion which is equivalent to no more change for all practical purposes, or if we should actually track cluster re-assignment, which we could do for the case of tol=0 for little extra cost and just some added complexity. Personally I'd be happy to take a PR that does that, not sure if others agree though.

@kno10
Contributor

kno10 commented Jun 3, 2020

It is not that slow to do full convergence.

First of all, few users actually have that big data where k-means makes sense to use in the first place. Big data is a big lie, most of the time. Most users are still in the megabyte range. And even if someone has gigabytes of data, that is usually before preparation and "vectorization". Once you get the data into shape for clustering, it tends to shrink (or you are probably doing something wrong anyway). So for most users, the number of iterations is of little concern. The others can set tolerance.

For good algorithms (Elkan and beyond), the first iteration is expensive - but the later iterations get cheaper and cheaper, because fewer points are recomputed. I've had benchmarks where the last 100 iterations took as long as the first iteration because only single points were recomputed, much less than N/100. Try it on the mnist data, plot runtime vs. number of iterations.

Note that having a boolean "something changed" flag can even be cheaper than the current tolerance computation, and it only adds O(1) memory. Since Lloyd's algorithm, for example, does not need the center shift at all, the O(k*d) operations spent computing it are "wasted".

With floating point, I would be more concerned that in some rare corner cases, an oscillation could arise due to bad rounding. And you then do not reach a stable state. But I do not have proof that this can happen, and this then likely happens independently of tolerance or fix point convergence.

@kno10
Contributor

kno10 commented Jun 3, 2020

#16083
had implemented hard convergence, but is outdated now. It probably can be closed.
But some of the issues pointed out there may still persist, such as doing an extra assignment iteration at the end that may not be necessary if the method has converged fully. Probably the new version at least does not do the double assignment in the beginning anymore?

@amueller
Member Author

amueller commented Jun 3, 2020

@kno10 I opened the issue because I had an oscillation on the test that you added. Oscillation in the change, not in the labels, of course.

It's very hard to assess what the 'average use case' is for a library like sklearn, and I can see an argument being made in either direction. I don't think we'll change the default behavior for now, but I'd be happy to see the fixes from #16083 be incorporated as far as they still apply. It's unfortunate that your work there conflicted with a major rewrite.

@amueller
Member Author

amueller commented Jun 3, 2020

Btw, do you have thoughts on selecting algorithms, and on other algorithms we should implement? The last time I checked, the trade-offs were not super clear for all the algorithms. Elkan seems problematic mostly because of its memory requirement; Yin-Yang is better there, but IIRC there have also been some further modifications since.

@kno10
Contributor

kno10 commented Jun 3, 2020

Oscillation in the change raises largely the same question: why and how does this happen? How can the change oscillate, but not the labels? When the centers are computed using the arithmetic mean, the only way they could oscillate is if the labels changed. Or is it because the center shift computation itself is numerically unstable (cf. #9354), so that identical centers do not yield a distance of 0? Can you replace the distance computations there to not use the dot hack?

Yin-Yang k-means is a bit tricky for users because it adds additional tuning parameters. You then have to run k-means on the initial cluster centers, for example. In experiments here, the performance would usually be between Elkan and Hamerly (unsurprisingly); so the method is a tradeoff, but once you tune parameters you'll likely want to use either of the original methods.
Newling's approach currently seems to be one of the best in ELKI, and Hamerly is likely an easy by-product when implementing Newling. Hartigan-Wong may find better solutions (although probably only marginally) because the standard k-means fixpoints are not local optima. Hartigan-Wong is the default algorithm in R.
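The numerical hazard of the expanded dot-product distance (the "dot hack" of #9354 mentioned above) is easy to reproduce: computing ||a-b||^2 as a·a - 2·a·b + b·b cancels catastrophically when coordinates are large, while the direct form survives. A toy illustration:

```python
import numpy as np

a = np.array([1e8, 1.0])
b = np.array([1e8, 2.0])

# Direct squared distance: subtracting first cancels the large coordinate
# exactly, so the small difference survives.
direct = ((a - b) ** 2).sum()           # 1.0

# Expanded form a.a - 2 a.b + b.b: each dot product is ~1e16, so the
# contribution of the small coordinates is lost to rounding.
expanded = a @ a - 2 * (a @ b) + b @ b  # 0.0 -- catastrophic cancellation
print(direct, expanded)
```

Two identical centers would still give exactly 0 either way; the danger is that nearly identical centers at large coordinates can report a shift of 0, or a spuriously nonzero one, depending on rounding.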

@ckastner
Contributor

>       assert km.n_iter_ < 300
E       AssertionError: assert 300 < 300
E        +  where 300 = KMeans(algorithm='full', n_clusters=5, n_init=1, random_state=0, tol=0).n_iter_

When building the 0.23.1 packages for Debian, I too encountered these test failures (both 'full' and 'elkan' failed).

Reading the exchange above, the issue seems to already have been identified. I thought it couldn't hurt to add another data point.

We've locally disabled these tests for now.

@ogrisel
Member

ogrisel commented Jul 21, 2020

Oscillation in the change raises largely the same question: why and how does this happen. How can the change oscillate, but not the labels? When the centers are computed using the arithmetic mean, the only way they could oscillate is if the labels changed?

Since we now use OpenMP (thread-based) parallelism by default, I would not be surprised if we observed some non-deterministic rounding errors.
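Floating-point addition is not associative, so a thread-based reduction that sums partial results in a run-dependent order can legitimately produce slightly different totals. The classic one-liner shows the effect (pure Python, no OpenMP needed):

```python
# Summing the same three numbers in two different orders yields two
# different float64 results -- the kind of discrepancy a parallel
# reduction can introduce nondeterministically.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
print(left == right)       # False
```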

5 participants