
[MRG] Fix KMeans convergence when tol==0 #17959

Merged
merged 10 commits on Jul 28, 2020

Conversation

jeremiedbb
Member

Fixes #17428

When tol==0, convergence based on the norm of the difference of the centers between 2 iterations can fail, because rounding errors can keep the norm from being exactly 0, letting the iterations run until max_iter is reached.

This PR implements convergence based on the difference of the labels between 2 iterations when tol == 0.
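The idea can be sketched with a minimal Lloyd-style loop. This is an illustrative re-implementation, not the scikit-learn code; the function `lloyd_kmeans` and its structure are hypothetical, but the convergence logic follows the PR's description: strict label stability when tol == 0, center-shift norm otherwise.

```python
import numpy as np

def lloyd_kmeans(X, centers, max_iter=300, tol=0.0):
    """Minimal Lloyd iteration illustrating the PR's convergence idea:
    with tol == 0, stop on strict label stability between two iterations,
    since the center-shift norm may never be exactly 0 due to rounding."""
    labels = None
    for i in range(max_iter):
        # Assignment step: squared distances to each center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        new_labels = dists.argmin(axis=1)
        # Strict convergence: no label changed since the last iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: recompute centers (assumes no cluster goes empty).
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        # tol-based convergence on the squared center shift.
        if tol > 0 and ((new_centers - centers) ** 2).sum() <= tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels, i + 1

# Two well-separated blobs; with tol=0 the loop stops well before max_iter.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 8.0])
centers, labels, n_iter = lloyd_kmeans(X, X[[0, 99]].astype(float), tol=0.0)
print(n_iter)
```

With tol=0 the loop above terminates as soon as an assignment step reproduces the previous labels, which is exactly the condition that rounding errors cannot disturb.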

It does not impact performance. Here's a timing comparison between tol=0 and tol=1e-16 (both converge after 60 iterations to the same clustering):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100000, centers=100, n_features=10)

km = KMeans(n_clusters=100, tol=0, n_init=1, init='random', random_state=0)
%timeit km.fit(X)
2.13 s ± 3.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

km = KMeans(n_clusters=100, tol=1e-16, n_init=1, init='random', random_state=0)
%timeit km.fit(X)
2.14 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@ogrisel ogrisel (Member) left a comment

LGTM, just a minor remark:

[2 resolved review threads on sklearn/cluster/_kmeans.py]
@thomasjpfan thomasjpfan (Member) left a comment

Thank you for the PR, @jeremiedbb!

[4 resolved review threads on sklearn/cluster/_kmeans.py]
@jeremiedbb
Member Author

I changed it to always check convergence based on labels first and then on tol. I tested the impact on performance in the worst case, i.e. many samples, few centers, few features, which makes the labels comparison as costly as possible relative to the other computations. In that case the performance drop is rather small, see below. I think it's acceptable (recalling that this is the worst situation; in many situations the drop is undetectable).

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np

X = np.random.RandomState(0).random_sample((1000000, 3))

km = KMeans(n_clusters=3, tol=1e-16, n_init=1, init='random', random_state=0)

# master
%timeit km.fit(X)
2.19 s ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

km.n_iter_
71

# this pr
%timeit km.fit(X)
2.26 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

km.n_iter_
71

@ogrisel ogrisel (Member) left a comment

LGTM again, I agree with your perf analysis and I think it's OK to always do the check. Some nitpicks:

[3 resolved review threads on sklearn/cluster/_kmeans.py]
@ogrisel

ogrisel commented Jul 22, 2020

Another suggestion: the coverage report highlights the fact that the new convergence messages are not covered.

It might be interesting to extend the tests using the capsys fixture to make some assertions on those messages, in particular with tol=0 or 1e-100:

https://docs.pytest.org/en/stable/capture.html#accessing-captured-output-from-a-test-function
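The capture idea can be sketched in plain Python with `contextlib.redirect_stdout`; under pytest, `capsys.readouterr()` plays the same role. The `fit_verbose` stand-in and its message strings are hypothetical placeholders, not scikit-learn's actual verbose output:

```python
import contextlib
import io

def fit_verbose(tol):
    # Hypothetical stand-in for KMeans(verbose=1).fit(X); the real
    # messages are whatever this PR adds and must be checked against
    # the actual source.
    if tol == 0:
        print("Converged at iteration 12: strict convergence.")
    else:
        print("Converged at iteration 12: center shift within tolerance.")

def captured_output(func, *args):
    # Capture everything func prints to stdout and return it as a string.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        func(*args)
    return buf.getvalue()

assert "strict convergence" in captured_output(fit_verbose, 0)
assert "center shift" in captured_output(fit_verbose, 1e-100)

# Under pytest, the capsys fixture does the capturing instead
# (hypothetical test; assumes KMeans prints such a message):
#
# def test_kmeans_verbose_strict_convergence(capsys):
#     KMeans(tol=0, verbose=1, n_init=1).fit(X)
#     assert "strict convergence" in capsys.readouterr().out
```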

@jeremiedbb
Member Author

I added a test checking all expected outputs of KMeans in verbose mode.
I'd rather not do the same for MiniBatchKMeans in this PR, but in #17622 instead, since that PR changes some of the prints.


assert km.n_iter_ < 300
assert km_full.n_iter_ == km_elkan.n_iter_ < max_iter
(Member) left a comment

Will this assertion be robust to rounding errors that can affect Lloyd and Elkan slightly differently?

(Contributor) left a comment

That is a badly broken line anyway that is likely equivalent to

    assert int(km_full.n_iter_ == km_elkan.n_iter_) < max_iter

isn't it? And because int(boolean) is either 0 or 1, this will likely always be True (unless you use max_iter=1 in some test).

As for numerical issues: in theory, because of rounding, Lloyd and Elkan may in rare cases produce different results. But I would rather test that they do the same number of iterations and then study the situation if they ever disagree, if ever. That would be an interesting case to see.

(Member Author) left a comment

It's not broken. In python you can chain comparison operators. a == b < c is equivalent to (a == b) and (b < c).

It's not robust to rounding errors, indeed. While I agree that it's interesting to check when such a situation happens, it's painful to have that in the test suite, because it's hard to debug when it suddenly fails on the CI due to a hardware change or some other reason. It happened with another test and we just ended up xfailing it...
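The chaining behavior under discussion can be demonstrated with a small standalone snippet (illustrative only; variable names mirror the assertion under review):

```python
# Python chains comparison operators: a == b < c evaluates as
# (a == b) and (b < c), NOT as (a == b) < c.
n_full, n_elkan, max_iter = 71, 71, 300

chained = n_full == n_elkan < max_iter
assert chained == ((n_full == n_elkan) and (n_elkan < max_iter))
assert chained

# The misreading int(a == b) < max_iter would stay True even when the
# iteration counts differ, while the chained form correctly fails:
n_elkan = 72
assert not (n_full == n_elkan < max_iter)
assert int(n_full == n_elkan) < max_iter
print("a == b < c behaves as (a == b) and (b < c)")
```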

(Member Author) left a comment

Also, there's already a test which checks that Elkan and Lloyd have the same n_iter_ in a situation where rounding errors can't interfere (test_kmeans_elkan_results, iirc).

@thomasjpfan thomasjpfan (Member) left a comment

LGTM!

Please add an entry to the change log at doc/whats_new/v0.24.rst with tag |Fix|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@ogrisel ogrisel merged commit fc06bae into scikit-learn:master Jul 28, 2020
@jnothman jnothman modified the milestone: 0.23.2 Jul 28, 2020
@jnothman
Member

Should this be in 0.23.2??

@ogrisel ogrisel added this to the 0.23.2 milestone Jul 29, 2020
@ogrisel

ogrisel commented Jul 29, 2020

Should this be in 0.23.2??

Seems easy to backport. Let's do it.

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 3, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Successfully merging this pull request may close these issues.

locally failing kmeans convergence test (WSL)
5 participants