Discussion: working memory vs. performance #11506

Closed · jeremiedbb opened this issue Jul 13, 2018 · 12 comments

jeremiedbb (Member) commented Jul 13, 2018:

I ran some benchmarks of KMeans performance while varying working_memory (see #10280). I'm opening this discussion as suggested by @rth in #11271. In KMeans, the working memory comes into play through the function pairwise_distances_argmin_min.

You can see benchmarks below. I benchmarked KMeans.fit on a problem with 100000 samples, 50 dimensions and 1000 clusters, on 3 different machines.
[three figures: KMeans.fit time vs. working_memory, one per machine]
It seems that the working memory has an impact on performance, and moreover that the optimum is close to the CPU cache size. I think the first plot has a lot of noise because it was run on my machine with other processes running, and it also focuses on smaller working memory values.
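For reference, a minimal sketch of this kind of measurement, assuming the global sklearn.set_config(working_memory=...) setting added in #10280; the array sizes, iteration counts and working_memory grid below are illustrative, not the exact script used for the plots:

```python
import time
import numpy as np
import sklearn
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.random_sample((100000, 50)).astype(np.float32)

for wm in [4, 16, 64, 256, 1024]:  # working_memory in MB
    sklearn.set_config(working_memory=wm)
    km = KMeans(n_clusters=1000, n_init=1, max_iter=10, random_state=0)
    t0 = time.perf_counter()
    km.fit(X)
    print("working_memory=%4d MB: %.1f s" % (wm, time.perf_counter() - t0))
```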

Even if the improvement is only about 2x at most, it's worth considering a change to the default value of the working memory, which is currently 1000 MB. However, the optimum depends on the CPU specs. Would it be possible to infer working_memory from them?

ping @ogrisel

jeremiedbb (Member Author) commented:

I also have benchmarks run on pairwise_distances_argmin_min directly.
[two figures: pairwise_distances_argmin_min time vs. working_memory]
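A similar sketch for timing pairwise_distances_argmin_min directly (again, sizes and working_memory values are only illustrative):

```python
import time
import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.RandomState(0)
X = rng.random_sample((100000, 50)).astype(np.float32)
centers = rng.random_sample((1000, 50)).astype(np.float32)

for wm in [1, 4, 16, 64, 256]:  # working_memory in MB
    sklearn.set_config(working_memory=wm)
    t0 = time.perf_counter()
    pairwise_distances_argmin_min(X, centers)
    print("working_memory=%4d MB: %.2f s" % (wm, time.perf_counter() - t0))
```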

rth (Member) commented Jul 13, 2018:

Thanks for these benchmarks, @jeremiedbb!

They are consistent with @jnothman's earlier benchmarks in #10280 (comment), but according to the second benchmark in #10280 (comment) there are cases where a smaller working_memory could slow things down, unless I missed something. @jnothman, would you know in which practical situations that second case could happen?

If I rerun the first set of benchmarks, divide all timings by the curve minimum (to be able to compare them on a linear scale), remove the log scale, and add more points for working_memory between 1 and 10 MB, I get:

[figure: normalized timings vs. working_memory]

My L3 CPU cache is 3 MB, and the optimum here appears to be around 30 MB, which is rather consistent with your figures above.

> However, the optimum depends on the CPU specs. Would it be possible to infer working_memory from them?

Assuming we could detect the CPU L3 cache size (which doesn't seem very straightforward), what do you think the relationship would be?

jeremiedbb (Member Author) commented Jul 13, 2018:

> which is rather consistent with your figures above.

Not that consistent: I find the minimum to be around the CPU cache size, whereas yours seems to be around 10× the CPU cache.

Which function did you benchmark?

rth (Member) commented Jul 13, 2018:

pairwise_distances_argmin_min. Well, maybe loosely consistent, at least in the sense that they also don't suggest using a 1 GB working memory as we do now.

What's the cache size in your pairwise_distances_argmin_min benchmarks?

jeremiedbb (Member Author) commented:

> What's the cache size in your pairwise_distances_argmin_min benchmarks?

Sorry, I didn't mention it. It's 4 MB.

What's the number of clusters? And what's the dtype of your arrays?
I made all my benchmarks with dtype=np.float32.

rth (Member) commented Jul 13, 2018:

See the gist with the benchmark code in the first link of my first comment.

lesshaste commented:

https://github.com/workhorsy/py-cpuinfo seems to be the Python tool of choice for getting the L3 CPU cache size.
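For example, a hedged sketch (the key name and value format vary across py-cpuinfo versions and platforms, and the field can be missing entirely):

```python
# Requires the external py-cpuinfo package (pip install py-cpuinfo).
import cpuinfo

info = cpuinfo.get_cpu_info()
# Depending on the py-cpuinfo version, 'l3_cache_size' may be an int in bytes,
# a string such as "3072 KB", or absent on some platforms.
print(info.get("l3_cache_size"))
```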

jeremiedbb (Member Author) commented:

@rth I ran your benchmark on my machine and got the same result: a minimum around a 30 MB working memory.

> that they also don't suggest using a 1 GB working memory as we do now.

Agreed. However, I have no idea how the optimal working memory relates to the CPU specs. So should we use a lower fixed default like 32 or 64 MB?
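For what it's worth, users can already override the default themselves, globally or within a block:

```python
import sklearn

sklearn.set_config(working_memory=64)  # value in MB

# or scoped to a block:
with sklearn.config_context(working_memory=64):
    ...  # code relying on pairwise_distances_chunked runs with 64 MB here
```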

jeremiedbb (Member Author) commented:

In addition, the dtype of X in your benchmarks is np.float64. When I run the same benchmark with dtype=np.float32, I find the minimum at 60 MB. See below.
[two figures: float32 timings vs. working_memory]

Hmm, actually I'm not sure, but the chunk sizes seem to be wrongly computed in pairwise_distances_chunked: they are computed for float64. See the code below:

chunk_n_rows = get_chunk_n_rows(row_bytes=8 * _num_samples(Y),
                                max_n_rows=n_samples_X,
                                working_memory=working_memory)

row_bytes=8 * _num_samples(Y) assumes each element of the array is stored on 8 bytes.
That means that in all my float32 benchmarks, the memory actually used per chunk was only half of the displayed working_memory.

We should compute row_bytes according to X.dtype. @jnothman, you wrote that code, right? Can you confirm?
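A sketch of what a dtype-aware computation could look like, reusing the names from the excerpt above (illustrative only, assuming X and Y are dense numpy arrays; not necessarily the fix that was eventually merged):

```python
# Use the itemsize of the (widest) input dtype instead of a hard-coded 8 bytes.
itemsize = max(X.dtype.itemsize, Y.dtype.itemsize)
chunk_n_rows = get_chunk_n_rows(row_bytes=itemsize * _num_samples(Y),
                                max_n_rows=n_samples_X,
                                working_memory=working_memory)
```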

rth (Member) commented Jul 14, 2018:

We won't be able to add an external dependency like py-cpuinfo for this.

> So should we use a lower fixed default like 32 or 64 MB?

Yes, but what about #10280 (comment)?

> Hmm, actually I'm not sure, but the chunk sizes seem to be wrongly computed in pairwise_distances_chunked.

There was a somewhat related discussion in #10280 (comment), but I'm not sure whether this was desired or not.

(I would test 64-bit rather than 32-bit by default, given #9354.)

jnothman (Member) commented Jul 16, 2018 via email

jeremiedbb (Member Author) commented:

Given the recent improvements to pairwise distances + reductions, which are memory-efficient and fast, I think the subject of this issue is getting out of date. pairwise_distances_chunked and pairwise_distances_argmin(_min) are destined to disappear at some point.
