Discussion: working memory vs. performance #11506

Closed · jeremiedbb opened this issue Jul 13, 2018 · 12 comments

jeremiedbb (Member) commented Jul 13, 2018:

I ran some benchmarks of KMeans performance while varying working_memory (see #10280). I'm opening this discussion as suggested by @rth in #11271. In KMeans, the working memory comes into play through the function pairwise_distances_argmin_min.

You can see benchmarks below. I benchmarked KMeans.fit on a problem with 100000 samples, 50 dimensions and 1000 clusters, on 3 different machines.
[three figures: KMeans.fit time vs. working_memory, one per machine]
It seems that the working memory has an impact on performance, and moreover that the optimum is close to the CPU cache size. I think the first plot has a lot of noise because it was run on my machine with other processes running, and it also focuses on smaller working memory values.
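For reference, a minimal sketch of this kind of measurement, assuming the global sklearn.set_config(working_memory=...) setting added in #10280; the array sizes, iteration counts and working_memory grid below are illustrative, not the exact script used for the plots:

```python
import time
import numpy as np
import sklearn
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.random_sample((100000, 50)).astype(np.float32)

for wm in [4, 16, 64, 256, 1024]:  # working_memory in MB
    sklearn.set_config(working_memory=wm)
    km = KMeans(n_clusters=1000, n_init=1, max_iter=10, random_state=0)
    t0 = time.perf_counter()
    km.fit(X)
    print("working_memory=%4d MB: %.1f s" % (wm, time.perf_counter() - t0))
```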

Even if the improvement is only about 2x at most, it's worth considering a change to the default value of the working memory, which is currently 1000 MB. However, the optimum depends on the CPU specs. Would it be possible to infer working_memory from them?

ping @ogrisel

jeremiedbb (Member Author) commented:

I also have benchmarks run on pairwise_distances_argmin_min directly.
[two figures: pairwise_distances_argmin_min time vs. working_memory]
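A similar sketch for timing pairwise_distances_argmin_min directly (again, sizes and working_memory values are only illustrative):

```python
import time
import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.RandomState(0)
X = rng.random_sample((100000, 50)).astype(np.float32)
centers = rng.random_sample((1000, 50)).astype(np.float32)

for wm in [1, 4, 16, 64, 256]:  # working_memory in MB
    sklearn.set_config(working_memory=wm)
    t0 = time.perf_counter()
    pairwise_distances_argmin_min(X, centers)
    print("working_memory=%4d MB: %.2f s" % (wm, time.perf_counter() - t0))
```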

rth (Member) commented Jul 13, 2018:

Thanks for these benchmarks, @jeremiedbb!

They are consistent with @jnothman's earlier benchmarks in #10280 (comment), but according to the second benchmark in #10280 (comment) there are cases where a smaller working_memory could slow things down, unless I missed something. @jnothman, would you know in which practical situations that second case could happen?

If I rerun the first set of benchmarks, divide all timings by the curve minimum (to be able to compare them on a linear scale), remove the log scale, and add more points for working_memory between 1 and 10 MB, I get:

[figure: normalized timings vs. working_memory]

My L3 CPU cache is 3 MB, and the optimum here appears to be around 30 MB, which is rather consistent with your figures above.

> However, the optimum depends on the CPU specs. Would it be possible to infer working_memory from them?

Assuming we could detect the CPU L3 cache size (which doesn't seem very straightforward), what do you think the relationship would be?

jeremiedbb (Member Author) commented Jul 13, 2018:

> which is rather consistent with your figures above.

Not that consistent: I find the minimum to be around the CPU cache size, whereas yours seems to be around 10× the CPU cache.

Which function did you benchmark?

rth (Member) commented Jul 13, 2018:

pairwise_distances_argmin_min. Well, maybe loosely consistent, at least in the sense that they also don't suggest using a 1 GB working memory as we do now.

What's the cache size in your pairwise_distances_argmin_min benchmarks?

jeremiedbb (Member Author) commented:

> What's the cache size in your pairwise_distances_argmin_min benchmarks?

Sorry, I didn't mention it. It's 4 MB.

What's the number of clusters? And what's the dtype of your arrays?
I made all my benchmarks with dtype=np.float32.

rth (Member) commented Jul 13, 2018:

See the gist with the benchmark code in the first link of my first comment.

lesshaste commented:

https://github.com/workhorsy/py-cpuinfo seems to be the Python tool of choice for getting the L3 CPU cache size.
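For example, a hedged sketch (the key name and value format vary across py-cpuinfo versions and platforms, and the field can be missing entirely):

```python
# Requires the external py-cpuinfo package (pip install py-cpuinfo).
import cpuinfo

info = cpuinfo.get_cpu_info()
# Depending on the py-cpuinfo version, 'l3_cache_size' may be an int in bytes,
# a string such as "3072 KB", or absent on some platforms.
print(info.get("l3_cache_size"))
```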

jeremiedbb (Member Author) commented:

@rth I ran your benchmark on my machine and got the same result: a minimum around a 30 MB working memory.

> that they also don't suggest using a 1 GB working memory as we do now.

Agreed. However, I have no idea how the optimal working memory relates to the CPU specs. So should we use a lower fixed default like 32 or 64 MB?
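For what it's worth, users can already override the default themselves, globally or within a block:

```python
import sklearn

sklearn.set_config(working_memory=64)  # value in MB

# or scoped to a block:
with sklearn.config_context(working_memory=64):
    ...  # code relying on pairwise_distances_chunked runs with 64 MB here
```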

jeremiedbb (Member Author) commented:

In addition, the dtype of X in your benchmarks is np.float64. When I run the same benchmark with dtype=np.float32, I find the minimum at 60 MB. See below.
[two figures: float32 timings vs. working_memory]

Hmm, actually I'm not sure, but the chunk sizes seem to be wrongly computed in pairwise_distances_chunked: they are computed for float64. See the code below:

chunk_n_rows = get_chunk_n_rows(row_bytes=8 * _num_samples(Y),
                                max_n_rows=n_samples_X,
                                working_memory=working_memory)

row_bytes=8 * _num_samples(Y) assumes each element of the array is stored on 8 bytes.
That means that in all my float32 benchmarks, the memory actually used per chunk was only half of the displayed working_memory.

We should compute row_bytes according to X.dtype. @jnothman, you wrote that code, right? Can you confirm?
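A sketch of what a dtype-aware computation could look like, reusing the names from the excerpt above (illustrative only, assuming X and Y are dense numpy arrays; not necessarily the fix that was eventually merged):

```python
# Use the itemsize of the (widest) input dtype instead of a hard-coded 8 bytes.
itemsize = max(X.dtype.itemsize, Y.dtype.itemsize)
chunk_n_rows = get_chunk_n_rows(row_bytes=itemsize * _num_samples(Y),
                                max_n_rows=n_samples_X,
                                working_memory=working_memory)
```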

rth (Member) commented Jul 14, 2018:

We won't be able to add an external dependency like py-cpuinfo for this.

> So should we use a lower fixed default like 32 or 64 MB?

Yes, but what about #10280 (comment)?

> Hmm, actually I'm not sure, but the chunk sizes seem to be wrongly computed in pairwise_distances_chunked.

There was a somewhat related discussion in #10280 (comment), but I'm not sure whether this was desired or not.

(I would test 64-bit rather than 32-bit by default, given #9354.)

jnothman (Member) commented Jul 16, 2018 via email

jeremiedbb (Member Author) commented:

Given the recent improvements to pairwise distances + reductions, which are memory-efficient and fast, I think the subject of this issue is getting out of date. pairwise_distances_chunked and pairwise_distances_argmin(_min) are destined to disappear at some point.
