
Parallel Benchmarking


Benchee offers the parallel key to execute each benchmarking job in parallel. With parallel: 4, for example, each defined benchmarking function is spawned and executed in 4 tasks; Benchee waits until all 4 of those finish and only then moves on to benchmark the next function.
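A minimal sketch of such a configuration (the benchmarked function is just a placeholder; parallel and time are Benchee's actual configuration keys):

```elixir
Benchee.run(
  %{
    "sort" => fn -> Enum.sort(Enum.shuffle(1..10_000)) end
  },
  parallel: 4, # each job runs in 4 processes at the same time
  time: 5      # each job is measured for 5 seconds
)
```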

Benchee treats the results of a parallel benchmark essentially as if they had been obtained sequentially, so the reported results do not account for the fact that they were gathered in parallel. The results show the average time a function call took across all processes. So, if there were no slowdown due to parallel execution (there is, see the next section), executing with parallel: 4, time: 5 is more or less the same as executing with parallel: 1, time: 20.

While this is great, it also carries certain risks.

Run time impact when running in parallel

First, it's important to know that most modern CPUs boost their clock speed when not all cores are occupied (e.g. Intel Turbo Boost). This means that even if the system is not overloaded, benchmarking numbers will likely get worse as parallelism increases.

To showcase this and the effect of overloading, I ran the same benchmark on my 4-core system under normal working load (browser, music, editor, GUI etc. open). As you will see, 2 processes is just a bit slower (expected, due to the single-core boost), and from there performance degrades roughly as expected. The relative performance between benchmarks stays roughly the same; if anything, the difference often becomes more pronounced.

You can also note that the standard deviation gets progressively bigger, as (I believe) it happens more often that a process has to wait to be scheduled for execution (either by the OS or the BEAM VM).
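The output below comes from the flat_map vs. map.flatten sample that ships with Benchee. As a rough sketch of what samples/run_parallel.exs looks like (list size and exact structure are assumptions; see the repository for the real file):

```elixir
list = Enum.to_list(1..10_000)
map_fun = fn i -> [i, i * i] end

Benchee.run(
  %{
    "flat_map"    => fn -> Enum.flat_map(list, map_fun) end,
    "map.flatten" => fn -> list |> Enum.map(map_fun) |> List.flatten() end
  },
  parallel: 2 # changed to 3, 4 and 12 for the later runs
)
```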

```
tobi@happy ~/github/benchee $ mix run samples/run.exs # parallel 1
Benchmarking flat_map...
Benchmarking map.flatten...

Name                  ips        average    deviation         median
map.flatten       1276.37       783.47μs    (±12.28%)       759.00μs
flat_map           878.60      1138.17μs     (±6.82%)      1185.00μs

Comparison: 
map.flatten       1276.37
flat_map           878.60 - 1.45x slower
tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs # parallel 2
Benchmarking flat_map...
Benchmarking map.flatten...

Name                  ips        average    deviation         median
map.flatten       1230.53       812.66μs    (±19.86%)       761.00μs
flat_map           713.82      1400.92μs     (±5.63%)      1416.00μs

Comparison: 
map.flatten       1230.53
flat_map           713.82 - 1.72x slower
tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs # parallel 3
Benchmarking flat_map...
Benchmarking map.flatten...

Name                  ips        average    deviation         median
map.flatten       1012.77       987.39μs    (±29.53%)       913.00μs
flat_map           513.44      1947.63μs     (±6.91%)      1943.50μs

Comparison: 
map.flatten       1012.77
flat_map           513.44 - 1.97x slower
tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs # parallel 4
Benchmarking flat_map...
Benchmarking map.flatten...

Name                  ips        average    deviation         median
map.flatten        954.88      1047.25μs    (±34.02%)       957.00μs
flat_map           452.38      2210.55μs    (±21.05%)      1914.00μs

Comparison: 
map.flatten        954.88
flat_map           452.38 - 2.11x slower
tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs # parallel 12
Benchmarking flat_map...
Benchmarking map.flatten...

Name                  ips        average    deviation         median
map.flatten        296.63      3371.18μs    (±57.60%)      2827.00μs
flat_map           186.96      5348.74μs    (±42.14%)      5769.50μs

Comparison: 
map.flatten        296.63
flat_map           186.96 - 1.59x slower
```

Of course, overloading the system with 12 processes is very counterproductive and a lot slower than it ought to be :D

Stress Testing

Of course, if you want to see how a system behaves under load, overloading might be exactly what you want in order to stress test the system. This is the original use case for which the feature was introduced, in the words of the contributor:

I needed to benchmark integration tests for a telephony system we wrote - with this system the tests actually interfere with each other (they're using an Ecto repo) and I wanted to see how far I could push the system as a whole. Making this small change to Benchee worked perfectly for what I needed :)
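As a sketch of that kind of usage, with deliberately high parallelism (MyApp.Repo, MyApp.CallRecord and the job body are hypothetical names standing in for whatever exercises your system):

```elixir
# Hypothetical stress test: hammer a shared resource from many processes at once.
# MyApp.Repo and MyApp.CallRecord are made-up names for illustration.
insert_call_record = fn ->
  MyApp.Repo.insert!(%MyApp.CallRecord{duration: :rand.uniform(300)})
end

Benchee.run(
  %{"insert call record" => insert_call_record},
  parallel: 50, # deliberately overload the system
  time: 30
)
```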