
More copy-able result for @benchmark #236

Open
Moelf opened this issue Jul 15, 2021 · 19 comments · May be fixed by #260

Comments

@Moelf
Contributor

Moelf commented Jul 15, 2021

Sometimes users want to copy the result for comparison where @btime gives too little information but the current @benchmark is unfriendly for that.

One proposal is to change now:

julia> @benchmark sin(3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.301 ns … 15.216 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.384 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.456 ns ±  0.294 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        █     ▅                                           ▂   
  ▅▃▂▁▁▃█▆▃▂▂▄█▄▃▂▁▂▄▃▂▂▁▁▃▆▃▂▁▂▁▂▅▃▂▁▁▁▂▄▄▂▂▂▁▁▂▄▃▃▂▁▁▁▁▅█▄ ▃
  1.3 ns         Histogram: frequency by time        1.65 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

to

julia> @benchmark sin(3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.301 ns … 15.216 ns
 Time  (median):     1.384 ns
 Time  (mean ± σ):   1.456 ns ±  0.294 ns
 -----------------------------------------
 GC (min … max): 0.00% … 0.00%
 GC (median):    0.00%
 GC (mean ± σ):  0.00% ± 0.00%

        █     ▅                                           ▂   
  ▅▃▂▁▁▃█▆▃▂▂▄█▄▃▂▁▂▄▃▂▂▁▁▃▆▃▂▁▂▁▂▅▃▂▁▁▁▂▄▄▂▂▂▁▁▂▄▃▃▂▁▁▁▁▅█▄ ▃
  1.3 ns         Histogram: frequency by time        1.65 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

so that people can copy the first 3-4 lines without sending really long, wrapped lines.

Besides, I think it's much easier to navigate vertically anyway.

@vchuravy
Member

@tecosaur any easy improvements we could do?

Maybe add a compressed variant of `show`?

@tecosaur
Collaborator

I can see the value in this; it's just a matter of how best to do it. Unfortunately, I've been rather busy the past few weeks, and will be for the immediate future, so you'll see a lot more of me commenting than doing 😛.

@Moelf
Contributor Author

Moelf commented Aug 25, 2021

Let me know what the desired syntax/behavior should be and I can give it a try.

@Moelf
Contributor Author

Moelf commented Sep 10, 2021

Addition: I think `Memory estimate: 0 bytes, allocs estimate: 0.` should also go to the top, where it can be copied together with the rest.

@mcabbott

mcabbott commented Sep 10, 2021

I suspect we could remove a lot of details without losing much. Has anyone ever got any value out of median GC percentage (whatever that is)? Putting the most important information first, you could do for instance:

julia> @benchmark sin(3)
BenchmarkTools.Trial:
 Time: min 1.301 ns, mean 1.456 ns, max 15.216 ns 
 Memory: 0 allocations, 0 bytes
 GC: min 0.0 ns (0.0%), max 0.0 ns (0.0%)
 10_000 samples with 1_000 evaluations:

        █     ▅                                           ▂   
  ▅▃▂▁▁▃█▆▃▂▂▄█▄▃▂▁▂▄▃▂▂▁▁▃▆▃▂▁▂▁▂▅▃▂▁▁▁▂▄▄▂▂▂▁▁▂▄▃▃▂▁▁▁▁▅█▄ ▃
  1.3 ns                                             1.65 ns <

And perhaps "0 bytes" / "GC: min 0.0 ns ..." can be omitted entirely when these are all zero, without being a confusing change of layout.

Since nobody runs just one benchmark, the case we should design for is having several on screen -- so being compact, and having all the information in one block, seem desirable.

The fact that the histogram auto-scales to look nice for one run works against this, e.g. in #237. While it's a bigger change than moving text around, a smart suggestion (Edit -- from #253) was that perhaps the histogram should always have 0 at the left and a power of 10 on the right. Then (very often) they would be comparable between different functions.
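As a rough sketch of that fixed-axis idea (a hypothetical helper, not BenchmarkTools' actual code), the right edge could be derived from the slowest sample:

```julia
# Sketch (assumption, not the real implementation): pin the histogram's left
# edge at 0 and its right edge at the next power of 10 above the slowest
# sample, so histograms from different runs share comparable axes.
histogram_right_edge(times_ns) = 10.0 ^ ceil(Int, log10(maximum(times_ns)))

histogram_right_edge([1.3, 1.4, 15.2])     # 100.0
histogram_right_edge([18.9, 28.2, 764.9])  # 1000.0
```

Two benchmarks whose samples fall in the same decade would then get identical axes, at the cost of wasted horizontal space when the data cluster near the minimum.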

@timholy
Member

timholy commented Sep 18, 2021

I don't disagree with the concerns, but for the proposed solutions I have a couple of counterpoints:

  • yes, I do like to paste benchmark results, and yes the oddities (esp. the vertical break) in the histogram are a bit weird. But even more, I want it to be useful when it's just me and my own laptop. Vertical space is more important now that we can align the horizontal axis across runs (Support manual range for hist display #255)
  • If we're slimming down what we print, we should still include some measure of variability, because interpreting differences (e.g., between two implementations) is always dicey if you have no measure of variability. Unfortunately, the total range (min..max) is almost useless. While standard deviation is much better than total range, it's still quite heavily influenced by outliers because each deviation is squared before averaging. Mean absolute deviation is more robust, and naturally pairs with the median (https://stats.stackexchange.com/a/7312). Hence I'd propose that the one required element of our textual summary should be median ± mean absolute deviation.
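To illustrate the robustness claim with synthetic numbers (a toy example, not real benchmark data — the `meanad` helper here is the same definition as in the next comment):

```julia
using Statistics

# Mean absolute deviation from the median.
meanad(x, med = median(x)) = mean(abs, x .- med)

# Toy data: 99 identical fast runs plus one GC-hit outlier (all in ns).
times = vcat(fill(19.5, 99), 765.0)

meanad(times)  # ≈ 7.5 ns — grows linearly with the outlier
std(times)     # ≈ 74.6 ns — an order of magnitude larger, since the
               # outlier's deviation is squared before averaging
```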

@timholy
Member

timholy commented Sep 18, 2021

Here's a real-world example of just how bad the standard deviation is for these purposes:

julia> show(ioctxfalse, MIME("text/plain"), b)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.860 μs … 764.893 μs  ┊ GC (min … max): 0.00% … 93.99%
 Time  (median):     19.471 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.375 μs ±  12.209 μs  ┊ GC (mean ± σ):  0.99% ±  1.65%

   █▃ ▃▁                                                        
  ▇██▆██▄▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  18.9 μs         Histogram: frequency by time         28.2 μs <

 Memory estimate: 15.84 KiB, allocs estimate: 4.
julia> meanad(x, med=median(x)) = mean(abs, x .- med)
meanad (generic function with 2 methods)

julia> meanad(b.times)    # in nanoseconds
1224.1091

julia> std(b.times)
12209.450625612251

I think anyone looking at that histogram would agree that 1 μs is much closer to the typical variability than 12 μs:

julia> sum(b.times .< median(b.times) + meanad(b.times)) / length(b.times)
0.8379

since 84% of the samples are less than median + 1 mean absolute deviation (20.7 μs).

@mcabbott

mcabbott commented Sep 18, 2021

In defence of brevity, it seems that `min 18.860 μs, mean 20.375 μs` does a fair job of capturing the "about 1 μs wide" sentiment here. And the left tail is usually truncated like this, so it's hard to have extreme outliers.

Unlike the right tail, of course. The max might have separately nice-to-know information, but (IMO) this right tail means the standard deviation is often useless.

@timholy
Member

timholy commented Sep 18, 2021

The mean can be way, way off too. min and median might be a fine choice, though.

@Moelf
Contributor Author

Moelf commented Sep 18, 2021

I don't mind showing more numbers; it's just that editing this:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.860 μs … 764.893 μs  ┊ GC (min … max): 0.00% … 93.99%
 Time  (median):     19.471 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.375 μs ±  12.209 μs  ┊ GC (mean ± σ):  0.99% ±  1.65%

   █▃ ▃▁                                                        
  ▇██▆██▄▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  18.9 μs         Histogram: frequency by time         28.2 μs <

 Memory estimate: 15.84 KiB, allocs estimate: 4.

into

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.860 μs … 764.893 μs
 Time  (median):     19.471 μs              
 Time  (mean ± σ):   20.375 μs ±  12.209 μs
 Memory estimate: 15.84 KiB, allocs estimate: 4.

takes way too many mouse clicks right now. Try pasting the above example 3 times into some 55-character-wide Slack thread panel.

When there's no color encoding, 15 numbers and block characters wrapped across lines make it just way too hard to spot the key information.

@mcabbott

mcabbott commented Sep 18, 2021

In a chat now lost, I was persuaded by someone that a fairly common situation is to have (say) 90% of samples almost at the minimum, and 10% a bit higher when GC kicks in (plus perhaps a high maximum). In this kind of distribution the median is less useful than the mean. (Edit: #237 (comment) is a nice example.)

The reason to print the histogram is that there's no one or two perfect numbers that are always what you should watch. But 5 (or in fact 7) times seems like a lot; it's hard to scan the page and remember where to look. Something like min/mean/max or min/median/max, which has an obvious sequence, would I think make it much easier to find the one you want; I guess min/median/mean/max isn't too bad either.

5 different GC numbers (of uncertain meaning) is way too many. Not printing them at all when they are all zero (which is common) would be great.

@timholy
Member

timholy commented Sep 19, 2021

Fair enough, if we're showing min and one other number, then mean is a better choice than median. I'd still guess that median and mean absolute deviation would be even more informative, but I'm too lazy to really check by e.g., computing the mutual information with the histogram itself. I'm fine with min and mean.

@mcabbott

mcabbott commented Sep 19, 2021

Here's an attempt... with #237 (comment)'s suggestion of the ~~95th~~ 99th percentile instead of max.

  • I think putting the times first, and in order, makes them easier to find.
  • The 99th percentile was already the right hand limit of the histogram, now less of a mystery? I've rounded these more aggressively too.
  • I added markers to the highlighted bars of the histogram.
  • It prints a few GC times when there are allocations. Maybe the right ones? Hopefully it's clear what the percentages are?

[Screenshot of the proposed display]

When copied:

julia> @benchmark sum(log.(x))  setup=(x=rand(1000))
BenchmarkTools.Trial:
│  minimum 3.994 μs, median 4.607 μs (½), mean 5.036 μs (*), 99ᵗʰ 6.423 μs
│  1 allocations, 7.94 KiB
│  GC time: mean 379.234 ns (7.53%), maximum 345.470 μs (98.25%)
│  10_000 samples, each 7 evaluations:
│                      ½           *                                           
│                  ▂▄▇█▇▆▂▂                                                    
│  ▁▁▂▂▂▂▃▄▄▅▄▄▄▄▅▇█████████▇▅▅▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁ ▃
└  3.9 μs                                                              6.5 μs +

@timholy
Member

timholy commented Sep 20, 2021

It's hard to tell because the actual data collected are different, but it seems to reprint the median, not the mean? (Trick: use `b = @benchmark ...` and then toggle back and forth between the two displays with `;git stash` and `;git stash pop`, assuming you're using Revise.)

But in general terms, the new display looks more easily interpretable for most users, and I like the decoration with markers for median and mean in the histogram!

@mcabbott

mcabbott commented Sep 20, 2021

It was indeed printing things twice, sorry; now fixed. (Weirdly, Revise isn't working for me, for the first time ever. My fault!)

@timholy
Member

timholy commented Sep 21, 2021

LGTM!

@tecosaur
Collaborator

I really like the ½ and the *. However, personally I'd lay it out slightly differently.

julia> @benchmark sum(log.(x))  setup=(x=rand(1000))
BenchmarkTools.Trial: 10,000 samples, each 7 evaluations:
│  minimum 3.994 μs, median 4.607 μs (½), mean 5.036 μs (*), 99ᵗʰ 6.423 μs
│  GC time: mean 379.234 ns (7.53%), maximum 345.470 μs (98.25%)
│  1 allocations, 7.94 KiB
│                      ½           *                                           
│                  ▂▄▇█▇▆▂▂                                                    
│  ▁▁▂▂▂▂▃▄▄▅▄▄▄▄▅▇█████████▇▅▅▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁ ▃
└  3.9 μs                                                         6.5 μs +

  • Put the two time lines (min/med/mean/99 and GC) together before the memory (allocs) line
  • Move the number of samples back up with the trials
  • Reduce the default width of the histogram to be no wider than the text
  • Use a comma instead of an underscore when printing 10000 (minor stylistic choice)

I haven't worked this into the above, but I think it would be nice to have at least the stdev of the GC. For example, consider these two sets of example GC times:

  • 9, 10, 12, 13, 14, 15
  • 0, 0.1, 0.1, 0.2, 15, 60

Both have a mean just over 12, but the first has a stdev of 2.3 while the second has a stdev of 24. So, I consider this information helpful in gauging how variable the GC time is.
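A quick check of those numbers (just restating the example above in code):

```julia
using Statistics

# The two example GC-time sets from the comment above.
gc_a = [9, 10, 12, 13, 14, 15]
gc_b = [0, 0.1, 0.1, 0.2, 15, 60]

mean(gc_a), mean(gc_b)  # ≈ 12.2, ≈ 12.6 — both "just over 12"
std(gc_a),  std(gc_b)   # ≈ 2.3,  ≈ 24.0 — wildly different variability
```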

@Moelf
Contributor Author

Moelf commented Sep 21, 2021

I think GC should go below allocation. We always care about time, and we always want to know whether the allocation is larger than we expected (sometimes we expect 0, or N * bytes, etc.). But GC can be something we don't, or can't, worry about (and it can be zero).

@mcabbott

Thanks for taking a close look. To explain my thinking a bit:

  • The GC times are below "2 allocations..." partly to give the 1st line breathing space, and partly so that you get to "0 allocations" before you get to the things which aren't there, when they aren't there. While all the times on the first line are comparable (different measures of the same thing), the GC times are not.

  • One reason to avoid 10,000 is that in many people's first language this is how decimals are written. The other reason is:

julia> 10,000
(10, 0)

Another possibility might be 10^4, since for large numbers it seems to have a strong preference for powers of 10.

  • The reason I moved the number of samples down is again partly breathing space for the 1st line, and partly to put the most important information first: if you want to copy the first few lines, this isn't the most important fact. Finally, the samples are literally what's plotted in the histogram, so it seems nice to say "10_000 samples, here they are:".

  • "stdev of the GC" is well-defined, but is it useful? Most runs usually have 0 GC time, but the point of the mean is to measure how much garbage you are generating per run (a property of your algorithm), regardless of how often the collector cleans up (which it decides, not you). I had it printing the 99th percentile GC time too, for a bit, but this is also usually zero.

julia> b = @benchmark sum(log.(x))  setup=(x=rand(1000));

julia> count(!iszero, b.gctimes), length(b.gctimes)
(7, 10000)

julia> mean(b.gctimes)  # 402 ns
402.1309428571428

julia> std(b.gctimes)  # 15 μs
15374.208785130586

julia> quantile(b.gctimes, 0.99)
0.0

  • The histogram is set to fit in 80 characters for now. I wondered about making it auto-size to match the first line, but then lines with times XXX.YYY will make it longer, so different ones won't line up.

Printing XXX.YYY is certainly too many digits. We could change the format to X.YYY, XX.YY, XXX.Y (a fixed number of significant digits), which is still more than are justified, but more consistent? Then they would be fixed width. I presume the reason this wasn't done is that groups of 3 fit the steps between ms, μs, ns?

I also wonder if there should be fewer digits on the histogram ends, partly in the hope that they'd more often line up between subsequent runs. Presently it rounds to 2 figures. Rounding to 1 means a step from 1.0 to 2.0, which is graphically too big, but maybe 10 steps per decade can be done nicely; not sure.

  • I made bars with zero counts print in light gray, to distinguish some from none. These will probably disappear on some people's themes (like paths in stack traces / `@warn`), but that's not super-confusing.

  • I thought the final < should be >, as it said "less than", but perhaps there is no right way around... perhaps + is less ambiguous? Or something else like ?
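The X.YYY / XX.YY / XXX.Y idea above could be sketched with a hypothetical `fmt` helper (an illustration, not a proposed implementation):

```julia
# Four significant digits, so values keep the same width within a unit
# whenever all four digits are significant, e.g. 1.384 / 19.47 / 345.5.
fmt(t) = string(round(t, sigdigits = 4))

fmt(1.384)   # "1.384"
fmt(19.471)  # "19.47"
fmt(345.47)  # "345.5"
```

One caveat: trailing zeros are dropped (`fmt(1.3)` gives "1.3", not "1.300"), so truly fixed-width output would also need zero-padding.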
