Add Hierarchical Performance Testing (HPT) technique to compare_to? #168

Open
mdboom opened this issue Aug 25, 2023 · 3 comments

Comments

@mdboom
Collaborator

mdboom commented Aug 25, 2023

I recently came across a technique for distilling benchmark measurements into a single number that takes into account the fact that some benchmarks are more consistent/reliable than others, called Hierarchical Performance Testing (HPT). There is an implementation (in bash!!!) for the PARSEC benchmark suite. I ported it to Python and ran it over the big Faster CPython data set.

The results are pretty useful -- for example, while a lot of the main specialization work in 3.11 has a reliability of 100%, some recent changes to the GC have a speed improvement but with a lower reliability, accounting for the fact that GC changes have a lot more randomness (more moving parts and interactions with other things happening in the OS). I think this reliability number, along with the more stable "expected speedup at the 99th percentile", is a lot more useful for evaluating a change (especially small changes) than the geometric mean. I did not, however, see the massive 3.5x discrepancy between the 99th percentile number and the geometric mean reported in the paper (on a different dataset).
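For context, the overall shape of the computation looks roughly like this. This is a hedged sketch of the general HPT recipe, not the ported script and not anything proposed for pyperf; the function name, the dict-of-timings input layout, the 0.001 scan step, and the use of SciPy's Wilcoxon test are all illustrative choices:

```python
from statistics import median
from scipy import stats


def hpt(old, new, confidence=0.99):
    """old/new: dict mapping benchmark name -> list of raw timings in seconds (illustrative layout)."""
    names = sorted(old)
    old_medians = [median(old[n]) for n in names]
    new_medians = [median(new[n]) for n in names]

    # Cross-benchmark reliability: one-sided Wilcoxon signed-rank test on the
    # paired per-benchmark medians, testing "new is faster than old".
    _, p = stats.wilcoxon(old_medians, new_medians, alternative="greater")
    reliability = 1.0 - p

    # Speedup at the requested confidence: the largest factor s such that the
    # new timings, scaled up by s, still test as faster than the old ones.
    speedup = 1.0
    for i in range(1, 10000):
        s = 1.0 + i / 1000.0
        scaled = [m * s for m in new_medians]
        _, p = stats.wilcoxon(old_medians, scaled, alternative="greater")
        if 1.0 - p < confidence:
            break
        speedup = s
    return reliability, speedup
```

(As I understand it, the full method in the paper also applies a per-benchmark rank-sum test before aggregating; the sketch above skips that layer for brevity.)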

Is there interest in adding this metric to the output of pyperf's compare_to command?

@vstinner
Member

In pyperf, I tried to give users the choice of how to display data, and not to make decisions for them. That's why it stores all timings, not just min / avg / max. If there is a way to render the data differently without losing the old way, why not. The implementation looks quite complicated.

@mdboom
Collaborator Author

mdboom commented Aug 28, 2023

Yes, to be clear, this wouldn't change how the raw data is stored in the .json files at all -- in fact, it's because all of the raw data is retained that this can easily be computed after data collection.

I would suggest adding a flag (e.g. --hpt) to the compare_to command that would add the values from HPT to the bottom of the report. Does that make sense? If so, I'll work up a PR. My current implementation uses Numpy, but for pyperf it's probably best not to add that as a dependency. A pure Python implementation shouldn't be unusably slow (it's not very heavy computation).
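As a rough illustration of why pure Python should be fine (my own sketch, not part of any planned PR): the rank statistic at the heart of HPT-style comparisons is a Wilcoxon signed-rank test, which fits in a few dozen lines using the normal approximation. The function name here is hypothetical, and tie/continuity corrections are omitted for brevity:

```python
from math import erfc, sqrt


def wilcoxon_signed_rank_greater(x, y):
    """One-sided p-value for the hypothesis that x tends to exceed its pair in y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0

    # Rank the absolute differences, averaging the ranks of ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum of the ranks of the positive differences, compared against its null
    # distribution via the normal approximation.
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 0.5 * erfc(z / sqrt(2))
```

For small benchmark suites the exact null distribution could be tabulated instead, but the approximation keeps it dependency-free.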

@vstinner
Member

If it's a new option, and it doesn't change the default, I'm fine with it. The problem is just how to explain it in the docs, briefly and with simple words 😬
