
Suggestion: Box Plot Summary Charts #377

Open · thomasahle opened this issue Apr 13, 2023 · 12 comments
@thomasahle (Contributor) commented Apr 13, 2023

As the number of ANN libraries grows, the QPS-vs-recall curves are getting harder to read.
On top of that, each curve only represents one dataset.
Could there be a graphical representation that combines the information from all the curves into a single, easy-to-read plot?

I suggest taking inspiration from The Computer Language 23.03 Benchmarks Game.
[Image: "fastest" box plot summary chart from the Benchmarks Game]

In more detail:

  1. Convert each plot into a single score per algorithm. I suggest taking the area under the curve (AUC) from recall = .5 to recall = 1.
  2. For each dataset, divide each algorithm's score by the largest AUC of any algorithm. This makes the scores easy to compare across datasets.
  3. Compute the mean normalized AUC for each algorithm, and list the algorithms from high to low. Possibly also include standard deviations, as in the attached box plot summary chart.

It should be pretty easy to compute given the raw data (see the sketch below). I don't think the raw data is in the git repository, though?
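A minimal sketch of steps 1–3, assuming each curve arrives as (recalls, qps) arrays; the `results` layout and function names here are hypothetical, not ann-benchmarks' actual API:

```python
import numpy as np

def auc_score(recalls, qps, lo=0.5, hi=1.0):
    """Area under the QPS-vs-recall curve, restricted to recall in [lo, hi]."""
    order = np.argsort(recalls)
    r, q = np.asarray(recalls)[order], np.asarray(qps)[order]
    mask = (r >= lo) & (r <= hi)
    return np.trapz(q[mask], r[mask])

def ranked_scores(results):
    """results[dataset][algorithm] = (recalls, qps) -- hypothetical layout.
    Returns (algorithm, list of normalized AUCs) pairs, best mean first."""
    scores = {}
    for dataset, algos in results.items():
        aucs = {a: auc_score(r, q) for a, (r, q) in algos.items()}
        best = max(aucs.values())
        for a, v in aucs.items():
            scores.setdefault(a, []).append(v / best)  # step 2: normalize per dataset
    return sorted(scores.items(), key=lambda kv: -np.mean(kv[1]))  # step 3: rank by mean
```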

@maumueller (Collaborator)

Great suggestion, would be interesting to see such a plot. Do you have the time to contribute this, Thomas? (Because I don't :-( )

I've uploaded the raw results from the December 2021 run here. The easiest way forward is probably to use data_export.py to create a csv file summarizing all results and creating the boxplot from that.

With regard to the AUC, I suppose you don't want to go all the way up to recall 1, but rather stop at .9 or .95. Taking the highest QPS with recall above .9 would be even simpler?
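For whoever picks this up, a hedged sketch of that simpler route, from an exported CSV to the boxplot; the column names ("algorithm", "dataset", "recall", "qps") are guesses at what data_export.py emits and may need adjusting:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results.csv")      # hypothetical path to the data_export.py output
good = df[df["recall"] >= 0.9]       # "highest QPS with recall above .9"
score = good.groupby(["algorithm", "dataset"])["qps"].max()
score = score / score.groupby("dataset").transform("max")  # per-dataset normalization

groups = {algo: s.values for algo, s in score.groupby("algorithm")}
plt.boxplot(list(groups.values()), labels=list(groups.keys()))
plt.ylabel("QPS relative to the best algorithm, per dataset")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```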

@erikbern (Owner)

My hesitation with box plots is that they force us to pick more parameters.

What about just breaking the chart up into a few different subplots with, say, 5 algos on each? We could do this alphabetically; the bucketing is not super important.

@thomasahle (Contributor, Author)

> I've uploaded the raw results from the December 2021 run here. The easiest way forward is probably to use data_export.py to create a csv file summarizing all results and creating the boxplot from that.

Nice, have you thought about putting it on the GitHub repo as a "release"? You can host files there up to 2GB. Your file is 2.17GB, so either compress it a bit more or split it in two.

> With regard to the AUC, I suppose you don't want to go all the way up to recall 1, but rather stop at .9 or .95. Taking the highest QPS with recall above .9 would be even simpler?

I was going to just add a point at (qps=0, recall=1) to all the curves.
I think AUC as a measure for summarizing this kind of tradeoff curve is quite standard, and it fits well with how we perceive the curves when looking at them.

An alternative could be sampling the qps at recalls .5, .6, etc., and considering those as separate tasks/datasets. That is, for each recall we find the top qps and divide every other score by that, then add it to the pool of values for the boxplot.
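Roughly, that alternative would look like this (hypothetical helpers, same assumed (recalls, qps) curve format as in my first sketch):

```python
import numpy as np

def qps_at_recalls(recalls, qps, grid=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Interpolate an algorithm's QPS at fixed recall levels along its curve."""
    order = np.argsort(recalls)
    return np.interp(grid, np.asarray(recalls)[order], np.asarray(qps)[order])

def pooled_scores(algos):
    """algos: algorithm -> (recalls, qps) for one dataset.
    Each recall level contributes one normalized value to the boxplot pool."""
    sampled = {a: qps_at_recalls(r, q) for a, (r, q) in algos.items()}
    best = np.max(list(sampled.values()), axis=0)  # top QPS at each recall level
    return {a: s / best for a, s in sampled.items()}
```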

> What about just breaking the chart up into a few different subplots with, say, 5 algos on each? We could do this alphabetically; the bucketing is not super important.

This solves the first problem (the charts are hard to read), but makes the second problem worse (there are too many charts).
Though, something that might be interesting is to "bucket" based on build time / memory usage. That way it would be like partitioning into boxing weight classes.

@maumueller (Collaborator)

> > I've uploaded the raw results from the December 2021 run here. The easiest way forward is probably to use data_export.py to create a csv file summarizing all results and creating the boxplot from that.

> Nice, have you thought about putting it on the GitHub repo as a "release"? You can host files there up to 2GB. Your file is 2.17GB, so either compress it a bit more or split it in two.

I didn't know the limit was that high; good suggestion.

@thomasahle (Contributor, Author)

> The easiest way forward is probably to use data_export.py to create a csv file summarizing all results and creating the boxplot from that.

What does this script actually do? It seems to be downloading every dataset to my computer, even though I was just interested in the timing results. Also, how are the timing results 6GB?

@thomasahle (Contributor, Author)

Ok, this is what it looks like, using the above process:
[Image: Figure_1 — box plot summary of the normalized scores per algorithm]

Code here: https://gist.github.com/thomasahle/edf90cad1c074c0f589ae46be758485d

@erikbern (Owner)

Nice – I think this is clearly much better from a legibility perspective, but a bit trickier from an interpretability perspective.

Is there some way to make the score (y-axis) something more natural? I'm thinking you could fix the recall to some distribution and compute the average weighted QPS as the score, or vice versa (fix the QPS to some distribution and compute the average weighted recall). Would something like that work?
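One possible reading of that idea, as a rough sketch (the recall grid and the weights favoring high recall are illustrative assumptions, not a settled choice):

```python
import numpy as np

def expected_qps(recalls, qps, levels=np.linspace(0.5, 1.0, 11), weights=None):
    """Average QPS under a fixed distribution over recall levels.
    Caveat: np.interp clamps outside the curve's recall range, which slightly
    flatters algorithms that never reach high recall."""
    weights = levels if weights is None else weights  # illustrative: weight high recall more
    order = np.argsort(recalls)
    q = np.interp(levels, np.asarray(recalls)[order], np.asarray(qps)[order])
    return np.average(q, weights=weights)
```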

@thomasahle (Contributor, Author)

You are right that the tricky thing is the freedom we get in defining the score.
We could use a different recall distribution. The above plot basically corresponds to the uniform distribution over [.5, 1].

There's also a question of whether to use log(qps) or qps when computing the area.
I plotted log(qps) here: https://twitter.com/thomasahle/status/1646594694116417536
It doesn't actually shift things very much. Hopefully most reasonable transformations result in roughly the same plot.

@maumueller (Collaborator)

Nice work! I agree that it's a great summary, but also that it's difficult to interpret. It's also very curious that the hnsw implementations score so differently.

> > The easiest way forward is probably to use data_export.py to create a csv file summarizing all results and creating the boxplot from that.

> What does this script actually do? It seems to be downloading every dataset to my computer, even though I was just interested in the timing results. Also, how are the timing results 6GB?

The size comes from the raw results containing all the actual answers to the search queries, which allows computing other metrics without re-running everything. It downloads the datasets because the ground truth is attached to them. (But since the raw results probably had the recall values cached, this was probably not necessary.)

@thomasahle (Contributor, Author)

Is this more interpretable? It uses the same language as the Benchmarks Game.

[Image: Figure_1 — the summary chart restyled after the Benchmarks Game]

@erikbern (Owner)

Yeah, I think so – this is just normalized 1/QPS, right? Maybe just plotting QPS would make it more interpretable?

Happy to play around with this too btw. I want to prioritize running the benchmarks first though, but I'll soon get to the point of having to plot the results :)

@thomasahle (Contributor, Author)

> this is just normalized 1/QPS, right?

It's actually 1/(normalized QPS), which was just easier. But yeah, I should probably do normalized 1/QPS instead :-)
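For intuition, with made-up numbers: since the per-dataset score averages QPS over the recall range (as in the AUC above), inverting before vs. after averaging gives different values — the inverse of a mean is not the mean of inverses:

```python
qps = [100.0, 400.0]                              # hypothetical QPS at two recall levels
inv_of_mean = 1 / (sum(qps) / len(qps))           # 1/(mean QPS) = 0.004
mean_of_inv = sum(1 / q for q in qps) / len(qps)  # mean(1/QPS)  = 0.00625
```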

> Maybe just plotting QPS would make it more interpretable?

The only reason for normalizing is to have a standard scale for each dataset.

> Happy to play around with this too btw. I want to prioritize running the benchmarks first though, but I'll soon get to the point of having to plot the results :)

Definitely, happy to leave it to you from here :-)
