
Suggestion: Box Plot Summary Charts #377

Open · thomasahle opened this issue Apr 13, 2023 · 12 comments
@thomasahle (Contributor) commented Apr 13, 2023

As the number of ANN libraries grows, the QPS-vs-recall curves are getting harder to read.
On top of that, each curve only represents one dataset.
Could there be a graphical representation that combines the information from all the curves into a single, easy-to-read plot?

I suggest taking inspiration from The Computer Language 23.03 Benchmarks Game.
[Image: "fastest" box plot summary chart from the Benchmarks Game]

In more detail:

  1. Convert each plot into a single score per algorithm. I suggest taking the area under the curve (AUC) from recall = .5 to recall = 1.
  2. For each dataset, divide each algorithm's score by the largest AUC of any algorithm. This makes the scores easy to compare across datasets.
  3. Compute the mean normalized AUC for each algorithm, and list the algorithms from high to low. Possibly also include standard deviations, as in the attached box plot summary chart.

It should be pretty easy to compute given the raw data (see the sketch below). I don't think the raw data is in the git repository, though?
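A minimal sketch of steps 1–3, assuming each curve arrives as (recalls, qps) arrays; the `results` layout and function names here are hypothetical, not ann-benchmarks' actual API:

```python
import numpy as np

def auc_score(recalls, qps, lo=0.5, hi=1.0):
    """Area under the QPS-vs-recall curve, restricted to recall in [lo, hi]."""
    order = np.argsort(recalls)
    r, q = np.asarray(recalls)[order], np.asarray(qps)[order]
    mask = (r >= lo) & (r <= hi)
    return np.trapz(q[mask], r[mask])

def ranked_scores(results):
    """results[dataset][algorithm] = (recalls, qps) -- hypothetical layout.
    Returns (algorithm, list of normalized AUCs) pairs, best mean first."""
    scores = {}
    for dataset, algos in results.items():
        aucs = {a: auc_score(r, q) for a, (r, q) in algos.items()}
        best = max(aucs.values())
        for a, v in aucs.items():
            scores.setdefault(a, []).append(v / best)  # step 2: normalize per dataset
    return sorted(scores.items(), key=lambda kv: -np.mean(kv[1]))  # step 3: rank by mean
```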

@maumueller (Collaborator)

Great suggestion, would be interesting to see such a plot. Do you have the time to contribute this, Thomas? (Because I don't :-( )

I've uploaded the raw results from the December 2021 run here. The easiest way forward is probably to use data_export.py to create a csv file summarizing all results and creating the boxplot from that.

With regard to the AUC, I suppose you don't want to go all the way up to recall 1, but rather stop at .9 or .95. Taking the highest QPS with recall above .9 would be even simpler?
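For whoever picks this up, a hedged sketch of that simpler route, from an exported CSV to the boxplot; the column names ("algorithm", "dataset", "recall", "qps") are guesses at what data_export.py emits and may need adjusting:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results.csv")      # hypothetical path to the data_export.py output
good = df[df["recall"] >= 0.9]       # "highest QPS with recall above .9"
score = good.groupby(["algorithm", "dataset"])["qps"].max()
score = score / score.groupby("dataset").transform("max")  # per-dataset normalization

groups = {algo: s.values for algo, s in score.groupby("algorithm")}
plt.boxplot(list(groups.values()), labels=list(groups.keys()))
plt.ylabel("QPS relative to the best algorithm, per dataset")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```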

@erikbern (Owner)

My hesitation with box plots is that they force us to pick more parameters.

What about just breaking the chart up into a few different subplots with, say, 5 algos on each? We could do this alphabetically; the bucketing is not super important.

@thomasahle (Contributor, Author)

> I've uploaded the raw results from the December 2021 run here. The easiest way forward is probably to use data_export.py to create a csv file summarizing all results and creating the boxplot from that.

Nice, have you thought about putting it on the GitHub repo as a "release"? You can host files there up to 2GB. Your file is 2.17GB, so either compress it a bit more or split it in two.

> With regard to the AUC, I suppose you don't want to go all the way up to recall 1, but rather stop at .9 or .95. Taking the highest QPS with recall above .9 would be even simpler?

I was going to just add a point at (qps=0, recall=1) to all the curves.
I think AUC as a measure for summarizing this kind of tradeoff curve is quite standard, and it fits well with how we perceive the curves when looking at them.

An alternative could be sampling the qps at recalls .5, .6, etc., and considering those as separate tasks/datasets. That is, for each recall we find the top qps and divide every other score by that, then add it to the pool of values for the boxplot.
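Roughly, that alternative would look like this (hypothetical helpers, same assumed (recalls, qps) curve format as in my first sketch):

```python
import numpy as np

def qps_at_recalls(recalls, qps, grid=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Interpolate an algorithm's QPS at fixed recall levels along its curve."""
    order = np.argsort(recalls)
    return np.interp(grid, np.asarray(recalls)[order], np.asarray(qps)[order])

def pooled_scores(algos):
    """algos: algorithm -> (recalls, qps) for one dataset.
    Each recall level contributes one normalized value to the boxplot pool."""
    sampled = {a: qps_at_recalls(r, q) for a, (r, q) in algos.items()}
    best = np.max(list(sampled.values()), axis=0)  # top QPS at each recall level
    return {a: s / best for a, s in sampled.items()}
```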

> What about just breaking the chart up into a few different subplots with, say, 5 algos on each? We could do this alphabetically; the bucketing is not super important.

This solves the first problem (the charts are hard to read), but makes the second problem worse (there are too many charts).
Though, something that might be interesting is to "bucket" based on build time / memory usage. That way it would be like partitioning into boxing weight classes.

@maumueller (Collaborator)

> > I've uploaded the raw results from the December 2021 run here. The easiest way forward is probably to use data_export.py to create a csv file summarizing all results and creating the boxplot from that.

> Nice, have you thought about putting it on the GitHub repo as a "release"? You can host files there up to 2GB. Your file is 2.17GB, so either compress it a bit more or split it in two.

I didn't know the limit was that high; good suggestion.

@thomasahle (Contributor, Author)

> The easiest way forward is probably to use data_export.py to create a csv file summarizing all results and creating the boxplot from that.

What does this script actually do? It seems to be downloading every dataset to my computer, even though I was just interested in the timing results. Also, how are the timing results 6GB?

@thomasahle (Contributor, Author)

Ok, this is what it looks like, using the above process:
[Image: Figure_1 — box plot summary of the normalized scores per algorithm]

Code here: https://gist.github.com/thomasahle/edf90cad1c074c0f589ae46be758485d

@erikbern (Owner)

Nice – I think this is clearly much better from a legibility perspective, but a bit trickier from an interpretability perspective.

Is there some way to make the score (y-axis) something more natural? I'm thinking you could fix the recall to some distribution and compute the average weighted QPS as the score, or vice versa (fix the QPS to some distribution and compute the average weighted recall). Would something like that work?
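One possible reading of that idea, as a rough sketch (the recall grid and the weights favoring high recall are illustrative assumptions, not a settled choice):

```python
import numpy as np

def expected_qps(recalls, qps, levels=np.linspace(0.5, 1.0, 11), weights=None):
    """Average QPS under a fixed distribution over recall levels.
    Caveat: np.interp clamps outside the curve's recall range, which slightly
    flatters algorithms that never reach high recall."""
    weights = levels if weights is None else weights  # illustrative: weight high recall more
    order = np.argsort(recalls)
    q = np.interp(levels, np.asarray(recalls)[order], np.asarray(qps)[order])
    return np.average(q, weights=weights)
```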

@thomasahle (Contributor, Author)

You are right that the tricky thing is the freedom we get in defining the score.
We could use a different recall distribution. The above plot basically corresponds to the uniform distribution over [.5, 1].

There's also a question of whether to use log(qps) or qps when computing the area.
I plotted log(qps) here: https://twitter.com/thomasahle/status/1646594694116417536
It doesn't actually shift things very much. Hopefully most reasonable transformations result in roughly the same plot.

@maumueller (Collaborator)

Nice work! I agree that it's a great summary, but also that it's difficult to interpret. It's also very curious that the hnsw implementations score so differently.

> > The easiest way forward is probably to use data_export.py to create a csv file summarizing all results and creating the boxplot from that.

> What does this script actually do? It seems to be downloading every dataset to my computer, even though I was just interested in the timing results. Also, how are the timing results 6GB?

The size comes from the raw results containing all the actual answers to the search queries, which allows computing other metrics without re-running everything. It downloads the datasets because the ground truth is attached to them. (But since the raw results probably had the recall values cached, this was probably not necessary.)

@thomasahle (Contributor, Author)

Is this more interpretable? It uses the same language as the Benchmarks Game.

[Image: Figure_1 — the summary chart restyled after the Benchmarks Game]

@erikbern (Owner)

Yeah, I think so – this is just normalized 1/QPS, right? Maybe just plotting QPS would make it more interpretable?

Happy to play around with this too btw. I want to prioritize running the benchmarks first though, but I'll soon get to the point of having to plot the results :)

@thomasahle (Contributor, Author)

> this is just normalized 1/QPS, right?

It's actually 1/(normalized QPS), which was just easier. But yeah, I should probably do normalized 1/QPS instead :-)
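For intuition, with made-up numbers: since the per-dataset score averages QPS over the recall range (as in the AUC above), inverting before vs. after averaging gives different values — the inverse of a mean is not the mean of inverses:

```python
qps = [100.0, 400.0]                              # hypothetical QPS at two recall levels
inv_of_mean = 1 / (sum(qps) / len(qps))           # 1/(mean QPS) = 0.004
mean_of_inv = sum(1 / q for q in qps) / len(qps)  # mean(1/QPS)  = 0.00625
```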

> Maybe just plotting QPS would make it more interpretable?

The only reason for normalizing is to have a standard scale for each dataset.

> Happy to play around with this too btw. I want to prioritize running the benchmarks first though, but I'll soon get to the point of having to plot the results :)

Definitely, happy to leave it to you from here :-)
