Come up with a better matcher that can handle large amounts of data #90

mdedetrich opened this issue Jan 24, 2022 · 0 comments
What is currently missing?

Unfortunately, because S3 has a large minimum chunk size (5 megabytes), our tests against S3 often have to compare large amounts of data. What invariably happens is that we compare very large data structures (typically 10 MB or more) with each other to check whether they are equal, which uses enough CPU that it can actually become a bottleneck.

How could this be improved?

There are two cases we have to deal with:

  1. Doing a very fast comparison to confirm that large data structures are equal
  2. If two data structures are not equal, finding a very fast way of figuring out what's wrong

In regards to 1, there isn't much we can do apart from potentially dealing with raw numbers rather than Strings, on the suspicion that ScalaCheck's generated Strings cause too many hash collisions, which slows down the equals method. The easiest solution here may be a custom hashCode that works better with the generated data. Alternatively, one can make sure that the data generated in io.aiven.guardian.kafka.Generators uses incrementing numbers, which produces fewer collisions; however, then you have to deal with serializing the strings into numbers when implementing the custom hashCode. A rough sketch of this idea is shown below.
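A minimal sketch of what incrementing numeric keys plus a cached custom hashCode could look like; the `FastRecord` shape and `fastRecordGen` generator are illustrative assumptions, not the actual types in io.aiven.guardian.kafka.Generators:

```scala
import java.util.concurrent.atomic.AtomicLong

import org.scalacheck.Gen

object FastRecordSketch {
  // Incrementing counter so generated keys never collide.
  private val counter = new AtomicLong(0L)

  // Hypothetical record shape: a numeric key plus a large generated payload.
  final case class FastRecord(key: Long, payload: String) {
    // Precompute the hash once so repeated equality checks over large
    // collections don't rehash the multi-megabyte payload every time.
    override val hashCode: Int = (java.lang.Long.hashCode(key) * 31) + payload.##

    override def equals(other: Any): Boolean = other match {
      case that: FastRecord =>
        // Cheap rejections first: cached hash, then the numeric key, and only
        // then the (potentially very large) payload.
        this.hashCode == that.hashCode && this.key == that.key && this.payload == that.payload
      case _ => false
    }
  }

  // Generator pairing an incrementing key with an arbitrary payload.
  val fastRecordGen: Gen[FastRecord] =
    Gen.alphaStr.map(payload => FastRecord(counter.getAndIncrement(), payload))
}
```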

In regards to 2, we are currently using https://github.com/softwaremill/diffx to create nice diffs when the two data structures are not equal, however diffx isn't designed to handle large data structures nicely. More specifically, it can handle cases where a single value in a data structure is wrong or missing, but what typically happens in our S3 tests is not that a single value is missing but that entire chunks of data are missing (i.e. the backup files for a single chunk). Using algorithms that are very fast at detecting these missing large chunks of data (and at handling overlapping data), and only then falling back to slower/more general methods, could greatly reduce the amount of CPU time being used. In other words, we want a fast path for the most common cause of test failures and then fall back to the slower/current algorithm if possible; see the sketch after this paragraph.
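One possible shape for such a fast path (illustrative only: `ChunkAwareMatcher`, the `Long`-keyed maps and the injected `fallbackDiff` are assumptions, with the existing diffx-based comparison plugged in as the fallback):

```scala
object ChunkAwareMatcher {

  /** Result of the fast comparison path. */
  sealed trait MatchResult
  case object Matched extends MatchResult
  final case class MissingChunks(missingOnLeft: Set[Long], missingOnRight: Set[Long]) extends MatchResult
  final case class FellBack(description: String) extends MatchResult

  def compare[A](
      left: Map[Long, A],
      right: Map[Long, A],
      fallbackDiff: (Map[Long, A], Map[Long, A]) => String
  ): MatchResult =
    // Fast path 1: fully equal, nothing more to do.
    if (left == right) Matched
    else {
      // Fast path 2: whole chunks missing on either side. Comparing key sets
      // is cheap relative to diffing multi-megabyte payloads.
      val missingOnRight = left.keySet.diff(right.keySet)
      val missingOnLeft  = right.keySet.diff(left.keySet)
      if (missingOnLeft.nonEmpty || missingOnRight.nonEmpty)
        MissingChunks(missingOnLeft, missingOnRight)
      else
        // Slow path: key sets match but some values differ, so defer to the
        // existing general-purpose diff (e.g. diffx) for a detailed report.
        FellBack(fallbackDiff(left, right))
    }
}
```

With this shape, whole missing chunks are reported immediately from the key-set comparison, and the slower diffx-based diff only runs when the key sets actually match.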

Is this a feature you would work on yourself?

  • I plan to open a pull request for this feature