Come up with a better matcher that can handle large amounts of data #90

mdedetrich opened this issue Jan 24, 2022 · 0 comments
What is currently missing?

Unfortunately, because S3 has a large minimum chunk size (5 megabytes), our tests against S3 often have to compare large amounts of data. What invariably happens is that we compare very large data structures (typically 10 MB or more) with each other to check whether they are equal, which uses enough CPU that it can actually become a bottleneck.

How could this be improved?

There are two cases we have to deal with:

  1. Doing a very fast comparison to confirm that large data structures are equal
  2. If two data structures are not equal, finding a very fast way of figuring out what's wrong

In regards to 1, there isn't much we can do apart from potentially dealing with raw numbers rather than Strings, on the suspicion that ScalaCheck's generated Strings cause too many hash collisions, which slows down the equals method. The easiest solution here may be a custom hashCode that works better with the generated data. Alternatively, one can make sure that the data generated in io.aiven.guardian.kafka.Generators uses incrementing numbers, which produces fewer collisions; however, then you have to deal with serializing the strings into numbers when implementing the custom hashCode. A rough sketch of this idea is shown below.
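A minimal sketch of what incrementing numeric keys plus a cached custom hashCode could look like; the `FastRecord` shape and `fastRecordGen` generator are illustrative assumptions, not the actual types in io.aiven.guardian.kafka.Generators:

```scala
import java.util.concurrent.atomic.AtomicLong

import org.scalacheck.Gen

object FastRecordSketch {
  // Incrementing counter so generated keys never collide.
  private val counter = new AtomicLong(0L)

  // Hypothetical record shape: a numeric key plus a large generated payload.
  final case class FastRecord(key: Long, payload: String) {
    // Precompute the hash once so repeated equality checks over large
    // collections don't rehash the multi-megabyte payload every time.
    override val hashCode: Int = (java.lang.Long.hashCode(key) * 31) + payload.##

    override def equals(other: Any): Boolean = other match {
      case that: FastRecord =>
        // Cheap rejections first: cached hash, then the numeric key, and only
        // then the (potentially very large) payload.
        this.hashCode == that.hashCode && this.key == that.key && this.payload == that.payload
      case _ => false
    }
  }

  // Generator pairing an incrementing key with an arbitrary payload.
  val fastRecordGen: Gen[FastRecord] =
    Gen.alphaStr.map(payload => FastRecord(counter.getAndIncrement(), payload))
}
```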

In regards to 2, we are currently using https://github.com/softwaremill/diffx to create nice diffs when the two data structures are not equal, however diffx isn't designed to handle large data structures nicely. More specifically, it can handle cases where a single value in a data structure is wrong or missing, but what typically happens in our S3 tests is not that a single value is missing but that entire chunks of data are missing (i.e. the backup files for a single chunk). Using algorithms that are very fast at detecting these missing large chunks of data (and at handling overlapping data), and only then falling back to slower/more general methods, could greatly reduce the amount of CPU time being used. In other words, we want a fast path for the most common cause of test failures and then fall back to the slower/current algorithm if possible; see the sketch after this paragraph.
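One possible shape for such a fast path (illustrative only: `ChunkAwareMatcher`, the `Long`-keyed maps and the injected `fallbackDiff` are assumptions, with the existing diffx-based comparison plugged in as the fallback):

```scala
object ChunkAwareMatcher {

  /** Result of the fast comparison path. */
  sealed trait MatchResult
  case object Matched extends MatchResult
  final case class MissingChunks(missingOnLeft: Set[Long], missingOnRight: Set[Long]) extends MatchResult
  final case class FellBack(description: String) extends MatchResult

  def compare[A](
      left: Map[Long, A],
      right: Map[Long, A],
      fallbackDiff: (Map[Long, A], Map[Long, A]) => String
  ): MatchResult =
    // Fast path 1: fully equal, nothing more to do.
    if (left == right) Matched
    else {
      // Fast path 2: whole chunks missing on either side. Comparing key sets
      // is cheap relative to diffing multi-megabyte payloads.
      val missingOnRight = left.keySet.diff(right.keySet)
      val missingOnLeft  = right.keySet.diff(left.keySet)
      if (missingOnLeft.nonEmpty || missingOnRight.nonEmpty)
        MissingChunks(missingOnLeft, missingOnRight)
      else
        // Slow path: key sets match but some values differ, so defer to the
        // existing general-purpose diff (e.g. diffx) for a detailed report.
        FellBack(fallbackDiff(left, right))
    }
}
```

With this shape, whole missing chunks are reported immediately from the key-set comparison, and the slower diffx-based diff only runs when the key sets actually match.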

Is this a feature you would work on yourself?

  • I plan to open a pull request for this feature