
WIP: Speed up a few slowdowns when handling large datasets #522

Open · flixha opened this issue Dec 8, 2022 · 11 comments
@flixha
Collaborator

flixha commented Dec 8, 2022

This is a summary thread for a few slowdowns that I noticed when handling large-ish datasets (e.g., 15000 templates x 500 channels x 1800 samples). I'm not calling them "bottlenecks" because it is rather the sum of these things together that takes extra time; none of these slowdowns changes EQcorrscan fundamentally.

I will create a PR for each point where I have a suggested solution, so that we can systematically merge, improve, or reject the suggestions. I will add the links to the PRs here, but I'll need to organize a bit for that.

Here are some slowdowns (tests with python 3.11, all in serial code):

Is your feature request related to a problem? Please describe.
All the points in the upper list occur in parts of the code where parallelization cannot help speed up execution. When running EQcorrscan on a big cluster, it's wasteful to spend as much time reorganizing data in serial as it takes to run the well-parallelized template-matching correlations etc.

Three more slowdowns where parallelization can help:

    8. core.match_filter: 2.5x speedup for MAD threshold calc
    • np.median(np.abs(cccsum)) for each cccsum takes a lot of time when there are many cccsum in cccsums. The only quicker solution I found was to parallelize the operation, which surprisingly already pays off for problems with more than ~15 cccsum. The speedup is only ~2.5x, so even though that matters a lot for many cccsum (e.g., 2000: 20 s vs 50 s), it feels like this has potential for even more speedup (see the sketch after this list).
    • PR: Speedup 07: 2.5x speedup with parallel parallel median / MAD calculation #531
    9. detection._calculate_event: 35% speedup in parallel
    • Calling this for many detections is slow when a lot of events need to be created. Parallelization can help speed this up a bit (35% for 460 detections in the test case).
    10. utils.catalog_to_dd.write_correlations: 20% speedup using some shared memory
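For context, a minimal sketch of the parallel MAD calculation idea in point 8, assuming cccsums is a 2D numpy array with one cross-correlation sum per template; this is just the pattern, not the implementation in PR #531:

```python
import numpy as np
from multiprocessing import Pool


def _mad(cccsum):
    # Statistic used for MAD-style thresholding of one cross-correlation sum
    return np.median(np.abs(cccsum))


def parallel_mad(cccsums, processes=4):
    """Parallel version of: [np.median(np.abs(c)) for c in cccsums]."""
    with Pool(processes=processes) as pool:
        return np.array(pool.map(_mad, cccsums))


# Hypothetical usage: thresholds = input_threshold * parallel_mad(cccsums)
```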
@flixha
Collaborator Author

flixha commented Dec 12, 2022

I have now added 5 PRs for the upper list that you can have a look at - the links to the PRs have been added to the list.

I'll prepare the PRs for the lower list (points 8-10) in a bit, but these could maybe benefit from some extra ideas.

@calum-chamberlain
Member

Awesome @flixha - thanks for this! Sorry for no response - hopefully you received at least one out of office email from me - I have been in the field for about the last month and it is now our shutdown period so I likely won't get to this until January. Sorry again, and thanks for all of this - this looks like it is going to be very valuable!

@calum-chamberlain
Member

Hey @flixha - Happy new year! I'm not back at work yet, but I wanted to get the CI tests running again to try and help you out with those PRs - it looks like a few PRs are either failing or need tests. Happy to have more of a look in the next week or so.

@flixha
Collaborator Author

flixha commented Jan 3, 2023

Happy New Year @calum-chamberlain ! Yes I did see that you were in the field, and since I posted the PRs just before the holiday break I certainly did not expect a very quick reply - was just a good way for me to finally cut these changes into compact PRs. Thanks a lot for already fixing the tests and going through all the PRs!

@flixha
Collaborator Author

flixha commented Jan 9, 2023

I'm looking into two more points right now which can slow down things for larger datasets:

  1. matched_filter loops over all templates to make families with the related detections. When adding a family to the party (see this line), there is another loop in which the template for the family being added is compared to each template that already forms a family in the party. That comparison comprises a check of all properties, including the trace data and event properties, and hence is not super cheap. It also scales as O(n^2) with the number of templates/families, so in an example with 15500 templates there were 544,561,637 calls to template.__eq__, costing 250 seconds in total.
    • I think we don't really need the full check for existing families in matched_filter as long as the tribe doesn't contain duplicated templates. We already have other checks for that, so a simple concatenation without the checks should suffice here?
    • i.e., party.families.append(family) instead of party += family (see the sketch after this list)
  2. For the full dataset, utils.correlate._get_array_dicts takes 400 seconds to prepare the data, which is in part due to handling a lot of UTCDateTime, sorting the traces in each template, and other list / numpy operations. I'm looking into how to speed that up... (PR here: Speedup 10: 25% quicker _get_array_dicts and 10% quicker _prep_data_for_correlation #536)
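To illustrate the append idea in point 1, a hedged sketch assuming party is an existing EQcorrscan Party and new_families is a list of Family objects built in the detection loop:

```python
# Current pattern: every "+=" compares the incoming family's template against
# each template already in the party via Template.__eq__ (trace data, event
# properties, ...), giving O(n^2) comparisons over the number of families.
for family in new_families:
    party += family

# Suggested shortcut when the tribe is already known to contain only unique
# templates (duplicates are caught by other checks earlier):
for family in new_families:
    party.families.append(family)
```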

@calum-chamberlain
Member

@flixha - do you make use of the group_size argument to split your large template sets into groups for matched-filtering? I am starting to work on changing how the different steps of the process are executed to try and make better use of cpu time.

As you well know, a lot of the operations before getting to the correlation stage are done in serial, and a lot of them do not make sense to port to parallel processing. What I am thinking about is implementing something similar to how seisbench uses either asyncio or multiprocessing to process the next chunk of data while other operations are taking place. See their baseclass for an example of how they implement this. I don't think using asyncio would make sense for EQcorrscan because most of the slow points are not I/O bound, so using multiprocessing makes more sense. By doing this we should spend less time in purely serial code, and should see speed-ups (I think). In part this is motivated by me getting told off by HPC maintainers for not hammering their machines enough!

I'm keen to do this and there are two options I see:

  1. Implementing at a group level, such that the next group's steps are processed in advance. This would work on the same chunk of data, so you would only see speed-ups for large template sets.
  2. Implementing at a higher level across chunks of data, possibly in Tribe.client_detect, such that the next chunk of data is downloaded and processed in advance (a rough sketch of this option follows below).

I'm not sure that these two options are mutually exclusive, but it would take some thinking to make sure that I'm not trying to create child processes from child processes.
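A minimal sketch of option 2 under stated assumptions: get_chunk is a hypothetical helper standing in for downloading and pre-processing one chunk of continuous data, and detect_kwargs carries the usual threshold/trig_int arguments to the existing Tribe.detect:

```python
from concurrent.futures import ProcessPoolExecutor


def get_chunk(i):
    """Hypothetical stand-in: download and pre-process the i-th data chunk."""
    ...


def detect_over_chunks(n_chunks, tribe, detect_kwargs):
    """Eagerly prepare chunk i+1 in a child process while chunk i is being
    correlated in the parent."""
    parties = []
    with ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(get_chunk, 0)
        for i in range(n_chunks):
            st = future.result()  # block until the prepared chunk is ready
            if i + 1 < n_chunks:
                future = executor.submit(get_chunk, i + 1)  # prefetch next chunk
            parties.append(tribe.detect(stream=st, **detect_kwargs))
    return parties
```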

Keen to hear your thoughts. I think this would represent a fairly significant refactor of the code in Tribe and matched_filter so I was hoping to work on this once we have merged all of your speed-ups to avoid conflicts.

@flixha
Collaborator Author

flixha commented Mar 16, 2023

Hi @calum-chamberlain ,
yes, I do use group_size in more or less all the bigger problem sets, as I'd otherwise run out of memory. For a set of 15k templates I was using group_size=1570 (i.e., 10 groups) with fast_matched_filter on 4x A100 (4x16 GB GPU memory, 384 GB system memory), while with fmf2 on 4x AMD MI-250X (4x128 GB GPU memory, 480 GB system memory) I was using group_size=5500 (3 groups; caveat: fmf and fmf2 may duplicate some memory unnecessarily across multiple GPUs, but that's a minor thing) - just some examples (I don't remember the biggest fftw examples right now). The preparation of the data before starting the correlations for a group does take quite a bit of time, so it's good to hear you're thinking about this.

Just some quick thoughts (may add more tomorrow when I'm fully awake ;-)):

  • For some of my problem sets where the channel selection varies quite a bit between templates, it would save time and memory to "smartly" group the templates into sets that share as many channels as possible - this means fewer NaN-channels have to be created / copied, which takes a good chunk of time (a naive grouping sketch follows after this list).
  • The bigger the group, the more NaN-channels per template may have to be created in memory, with the template with the most traces contributing considerably.
  • Asynchronous preparation of the data while correlations are running sounds like a good idea, but there are some caveats:
    • On Slurm clusters, I run array jobs to chunk the problem at the highest level (across time periods), so that I distribute the serial parts of the code across as many array tasks as possible. Usually I can fit at most 3 array tasks into one node, and mostly just one, due to the memory requirements.
    • So at some point, working at a higher level in Python rather than e.g. Slurm would require looking into parallelization across nodes / distributed memory.
    • So I'm a bit afraid that preparing an extra group while the current one is running may not pay off so well. It sounds tempting, especially when running correlations on GPUs, to make better use of the CPUs in the meantime, but the biggest "spike" in memory consumption for GPUs may be when the full correlation matrix is returned from GPU memory to main memory, so I'd rather not give away any memory at that point.
    • For my problems, preparing the arrays for correlation took probably about 2x-8x the time of local reading and preprocessing.
    • For multiple groups we spend a lot of time preprocessing traces and creating traces / arrays in memory that we then throw away again - there could be some gain in keeping preprocessed streams and arrays between groups if the processing parameters stay the same.
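A naive, hedged sketch of what channel-based grouping could look like - just a greedy heuristic over templates assumed to be obspy Streams, not what EQcorrscan currently does:

```python
from collections import defaultdict


def group_by_channels(templates, group_size):
    """Keep templates with identical channel sets in the same group where
    possible, so fewer NaN-padded channels are needed per group."""
    by_chans = defaultdict(list)
    for template in templates:
        chans = frozenset((tr.stats.station, tr.stats.channel) for tr in template)
        by_chans[chans].append(template)
    groups, current = [], []
    # Fill groups channel-set by channel-set, largest channel sets first
    for chans in sorted(by_chans, key=len, reverse=True):
        for template in by_chans[chans]:
            current.append(template)
            if len(current) == group_size:
                groups.append(current)
                current = []
    if current:
        groups.append(current)
    return groups
```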

@flixha
Collaborator Author

flixha commented Mar 16, 2023

  * On Slurm clusters, I run array jobs to chunk the problem at the highest level (across time periods), so that I distribute the serial parts of the code across as many array tasks as possible. Usually I can fit at most 3 array tasks into one node, and mostly just one, due to the memory requirements.

To specify: for the 15k templates I could only fit one array task per node, while for picking with the same tribe I could fit 2 or 3, depending on the node configuration.
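For illustration, the per-task chunking described above might look roughly like this inside the Python script that each Slurm array task runs; the dates, window length, and array range are placeholders:

```python
import os

from obspy import UTCDateTime

# Each Slurm array task works on its own window of days; the index comes
# from the scheduler (e.g. submitted with sbatch --array=0-11).
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
days_per_task = 30  # placeholder window length
day_length = 86400.0

start = UTCDateTime(2022, 1, 1) + task_id * days_per_task * day_length
end = start + days_per_task * day_length

# Read the tribe once per task, then loop day by day between start and end,
# calling tribe.detect / tribe.client_detect for each day.
```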

@calum-chamberlain
Member

Thanks for those points.

  • I thought we were grouping somewhat sensibly, but it might just be grouping based on processing rather than the most consistent channel sets - this should definitely be done.
  • I do similar with splitting chunks of time across nodes. However I have been chunking my datasets into ndays where ndays is typically total days / nodes rather than splitting day by day. In that regard, eagerly processing the next timestep of data would be fine because it would stay on the same node.
  • I have been wondering for a while about using dask's schedulers for working across clusters. At some point I would like to explore this more, but I think it might be worth making a new project that uses EQcorrscan. I haven't managed to get the schedulers to work how I expect them to work either.
  • Agree on duplicated work. When EQcorrscan started I did not anticipate either the scales of template sets that we would end up using it for (I thought a couple of hundred templates was big back in the day!), nor that we might use different parameters within a Tribe - the current wasteful logic is a result of that. I would really like to redesign EQcorrscan (less duplication, more time directly working on numpy arrays rather than obspy traces, making use of multithreading on GIL-releasing funcs, ..., the list goes on) but I don't know if I will ever have time.

@flixha
Collaborator Author

flixha commented Mar 17, 2023

* I thought we were grouping somewhat sensibly, but it might just be grouping based on processing rather than the most consistent channel sets - this should definitely be done.

I think it's only processing parameters right now, so indeed it would be worth changing that.

* I do similar with splitting chunks of time across nodes. However I have been chunking my datasets into `ndays` where `ndays` is typically `total days / nodes` rather than splitting day by day. In that regard, eagerly processing the next timestep of data would be fine because it would stay on the same node.

If I understand correctly, each node gets a task like "work on these 30 days", is that correct? Indeed that's how I do it; restarting the process for each day would be too costly because of reading the tribe and some other metadata (but maybe I misunderstood).

* I have been wondering for a while about using [dask's](https://docs.dask.org/en/stable/) schedulers for working across clusters. At some point I would like to explore this more, but I think it might be worth making a new project that uses EQcorrscan. I haven't managed to get the schedulers to work how I expect them to work either.

That's probably the more sensible way to go for now. As an additional point to this: right now it's easy for the user to do their own preparations on all the streams according to their processing before calling EQcorrscan's main detection functions; with dask, I imagine the order of things and how to do them would need some thought (set up / distribute multi-node jobs; let the user do some early processing; start preprocessing / detection functions).

* Agree on duplicated work. When EQcorrscan started I did not anticipate either the scales of template sets that we would end up using it for (I thought a couple of hundred templates was big back in the day!), nor that we might use different parameters within a Tribe - the current wasteful logic is a result of that. I would really like to redesign EQcorrscan (less duplication, more time directly working on numpy arrays rather than obspy traces, making use of multithreading on GIL-releasing funcs, ..., the list goes on) but I don't know if I will ever have time.

I totally understand, and I think it wasn't so easy to anticipate all that. And now I'm really happy that EQcorrscan has gotten more and more mature, especially at handling all kinds of edge cases that could mess things up earlier. Using pure numpy arrays would be very nice, but it would also have made debugging all the edge cases harder.

Just some more material for thought: I'm attaching two cProfile output files from the 1-day-per-node run that I described above. For a set of 15k templates, for 1 day of data:

  1. with `fast_matched_filter` on 4x A100 (4x16 GB GPU memory, 384 GB system memory), group_size=1570 (i.e., 10 groups): detect_15500templates_1_fmf.profile.log
  2. with `fmf2` on 4x AMD MI-250X (4x128 GB GPU memory, 480 GB system memory): detect_15500templates_2_fmf2.profile.log

(Screenshot 2023-03-17: snakeviz visualization of the profile)

With `snakeviz` this looks like the screenshot above and gives some ideas of where a lot of time could be gained (let's ignore tribe reading because it only needs to happen once for multiple days):

  • `utcdatetime` - 4 functions in the top 13 in terms of `tottime` (actual time spent in the function itself)
  • some of the duplicated preprocessing (even though that's not so obvious from these plots / logs)
  • working directly on numpy arrays instead of traces would work around a lot of this, of course...

@calum-chamberlain
Member

As you have already identified, it looks like a lot of time is spent copying nan-channels. Hopefully more efficient grouping in #541 will help reduce this time-sink.

You did understand me correctly with the grouping of days into chunks. Using this approach we could, for example, have a workflow that for each step puts the result in a multiprocessing Queue that the next step queries as its input. This could increase parallelism by letting the necessarily serial components (getting the data, doing some of the prep work, ...) run concurrently with other tasks, with those other tasks ideally implementing parallelism using GIL-releasing threading (e.g. #540), openmp parallelism (e.g. our fftw correlations, and fmf cpu correlations) or gpu parallelism (e.g. fmf or your fmf2).
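A minimal producer/consumer sketch of that Queue idea, assuming hypothetical stand-ins get_stream, prep, and correlate for the serial retrieval/prep steps and the internally-parallel correlation stage:

```python
from multiprocessing import Process, Queue


def producer(n_chunks, out_queue):
    # Necessarily serial steps: get the data, do the prep work, hand it on.
    for i in range(n_chunks):
        st = get_stream(i)       # hypothetical: read/download chunk i
        out_queue.put(prep(st))  # hypothetical: pre-process for correlation
    out_queue.put(None)          # sentinel: no more work


def consumer(in_queue):
    # Correlation stage: internally parallel (GIL-releasing threads,
    # openmp fftw/fmf correlations, or GPU kernels).
    while True:
        prepared = in_queue.get()
        if prepared is None:
            break
        correlate(prepared)      # hypothetical stand-in for the heavy step


if __name__ == "__main__":
    queue = Queue(maxsize=1)  # small queue bounds memory use
    worker = Process(target=producer, args=(10, queue))
    worker.start()
    consumer(queue)
    worker.join()
```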
