sequential_distribute_dump fails #147
Comments
@ninnghazad it wouldn't be too hard to store the information in a netcdf file. Most of the data is either numpy arrays or strings. Maybe boundary_map is a dictionary. See line 200 of sequential_distribute.py in anuga/parallel, where the data to be stored is accumulated into an object called tostore, which we then pickle around line 242 and load around line 272. We could make tostore into a class that collects the info and has alternative ways to store and load, i.e. pickle, netcdf, or a combination of numpy save and pickle for the strings.
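For illustration, a minimal sketch of that idea (the ToStore class name, file layout and dump/load API here are assumptions, not ANUGA's actual code): the arrays go into an .npz and only the small leftovers (strings, the boundary_map dict) get pickled.

```python
import pickle
import numpy as np

class ToStore(object):
    """Container for per-processor data with pluggable storage backends."""

    def __init__(self, **items):
        self.items = dict(items)

    def dump(self, basename, use_numpy=True):
        arrays = {k: v for k, v in self.items.items() if isinstance(v, np.ndarray)}
        rest = {k: v for k, v in self.items.items() if k not in arrays}
        if use_numpy:
            # Arrays go to .npz, which avoids building one huge pickle string.
            np.savez(basename + '.npz', **arrays)
            with open(basename + '.pickle', 'wb') as f:
                pickle.dump(rest, f, protocol=2)
        else:
            # Original behaviour: pickle everything in one go.
            with open(basename + '.pickle', 'wb') as f:
                pickle.dump(self.items, f, protocol=2)

    @classmethod
    def load(cls, basename):
        with open(basename + '.pickle', 'rb') as f:
            items = pickle.load(f)
        try:
            with np.load(basename + '.npz') as npz:
                items.update(dict(npz))
        except IOError:
            pass  # plain-pickle dump, nothing to merge
        return cls(**items)
```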
I am currently trying a combination of numpy.save to separate files and cPickle for the rest. I have to do some bigger tests to see if that is enough for me. However, other parts of the subdivide/dump code seem to not like big meshes either, like pmesh_divide_metis_with_map. Has anybody successfully done simulations using large (as in > 100 million triangles) meshes before? EDIT: Just noticed there also is #15
@ninnghazad what system are you working on? I am sure we have had a user run with 50M triangles, but I thought he had also gone higher.
@stoiver it's some custom numbercruncher: two Xeon E5 18c/36t CPUs, 512 GB of memory, InfiniBand and whatnot. It's running Debian on a 4.9 kernel and is a headless system. My goal is to get above 500 million triangles to simulate strong rainfall over a medium-sized city.
@ninnghazad sounds like an interesting project. It would be great to ensure ANUGA can run on such large problems. You have a large machine, but my guess is that you will need to run such large problems on the NCI machine. But I am happy to help. An obvious first step would be to change over to METIS 5 as suggested in issue #15.
@stoiver it surely is. ANUGA could be a rather big thing around here, given its flexibility as well as it being public open source. I'll see how it goes, but for a start I want to test the whole thing using Open MPI without the added hassle of networking and go from there. I assume a simulation working in local parallel is easier to migrate to other nodes than one where I have no idea if the problem is the simulation or the setup. Anyway, I appreciate you taking the time to talk about this. The original reason for this issue I seem to have circumvented by using numpy's save(), but I'll test more before considering that solved and possibly uploading the rather naive changes I made. About #15, I am actually trying to shoehorn the latest pymetis into distribute_mesh.py. I am still testing, and the way I did it might be too simple, but first results look like it might work. I had to convert domain.neighbours from an ndarray to a list copy though, and that might be a rather large amount of not really needed memory usage. Back to testing.
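A minimal sketch of the pymetis route (not the actual distribute_mesh.py change), assuming domain.neighbours is the usual ANUGA (N, 3) array of neighbouring triangle ids with negative entries marking boundary edges:

```python
import numpy as np
import pymetis

def partition_triangles(neighbours, nparts):
    # pymetis wants a plain adjacency list, so the ndarray gets copied into
    # Python lists - the extra memory use mentioned above.
    adjacency = [[int(n) for n in row if n >= 0] for row in neighbours]
    cuts, membership = pymetis.part_graph(nparts, adjacency=adjacency)
    return np.array(membership)

# e.g. triangle_to_proc = partition_triangles(domain.neighbours, 36)
```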
@ninnghazad, I had a look at pymetis; it is probably a good way to go, but an alternative would be to use ctypes with a standard installation of METIS 5.1. ctypes was not available back when we started the parallel version in 2005.
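A rough sketch of the ctypes route against METIS 5.1, assuming a libmetis.so built with 32-bit idx_t (adjust the dtypes and ctypes types for a 64-bit build):

```python
import ctypes
import numpy as np

libmetis = ctypes.CDLL("libmetis.so")

def part_graph_kway(xadj, adjncy, nparts):
    """Partition a graph given in CSR form (xadj/adjncy) into nparts parts."""
    idx = np.int32
    xadj = np.ascontiguousarray(xadj, dtype=idx)
    adjncy = np.ascontiguousarray(adjncy, dtype=idx)
    nvtxs = ctypes.c_int32(len(xadj) - 1)
    ncon = ctypes.c_int32(1)
    nparts_c = ctypes.c_int32(nparts)
    objval = ctypes.c_int32(0)
    part = np.zeros(len(xadj) - 1, dtype=idx)
    ret = libmetis.METIS_PartGraphKway(
        ctypes.byref(nvtxs), ctypes.byref(ncon),
        xadj.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
        adjncy.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
        None, None, None,                 # vwgt, vsize, adjwgt: defaults
        ctypes.byref(nparts_c),
        None, None, None,                 # tpwgts, ubvec, options: defaults
        ctypes.byref(objval),
        part.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)))
    assert ret == 1                       # METIS_OK
    return part
```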
some diff
@stoiver this works for me, but it's kinda meh. I also didn't look into domain.quantities to see if there are more ndarrays. I would assume the above would still be problematic if one did not, as I do, set quantities after the distributed mesh is loaded from cache (so quantities aren't all saved in the cache files, making them smaller). I have not yet looked into the save-as-netcdf way you mentioned. However, a 120 million triangle test worked (see #15). It took ~210 s of wall time per 1 s of simulation time on a wet start and with a bit of rain. Merging the sww files at the end of the simulation isn't quite working yet, but sww_merge.py did the job, so it's probably just some usage error. And some stats (1 s yields):
That's a lot of communication overhead right there; I suspect my Open MPI setup might be faulty, but that's another issue.
@ninnghazad great work. I'll check out your #15 work. But a comment on your timings: I have found that there is not much advantage to using more than the number of physical cores when running in parallel. So I would suggest trying something like 36 processors and see what timings you get.
@ninnghazad, by the way, can you provide some more info on your project? It would be good to be able to point to projects using these large grids.
@stoiver The project just started and will probably take at least a few months to finish. There will be some kind of publication, but I can't yet say if all of the resulting data will be public. It entails multiple simulations for a whole city. I will however make sure ANUGA is prominently named in whatever publications or reports there will be, and get back to you with any material I am allowed to share once the project is further along. Apart from actual results I am of course happy to share the how-tos and tips I can give. But there are still ways to go and stuff to figure out before I would advise anybody to use my run.py and setup. I just ran into some problem somewhere above 120 million triangles when doing:
On the other hand, converting the 120-million-triangle results to raster (GeoTIFF) for viewing was slow but worked.
@ninnghazad, I was just curious about your project. There is no expectation for you to make the data public. But updates to the code would be much appreciated!
@stoiver the goal of the project is to create risk-assessment maps for Dortmund, a city of about 280 km² in western Germany, incorporating local pumping stations and surface friction as well as different time/rain sets, using a resolution of at most 1 m².
@stoiver by the way, for updates to the code, do you prefer PRs or diffs?
@ninnghazad our preferred method is PRs. Thanks for the info about your project.
@ninnghazad Hello, could you please tell us how you overcame the issues of large mesh generation and domain creation, which is a sequential step? We are also trying the same.
@Girishchandra-Yendargaye
This method is not optimal and could easily be optimized, for example by using numpy's memmap format instead of JSON as the intermediary. Also remember that, depending on what you are simulating, calculating on domains that large will take a lot of time. A lot of fast CPUs (preferably with AVX-512) and properly optimized Python/numpy help. Be prepared to wait weeks for simulations to finish.
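A small sketch of the memmap idea (file names and dtypes are assumptions, not the project's actual format): writer and reader only need to agree on dtype and shape, so neither side has to hold a text copy of the whole mesh in memory.

```python
import numpy as np

# Toy data standing in for a real mesh.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Writer side: create the file-backed array and fill it.
out = np.memmap('mesh_0.points.dat', dtype='float64', mode='w+', shape=points.shape)
out[:] = points
out.flush()

# Reader side: map the file back in without loading it all at once.
pts = np.memmap('mesh_0.points.dat', dtype='float64', mode='r', shape=points.shape)
```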
Can you please provide some sample code? I didn't find such a method in shallow_water that accepts only triangles, points and a boundary. Also, what about the elevation setting in the next step, which is sequential again?
To build the domain I generate (custom) JSON and use it like this:
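A hedged reconstruction of that approach (file names and the boundary-entry layout are assumptions based on the comments below):

```python
import json
import numpy as np
import anuga

# Load the three JSON files into numpy arrays; the layout is assumed to be
# plain nested lists, with triangles/boundaries holding 0-based point indices.
with open('mesh_0.points.json') as f:
    points = np.array(json.load(f), dtype='float64')      # (N, 2) x/y pairs
with open('mesh_0.triangles.json') as f:
    triangles = np.array(json.load(f), dtype='int64')     # (M, 3) point indices
with open('mesh_0.boundaries.json') as f:
    boundaries = json.load(f)                             # assumed [[tri, edge, tag], ...]

# The shallow water Domain takes points, triangles and a boundary dict
# keyed by (triangle id, edge id).
boundary = {(int(t), int(e)): str(tag) for t, e, tag in boundaries}
domain = anuga.Domain(points, triangles, boundary)
```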
I do the above sequentially and have one big "master domain", which I then split into however many pieces I need in parallel. When the parallel simulation is done, I merge the pieces again sequentially because of RAM. To apply heights from a GeoTIFF to the domain I use something like this:
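A rough sketch of that step (not the author's actual code), assuming an existing domain object and a GDAL-readable GeoTIFF in the same coordinate system as the domain:

```python
import numpy as np
from osgeo import gdal

# Read the raster once; 'elevation.tif' is a placeholder name.
ds = gdal.Open('elevation.tif')
gt = ds.GetGeoTransform()               # (x0, dx, 0, y0, 0, dy) for north-up rasters
band = ds.GetRasterBand(1).ReadAsArray()

def elevation(x, y):
    # Map world coordinates to raster row/column and look up the cell values;
    # assumes domain coordinates are absolute, not offset by a geo_reference.
    col = np.clip(((x - gt[0]) / gt[1]).astype(int), 0, band.shape[1] - 1)
    row = np.clip(((y - gt[3]) / gt[5]).astype(int), 0, band.shape[0] - 1)
    return band[row, col]

# ANUGA evaluates the function at the quantity's vertex/centroid coordinates.
domain.set_quantity('elevation', function=elevation)
```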
So the elevation setting can be sequential, but it doesn't have to be; I have run it like that on about 70 cores in parallel. It's not optimized, but as the time it takes is tiny compared to the simulation itself, I didn't mind. Please remember that these are just pieces of code I grabbed out of some of my projects; this is not official ANUGA code and I am not responsible if it burns down your datacenter. Oh, and the JSON format is as simple as can be:
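A tiny illustration, assuming the three-file layout and 0-based indexing described in the following comments:

```python
import json

# Illustrative files for a one-triangle mesh (an assumed layout, not the
# project's real data): triangle and boundary entries hold 0-based indices
# into the points file.
with open('mesh_0.points.json', 'w') as f:
    json.dump([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]], f)
with open('mesh_0.triangles.json', 'w') as f:
    json.dump([[0, 1, 2]], f)
with open('mesh_0.boundaries.json', 'w') as f:
    json.dump([[0, 0, "exterior"], [0, 1, "exterior"], [0, 2, "exterior"]], f)
```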
But you could use anything that you can parse and convert to numpy arrays really.
@ninnghazad Thank you for this information. I assume the lines in mesh_0.triangles.json and mesh_0.boundaries.json are index numbers into mesh_0.points.json, starting at 0, correct?
@Girishchandra-Yendargaye
@ninnghazad I am getting the below error. What does it mean? RuntimeError: Error at anuga\abstract_2d_finite_volumes\neighbour_mesh_ext.c:55: Am I missing something?
I would have to see some data and code to tell you more than the error does. Glancing at the source tells me you are trying to construct a boundary with bad parameters.
@ninnghazad Have you tried 12 crore (120 million) triangles?
sequential_distribute_dump fails for large domains in cPickle with:
Probably related:
numpy/numpy#2396
https://bugs.python.org/issue11564
https://stackoverflow.com/questions/34091717/avoid-cpickle-error-alternatives-to-cpickle-on-python-2-7
As using Python 3 isn't easily done, it seems, I am unable to save large domains (roughly > 100,000,000 triangles), which is rather annoying when trying multiple variants on the same large mesh.
My only advice to other users at this moment would be to not use sequential_distribute_dump to save the combined domain and the parts, but only the partial (per-process) domains; these might just be small enough to work.
As for how to circumvent the error without upgrading ANUGA to Python 3, using numpy.save()/numpy.savez() is the most often suggested solution (I am not sure if ANUGA uses numpy arrays for its fields internally).