ifgram_inversion: dask failure with large number of workers #518
Comments
Hi! I'm not sure I understood your problem correctly, so apologies if not. Stampede's CPU, the Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz, has 24 cores and 48 threads per processor (you can check it here). With two processors you have 96 threads in total, which is what Linux reports under /proc/cpuinfo, so dask reporting 96 threads seems to be fine. |
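To make the difference between threads and cores concrete, here is a minimal sketch that counts both from /proc/cpuinfo-style text. It assumes the x86 Linux layout, where each logical CPU is a blank-line-separated block with "processor", "physical id" and "core id" fields; the function name count_from_cpuinfo is hypothetical and not part of mintpy or dask.

```python
def count_from_cpuinfo(text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text.

    Assumption: every logical CPU block carries 'physical id' (socket)
    and 'core id' fields, as on typical x86 Linux systems.
    """
    logical = 0
    cores = set()
    physical_id = core_id = None
    # The extra "" guarantees the last block is flushed.
    for line in text.splitlines() + [""]:
        if not line.strip():
            # A blank line closes one logical-CPU block.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    return logical, len(cores)


if __name__ == "__main__":
    # Only meaningful on Linux, where /proc/cpuinfo exists.
    with open("/proc/cpuinfo") as f:
        print(count_from_cpuinfo(f.read()))
```

Two hyperthreads of the same core share one (physical id, core id) pair, so they are counted once in the second number.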
Yes, it reports 96 processors. I am not sure whether it should report this number but ifgram_inversion.py fails. In contrast, if I say |
I'm confused. |
@falkamelung It'll be much simpler for you to just set mintpy.compute.numWorker to 48 and not touch it anymore. Additional searches have not turned up anything that will pull just the number of cores instead of the CPU count. |
Hi @Ovec8hkin, are you saying you did not find a command to retrieve the 'Thread(s) per core' from the system? As I said, we could just run os.system('lscpu'), but something pythonic would be preferred. If there is no immediate solution, let me first try a few more systems to see whether os.system('lscpu') would indeed solve the problem. |
There is no way to get the values you want in pure python. You can get either the total number of CPUs (96) via |
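For reference, a minimal sketch of the counts Python can report. os.cpu_count() and os.sched_getaffinity() are standard library calls (the latter only on some Unix platforms). psutil is a third-party package, so using it here is an assumption and not something this thread settled on. The helper names are hypothetical.

```python
import os


def logical_cpu_count():
    """Logical CPUs seen by the OS (hardware threads), e.g. 96 on the node above."""
    return os.cpu_count() or 1


def usable_cpu_count():
    """CPUs this process may run on; can be lower under a batch scheduler."""
    if hasattr(os, "sched_getaffinity"):  # not available on every platform
        return len(os.sched_getaffinity(0))
    return logical_cpu_count()


def physical_core_count():
    """Physical cores via psutil if it is installed, else None.

    Assumption: psutil is an optional, third-party dependency.
    psutil itself may also return None when it cannot determine the value.
    """
    try:
        import psutil
    except ImportError:
        return None
    return psutil.cpu_count(logical=False)


if __name__ == "__main__":
    print("logical :", logical_cpu_count())
    print("usable  :", usable_cpu_count())
    print("physical:", physical_core_count())
```

None of the standard library calls distinguishes hardware threads from cores, which is consistent with the statement above.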
Parsing 'lscpu' via |
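As an illustration of what parsing lscpu could look like, here is a hedged sketch. It assumes lscpu prints "Key: value" lines that include "Socket(s)" and "Core(s) per socket", and it falls back to os.cpu_count() when lscpu is missing or those fields are absent. All function names are hypothetical and this is not the code that was added to mintpy.

```python
import os
import shutil
import subprocess


def parse_lscpu(text):
    """Turn the 'Key: value' lines of lscpu output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def physical_cores_from_lscpu(text):
    """Return sockets * cores-per-socket, or None if the fields are missing."""
    info = parse_lscpu(text)
    try:
        return int(info["Socket(s)"]) * int(info["Core(s) per socket"])
    except (KeyError, ValueError):
        return None


def get_num_physical_cores():
    """Physical core count from lscpu, else the logical count from os.cpu_count()."""
    if shutil.which("lscpu"):
        # LC_ALL=C keeps the field names in English regardless of locale.
        env = {**os.environ, "LC_ALL": "C"}
        result = subprocess.run(["lscpu"], capture_output=True, text=True,
                                env=env, check=False)
        num_core = physical_cores_from_lscpu(result.stdout)
        if num_core:
            return num_core
    return os.cpu_count()


if __name__ == "__main__":
    print(get_num_physical_cores())
```

Keeping the parsing separate from the subprocess call means the text handling can be checked without lscpu installed, which addresses part of the concern about relying on command output.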
@yunjunz Thoughts on this? I would really hesitate at parsing the output of |
Are you sure the |
Using |
Using Send me a code block that pulls the correct data from the |
Hi @falkamelung and @Ovec8hkin, I have implemented |
Thank you. Just to let you know, dask still has a problem. It normally does not work for me to use |
@falkamelung Could it be that you run out of memory when you have many workers? Any idea how much memory you have on that machine and how much mintpy is using with 36 or 48 workers? |
I don't think it is a memory problem. Here is the error message that I get when I run a small dataset with 48 numWorkers. This dataset works fine with numWorkers=6. Below is the output from the
|
The Linux server I was using recently has 16 cores and two threads per core. It ran successfully with 32 parallel dask processes. It was using files on a ZFS file system that is directly connected. I did not check how much memory was used, but the machine has 128 GB of RAM. |
Thank you Falk and Eric for the info. This means we still have not located the cause yet. The "conflict during multiple HDF5 writing processes" as described in #692 still sounds like a reasonable guess to me. Update: multiple HDF5 writing processes do not exist in mintpy, thus, that is not the cause. We may revert the |
All the other errors, e.g. To better handle this situation, I added the support of |
Description of the problem
I am running ifgram_inversion.py using mintpy.compute.numWorker=all on stampede2 (48 cores per node, each 2 threads) and it thinks there are 96 cores:
This value is returned by the python num_core = os.cpu_count() command in mintpy/objects/cluster.py. So we need another command to figure out the number of threads and divide. The lscpu command does return the proper amount of cores and threads on both Stampede2 and Frontera. I would expect that there is an equivalent in python.
Stampede:
Frontera: