
No output files generating while using mpirun with JAX #477

Open · Sougata18 opened this issue Jun 1, 2023 · 18 comments

@Sougata18

[screenshot]
I am trying a Veros run with MPI + JAX. While the model run is in progress, no output files are being generated and no progress is shown in the stdout file. Is the model run stuck?

@dionhaefner (Collaborator)

Yes, unfortunately it looks that way. Could you try to add --loglevel trace to see where it gets stuck, then post the output here?
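
For reference, a minimal sketch of what such an invocation could look like, assuming the usual veros run options (-b/--backend, -n/--num-proc; check veros run --help on your install). The setup file name, rank counts, and logfile name are placeholders:

```bash
# Hypothetical example: 16 MPI ranks on a 4 x 4 domain decomposition,
# JAX backend, trace-level logging captured in a file for posting here.
mpirun -np 16 veros run my_setup.py -b jax -n 4 4 --loglevel trace > run_trace.log 2>&1
```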

@dionhaefner (Collaborator)

Also, could you post the output of pip freeze?

@Sougata18 (Author)

> Yes, unfortunately it looks that way. Could you try to add --loglevel trace to see where it gets stuck, then post the output here?

I started the run again. It's now working and generating output files as well.
[screenshot]

Another question I had: how do I get the expected time of completion or the progress of the model run while using MPI? When I'm not using MPI, it shows the model progress as well as the expected time of completion.

@Sougata18 (Author)

> Also, could you post the output of pip freeze?

backports.ssl-match-hostname==3.5.0.1
blivet==0.61.15.72
Brlapi==0.6.0
chardet==2.2.1
configobj==4.7.2
configshell-fb==1.1.23
coverage==3.6b3
cupshelpers==1.0
decorator==3.4.0
di==0.3
dnspython==1.12.0
enum34==1.0.4
ethtool==0.8
fail2ban==0.11.2
firstboot==19.5
fros==1.0
futures==3.1.1
gssapi==1.2.0
idna==2.4
iniparse==0.4
initial-setup==0.3.9.43
ipaddress==1.0.16
IPy==0.75
javapackages==1.0.0
kitchen==1.1.1
kmod==0.1
langtable==0.0.31
lxml==3.2.1
mysql-connector-python==1.1.6
netaddr==0.7.5
netifaces==0.10.4
nose==1.3.7
ntplib==0.3.2
numpy==1.16.6
ofed-le-utils==1.0.3
pandas==0.24.2
perf==0.1
policycoreutils-default-encoding==0.1
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycups==1.9.63
pycurl==7.19.0
pygobject==3.22.0
pygpgme==0.3
pygraphviz==1.6.dev0
pyinotify==0.9.4
pykickstart==1.99.66.19
pyliblzma==0.5.3
pyparsing==1.5.6
pyparted==3.9
pysmbc==1.0.13
python-augeas==0.5.0
python-dateutil==2.8.1
python-ldap==2.4.15
python-linux-procfs==0.4.9
python-meh==0.25.2
python-nss==0.16.0
python-yubico==1.2.3
pytoml==0.1.14
pytz==2016.10
pyudev==0.15
pyusb==1.0.0b1
pyxattr==0.5.1
PyYAML==3.10
qrcode==5.0.1
registries==0.1
requests==2.6.0
rtslib-fb==2.1.63
schedutils==0.4
scikit-learn==0.20.4
scipy==1.2.3
seobject==0.1
sepolicy==1.1
setroubleshoot==1.1
six==1.9.0
sklearn==0.0
slip==0.4.0
slip.dbus==0.4.0
SSSDConfig==1.16.2
subprocess32==3.2.6
targetcli-fb===2.1.fb46
torch==1.4.0
urlgrabber==3.10
urllib3==1.10.2
urwid==1.1.1
yum-langpacks==0.4.2
yum-metadata-parser==1.1.4

@Sougata18 (Author)

What does this line mean?
"export OMP_NUM_THREADS=1"
When I set the number to 8, it shows an MPI error.

@dionhaefner (Collaborator)

> Another question I had: how do I get the expected time of completion or the progress of the model run while using MPI? When I'm not using MPI, it shows the model progress as well as the expected time of completion.

Veros should print progress updates even when using MPI. They may just be drowned out by the trace output. You can switch back to normal verbosity if things are working now.

> What does this line mean?
> "export OMP_NUM_THREADS=1"
> When I set the number to 8, it shows an MPI error.

This sets the number of threads used by some packages we rely on (like the SciPy solvers). Since you are using MPI for multiprocessing you shouldn't use more than 1 thread per processor.
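
As an illustration (the rank counts and setup file below are placeholders), the usual pattern is one thread per MPI rank:

```bash
# One OpenMP/BLAS thread per MPI rank; asking for 8 threads on top of a
# full set of MPI ranks oversubscribes the cores, which is a likely
# source of the error you saw.
export OMP_NUM_THREADS=1
mpirun -np 64 veros run my_setup.py -b jax -n 8 8
```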

@Sougata18 (Author)

Thanks!
I had one doubt regarding the model run status:
Current iteration: 3706 (0.71/1.00y | 42.9% | 4.78h/(model year) | 1.4h left)
What does "4.78h/(model year)" mean?

@dionhaefner (Collaborator)

Every simulated year takes 4.78 hours of real time.
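(That matches the rest of the status line: about 1.00 − 0.71 = 0.29 model years remain, and 0.29 × 4.78 h per model year ≈ 1.4 h left.)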

@Sougata18 (Author)

Here I'm trying to use 256 cores (16 nodes and 16 tasks per node):
[screenshot]

but the following error persists:
ValueError: processes do not divide domain evenly in x-direction

@dionhaefner (Collaborator)

360 (number of grid cells) isn't divisible by 16 (number of processors).
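
As a sketch, pick per-direction process counts that divide the grid. The y decomposition below is only a placeholder and must likewise divide your setup's y grid size; the total rank count passed to mpirun must equal nx * ny:

```bash
# 360 / 24 = 15, so 24 ranks in x divide the domain evenly.
# 10 ranks in y is an example only; use a divisor of your setup's ny.
mpirun -np 240 veros run my_setup.py -b jax -n 24 10
```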

@Sougata18 (Author)

Sougata18 commented Jun 3, 2023

My run got stuck for some reason, but the restart file was there with around 1.5 years of data. When I started the run again, it should have continued from the point where it last wrote to the restart file, right? But the model is running from the initial time, i.e., the 0th year.
[screenshot]

I used this code for the model run:
[screenshot]

@dionhaefner (Collaborator)

With veros resubmit, you can only restart from completed runs. If frequent crashes are a problem, I would recommend shortening each run to 1 year or so and scheduling more of them.
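
As a rough sketch, assuming the veros resubmit options documented for cluster runs (-i/--identifier, -n/--num-runs, -l/--length-per-run in seconds, -c for the run command, --callback for rescheduling; please check veros resubmit --help). The identifier, run length, rank counts, setup file, and batch script name are placeholders:

```bash
# Hypothetical: 10 consecutive runs of roughly one model year (360 days,
# in seconds) each; every run restarts from the restart file written by
# the previous completed run.
veros resubmit -i my_run -n 10 -l 31104000 \
    -c "srun --mpi=pmi2 -- veros run my_setup.py -b jax -n 4 4" \
    --callback "sbatch veros_batch.sh"
```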

@Sougata18 (Author)

[screenshot]
This MPI error persists. Can you please check?

@dionhaefner (Collaborator)

I can't debug this without further information. Please dump a logfile with --loglevel trace and ideally also export MPI4JAX_DEBUG=1.

I also suggest you get in contact with your cluster support about how MPI should be called. For example, whether --mpi=pmi2 is the correct flag.
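
For example, something along these lines inside the batch script (the srun flags are copied from your current script and may need to change based on your admins' advice; the setup file and decomposition are placeholders and must match your allocation):

```bash
# Turn on mpi4jax debug output and Veros trace logging, and capture
# everything in a logfile that can be attached to this issue.
export MPI4JAX_DEBUG=1
srun --mpi=pmi2 -- veros run my_setup.py -b jax -n 24 10 --loglevel trace > debug.log 2>&1
```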

@Sougata18 (Author)

Sure! I'll ask the admin.
Here are both the output and error files.
output_and_error.zip

@dionhaefner (Collaborator)

Thanks, this is useful. Looks like this may be a problem on our end with mismatched MPI calls. I'll keep looking.

@dionhaefner (Collaborator)

dionhaefner commented Jun 13, 2023

Actually it looks to me like the MPI calls are correctly matched, it's just that one rank (r33) stops responding for some reason. Unfortunately this will be almost impossible to debug for me. I suggest you talk to your cluster support about it. In the meantime, here are some things you could try as a workaround:

  • Use a different MPI library (if available)
  • Use fewer MPI ranks and hope that fixes it
  • Only use one type of node (I noticed you're running on some nodes called cn and some called gpu; mixing different architectures may be a cause here)
  • Since you seem to have access to GPU nodes, you could try running Veros on a few GPUs and probably get similar performance to many CPUs.

Hope that helps.

@Sougata18 (Author)

Thanks! I will try these and let you know.
