
No output files generating while using mpirun with JAX #477

Open · Sougata18 opened this issue Jun 1, 2023 · 18 comments

@Sougata18

[screenshot]
I am trying a Veros run with MPI + JAX. While the model run is in progress, no output files are being generated and no progress is shown in the stdout file. Is the model run stuck?

@dionhaefner (Collaborator)

Yes, unfortunately it looks that way. Could you try to add --loglevel trace to see where it gets stuck, then post the output here?
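
For reference, a minimal sketch of what such an invocation could look like, assuming the usual veros run options (-b/--backend, -n/--num-proc; check veros run --help on your install). The setup file name, rank counts, and logfile name are placeholders:

```bash
# Hypothetical example: 16 MPI ranks on a 4 x 4 domain decomposition,
# JAX backend, trace-level logging captured in a file for posting here.
mpirun -np 16 veros run my_setup.py -b jax -n 4 4 --loglevel trace > run_trace.log 2>&1
```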

@dionhaefner (Collaborator)

Also, could you post the output of pip freeze?

@Sougata18 (Author)

> Yes, unfortunately it looks that way. Could you try to add --loglevel trace to see where it gets stuck, then post the output here?

I started the run again. It's now working and generating output files as well.
[screenshot]

Another question I had: how do I get the expected time of completion or the progress of the model run while using MPI? When I'm not using MPI, it shows the model progress as well as the expected time of completion.

@Sougata18 (Author)

> Also, could you post the output of pip freeze?

backports.ssl-match-hostname==3.5.0.1
blivet==0.61.15.72
Brlapi==0.6.0
chardet==2.2.1
configobj==4.7.2
configshell-fb==1.1.23
coverage==3.6b3
cupshelpers==1.0
decorator==3.4.0
di==0.3
dnspython==1.12.0
enum34==1.0.4
ethtool==0.8
fail2ban==0.11.2
firstboot==19.5
fros==1.0
futures==3.1.1
gssapi==1.2.0
idna==2.4
iniparse==0.4
initial-setup==0.3.9.43
ipaddress==1.0.16
IPy==0.75
javapackages==1.0.0
kitchen==1.1.1
kmod==0.1
langtable==0.0.31
lxml==3.2.1
mysql-connector-python==1.1.6
netaddr==0.7.5
netifaces==0.10.4
nose==1.3.7
ntplib==0.3.2
numpy==1.16.6
ofed-le-utils==1.0.3
pandas==0.24.2
perf==0.1
policycoreutils-default-encoding==0.1
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycups==1.9.63
pycurl==7.19.0
pygobject==3.22.0
pygpgme==0.3
pygraphviz==1.6.dev0
pyinotify==0.9.4
pykickstart==1.99.66.19
pyliblzma==0.5.3
pyparsing==1.5.6
pyparted==3.9
pysmbc==1.0.13
python-augeas==0.5.0
python-dateutil==2.8.1
python-ldap==2.4.15
python-linux-procfs==0.4.9
python-meh==0.25.2
python-nss==0.16.0
python-yubico==1.2.3
pytoml==0.1.14
pytz==2016.10
pyudev==0.15
pyusb==1.0.0b1
pyxattr==0.5.1
PyYAML==3.10
qrcode==5.0.1
registries==0.1
requests==2.6.0
rtslib-fb==2.1.63
schedutils==0.4
scikit-learn==0.20.4
scipy==1.2.3
seobject==0.1
sepolicy==1.1
setroubleshoot==1.1
six==1.9.0
sklearn==0.0
slip==0.4.0
slip.dbus==0.4.0
SSSDConfig==1.16.2
subprocess32==3.2.6
targetcli-fb===2.1.fb46
torch==1.4.0
urlgrabber==3.10
urllib3==1.10.2
urwid==1.1.1
yum-langpacks==0.4.2
yum-metadata-parser==1.1.4

@Sougata18 (Author)

What does this line mean?
"export OMP_NUM_THREADS=1"
When I set the number to 8, it shows an MPI error.

@dionhaefner (Collaborator)

> Another question I had: how do I get the expected time of completion or the progress of the model run while using MPI? When I'm not using MPI, it shows the model progress as well as the expected time of completion.

Veros should print progress updates even when using MPI. They may just be drowned out by the trace output. You can switch back to normal verbosity if things are working now.

> What does this line mean?
> "export OMP_NUM_THREADS=1"
> When I set the number to 8, it shows an MPI error.

This sets the number of threads used by some packages we rely on (like the SciPy solvers). Since you are using MPI for multiprocessing you shouldn't use more than 1 thread per processor.
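
As an illustration (the rank counts and setup file below are placeholders), the usual pattern is one thread per MPI rank:

```bash
# One OpenMP/BLAS thread per MPI rank; asking for 8 threads on top of a
# full set of MPI ranks oversubscribes the cores, which is a likely
# source of the error you saw.
export OMP_NUM_THREADS=1
mpirun -np 64 veros run my_setup.py -b jax -n 8 8
```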

@Sougata18 (Author)

Thanks!
I had one doubt regarding the model run status:
Current iteration: 3706 (0.71/1.00y | 42.9% | 4.78h/(model year) | 1.4h left)
What does "4.78h/(model year)" mean?

@dionhaefner (Collaborator)

Every simulated year takes 4.78 hours of real time.
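(That matches the rest of the status line: about 1.00 − 0.71 = 0.29 model years remain, and 0.29 × 4.78 h per model year ≈ 1.4 h left.)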

@Sougata18 (Author)

Here I'm trying to use 256 cores (16 nodes and 16 tasks per node):
[screenshot]

but the following error persists:
ValueError: processes do not divide domain evenly in x-direction

@dionhaefner (Collaborator)

360 (number of grid cells) isn't divisible by 16 (number of processors).
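
As a sketch, pick per-direction process counts that divide the grid. The y decomposition below is only a placeholder and must likewise divide your setup's y grid size; the total rank count passed to mpirun must equal nx * ny:

```bash
# 360 / 24 = 15, so 24 ranks in x divide the domain evenly.
# 10 ranks in y is an example only; use a divisor of your setup's ny.
mpirun -np 240 veros run my_setup.py -b jax -n 24 10
```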

@Sougata18 (Author)

Sougata18 commented Jun 3, 2023

My run got stuck for some reason, but the restart file was there with around 1.5 years of data. When I started the run again, it should have continued from the point where it last wrote to the restart file, right? But the model is running from the initial time, i.e., the 0th year.
[screenshot]

I used this code for the model run:
[screenshot]

@dionhaefner (Collaborator)

With veros resubmit, you can only restart from completed runs. If frequent crashes are a problem, I would recommend shortening each run to 1 year or so and scheduling more of them.
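
As a rough sketch, assuming the veros resubmit options documented for cluster runs (-i/--identifier, -n/--num-runs, -l/--length-per-run in seconds, -c for the run command, --callback for rescheduling; please check veros resubmit --help). The identifier, run length, rank counts, setup file, and batch script name are placeholders:

```bash
# Hypothetical: 10 consecutive runs of roughly one model year (360 days,
# in seconds) each; every run restarts from the restart file written by
# the previous completed run.
veros resubmit -i my_run -n 10 -l 31104000 \
    -c "srun --mpi=pmi2 -- veros run my_setup.py -b jax -n 4 4" \
    --callback "sbatch veros_batch.sh"
```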

@Sougata18 (Author)

[screenshot]
This MPI error persists. Can you please check?

@dionhaefner (Collaborator)

I can't debug this without further information. Please dump a logfile with --loglevel trace and ideally also export MPI4JAX_DEBUG=1.

I also suggest you get in contact with your cluster support about how MPI should be called. For example, whether --mpi=pmi2 is the correct flag.
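
For example, something along these lines inside the batch script (the srun flags are copied from your current script and may need to change based on your admins' advice; the setup file and decomposition are placeholders and must match your allocation):

```bash
# Turn on mpi4jax debug output and Veros trace logging, and capture
# everything in a logfile that can be attached to this issue.
export MPI4JAX_DEBUG=1
srun --mpi=pmi2 -- veros run my_setup.py -b jax -n 24 10 --loglevel trace > debug.log 2>&1
```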

@Sougata18 (Author)

Sure! I'll ask the admin.
Here are both the output and error files.
output_and_error.zip

@dionhaefner (Collaborator)

Thanks, this is useful. Looks like this may be a problem on our end with mismatched MPI calls. I'll keep looking.

@dionhaefner (Collaborator)

dionhaefner commented Jun 13, 2023

Actually it looks to me like the MPI calls are correctly matched, it's just that one rank (r33) stops responding for some reason. Unfortunately this will be almost impossible to debug for me. I suggest you talk to your cluster support about it. In the meantime, here are some things you could try as a workaround:

  • Use a different MPI library (if available)
  • Use fewer MPI ranks and hope that fixes it
  • Only use one type of node (I noticed you're running on some nodes called cn and some called gpu; mixing different architectures may be a cause here)
  • Since you seem to have access to GPU nodes, you could try running Veros on a few GPUs and probably get similar performance to many CPUs.

Hope that helps.

@Sougata18 (Author)

Thanks! I will try these and let you know.
