Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update OpenFOAM demo #32

Merged
merged 11 commits into from
May 8, 2024
Merged

Update OpenFOAM demo #32

merged 11 commits into from
May 8, 2024

Conversation

laraPPr
Copy link
Contributor

@laraPPr laraPPr commented Apr 19, 2024

No description provided.

@laraPPr
Copy link
Contributor Author

laraPPr commented Apr 19, 2024

The OpenFOAM demo can trigger this issue with v2023.06 of the EESSI production repo.

Also opened a pr to the docs with the workaround: EESSI/docs#172

surfaceFeatures 2>&1 | tee log.surfaceFeatures
blockMesh 2>&1 | tee log.blockMesh
decomposePar -copyZero 2>&1 | tee log.decomposePar
mpirun -np $NP -ppn $PPN -hostfile hostlist snappyHexMesh -parallel -overwrite 2>&1 | tee log.snappyHexMesh

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A hostfile is a must here without which the mpirun command would fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this true here but not below?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than try to support multi-node in the demos, I think it would be better to just add the disclaimer that they are prepared for single node use cases, but could be extended to multi-node cases. If you submit this job via SLURM you wouldn't need the complication of a hostfile

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this stands, I don't think it would work would it since there is no hostlist file?

Copy link
Member

@ocaisa ocaisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output file should be removed

surfaceFeatures 2>&1 | tee log.surfaceFeatures
blockMesh 2>&1 | tee log.blockMesh
decomposePar -copyZero 2>&1 | tee log.decomposePar
mpirun -np $NP -ppn $PPN -hostfile hostlist snappyHexMesh -parallel -overwrite 2>&1 | tee log.snappyHexMesh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mpirun -np $NP -ppn $PPN -hostfile hostlist snappyHexMesh -parallel -overwrite 2>&1 | tee log.snappyHexMesh
mpirun -np $NP --oversubscribe snappyHexMesh -parallel -overwrite 2>&1 | tee log.snappyHexMesh

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have removed this in my latest script which @laraPPr is still testing. Her next commit should make things clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Satish script for version 11 it is changed to this mpirun -np $NP -npernode $PPN snappyHexMesh -parallel -overwrite 2>&1 | tee log.snappyHexMesh. And that one now works correctly. I'm also less inclined to change this one because it still does something and is only their for the OpenFOAM in the pilot repo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really have a problem with this (since it will soon be lost to history anyway), but the file hostlist does not exist as far as I can see, so it really does nothing (or is it just a workaround for the use of -ppn?).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually -ppn is deprecated and -npernode is to be used instead with mpirun. Indeed the hostlist thing didn't make any sense to me so I also removed it as @ocaisa did. We can also get rid of --oversubscribe option and just declare that one needs to have at least 8 cores ( I would say we even need more) to run this script.

@ocaisa
Copy link
Member

ocaisa commented May 1, 2024

Can we change OpenFOAM8.sh to OpenFOAM_v8.sh and OpenFOAM_11.sh to OpenFOAM_v11.sh?

@boegel
Copy link
Collaborator

boegel commented May 1, 2024

There's a lot going on in this PR, and there may be some parts in the current demo script that isn't fully correct.

Rather than trying to fix all problems in one go, we should try and get the minimal things done so the OpenFOAM demo can be run using the OpenFOAM installations available in software.eessi.io.

This demo is by no means meant to be a perfect example of running OpenFOAM, so in the short term it's fine if some of the substeps aren't currently doing what they should be doing. The main thing is that we can show people that the OpenFOAM installation provided by EESSI runs (not that we're experts in OpenFOAM).

So, perhaps we should open a separate PR first with the minimal work to do to get the OpenFOAM demo running with software.eessi.io, and open an issue with an overview of the problems to fix?

Copy link
Member

@ocaisa ocaisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with software.eessi.io and worked just fine

@ocaisa
Copy link
Member

ocaisa commented May 2, 2024

As a side point, on 4 cores I actually got an MPI error from snappyHexMesh, which made me notice that we don't actually check the return codes before proceeding to the next step.

Having said that, I agree with @boegel that we should just get this merged and worry about tweaking the scripts later (or never 😛 ).

@satishskamath
Copy link

As a side point, on 4 cores I actually got an MPI error from snappyHexMesh, which made me notice that we don't actually check the return codes before proceeding to the next step.

Having said that, I agree with @boegel that we should just get this merged and worry about tweaking the scripts later (or never 😛 ).

Oh yeah, we do not check the error codes from the previous steps at all. :D Btw what was the error. I am curious. :)

@ocaisa
Copy link
Member

ocaisa commented May 2, 2024

Oh yeah, we do not check the error codes from the previous steps at all. :D Btw what was the error. I am curious. :)

Surface refinement iteration 7
------------------------------

Marked for refinement due to surface intersection          : 3543 cells.
Marked for refinement due to curvature/regions             : 1282668 cells.
Determined cells to refine in = 1.43973 s
Selected for refinement : 1458107 cells (out of 7448582)
Balanced mesh in = 16.9464 s
After balancing surface refinement iteration 7 : cells:7448582  faces:24702264  points:9838521
Cells per refinement level:
    0   159607
    1   1429
    2   5209
    3   27745
    4   163998
    5   777950
    6   3596644
    7   2716000
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 275331 on node node1 exited on signal 9 (Killed).
--------------------------------------------------------------------------

@laraPPr
Copy link
Contributor Author

laraPPr commented May 2, 2024

Ok So I now remove the --oversubscribe flag everywhere and I added it to the run.sh script. And also remove the -hosfile flag form the bike_OpenFOAM_v8.sh script.

OpenFOAM/run.sh Outdated Show resolved Hide resolved
Co-authored-by: ocaisa <alan.ocais@cecam.org>
@laraPPr
Copy link
Contributor Author

laraPPr commented May 3, 2024

@ocaisa is it now ok? Or should I change something else?

@boegel
Copy link
Collaborator

boegel commented May 3, 2024

Oh yeah, we do not check the error codes from the previous steps at all. :D Btw what was the error. I am curious. :)

Surface refinement iteration 7
------------------------------
...
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 275331 on node node1 exited on signal 9 (Killed).
--------------------------------------------------------------------------

That looks like a segfault due to memory limit?

@ocaisa
Copy link
Member

ocaisa commented May 3, 2024

Oh yeah, we do not check the error codes from the previous steps at all. :D Btw what was the error. I am curious. :)

Surface refinement iteration 7
------------------------------
...
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 275331 on node node1 exited on signal 9 (Killed).
--------------------------------------------------------------------------

That looks like a segfault due to memory limit?

Could be, I didn't investigate

OpenFOAM/run.sh Outdated Show resolved Hide resolved
@laraPPr
Copy link
Contributor Author

laraPPr commented May 6, 2024

@boegel I updated the PR

@ocaisa
Copy link
Member

ocaisa commented May 8, 2024

I'm merging this and following up with a PR to drop the use of the pilot

@ocaisa ocaisa merged commit 684ef83 into EESSI:main May 8, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants