
GFDL-ESM2M piControl does not run #377

Open · Jete90 opened this issue Feb 14, 2023 · 5 comments

Jete90 commented Feb 14, 2023

Hello,

I downloaded the MOM5 code to the WHOI supercomputer.

After compiling GFDL-ESM2M, I tried to run it, but it quickly failed with segmentation faults. I have attached the error message below.

It might be due to the modules/compiler versions that I am using.

This is what my environment looks like:

source $MODULESHOME/init/csh
module load intel
module load netcdf/intel/4.6.1
module load openmpi/intel

setenv mpirunCommand "mpirun -np"

Kind regards

Jens


ERROR MESSAGE


[...]

LND(ATMOCNLND)= 0.153673308874230 0.153673308874230 0.153673308871445
NOTE from PE 0: xgrid_mod: reading exchange grid information from mosaic grid file
NOTE from load_xgrid(xgrid_mod): field 'scale' exist in the file INPUT/land_mosaicXocean_mosaic.nc, this field will be read and the exchange grid cell area will be multiplied by scale
Checked data is array of constant 1
LND(LNDOCN)= 0.703873657789463 0.703873657789466 0.703873657789463
OCN(LNDOCN)= 0.703873657789467 0.703873657789463 0.703873657789466

FATAL from PE 31: ==>Error from coupler_types_mod (CT_spawn_1d_3d): Disordered k-dimension index bound list 1 0

FATAL from PE 32: ==>Error from coupler_types_mod (CT_spawn_1d_3d): Disordered k-dimension index bound list 1 0

[.....]

fms_ESM2M.x 0000000000452D04 Unknown Unknown Unknown
fms_ESM2M.x 000000000045BD03 Unknown Unknown Unknown
fms_ESM2M.x 00000000004556BF Unknown Unknown Unknown
fms_ESM2M.x 000000000040E19E Unknown Unknown Unknown
libc-2.17.so 00002AAAAC544555 __libc_start_main Unknown Unknown
fms_ESM2M.x 000000000040E0A9 Unknown Unknown Unknown

MPI_ABORT was invoked on rank 30 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
fms_ESM2M.x 0000000002A8FDEE for__signal_handl Unknown Unknown
libpthread-2.17.s 00002AAAAC315630 Unknown Unknown Unknown
libpthread-2.17.s 00002AAAAC312573 pthread_spin_lock Unknown Unknown
[... the same forrtl error (78) traceback repeats for three more ranks ...]
[pn030:263631] *** Process received signal ***
[pn030:263631] Signal: Segmentation fault (11)
[pn030:263631] Signal code: Address not mapped (1)
[pn030:263631] Failing at address: 0x28
[pn030:263631] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaaabe1d630]
[pn030:263631] [ 1] /vortexfs1/apps/openmpi-3.0.1-intel/lib/openmpi/mca_pmix_pmix2x.so(+0xb2723)[0x2aaab86c1723]
[pn030:263631] [ 2] /vortexfs1/apps/openmpi-3.0.1-intel/lib/openmpi/mca_pmix_pmix2x.so(pmix_ptl_base_recv_handler+0x579)[0x2aaab86c24a9]
[pn030:263631] [ 3] /vortexfs1/apps/openmpi-3.0.1-intel/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xa09)[0x2aaaab021829]
[pn030:263631] [ 4] /vortexfs1/apps/openmpi-3.0.1-intel/lib/openmpi/mca_pmix_pmix2x.so(+0x9d0f2)[0x2aaab86ac0f2]
[pn030:263631] [ 5] /lib64/libpthread.so.0(+0x7ea5)[0x2aaaabe15ea5]
[pn030:263631] [ 6] /lib64/libc.so.6(clone+0x6d)[0x2aaaac128b0d]
[pn030:263631] *** End of error message ***
Segmentation fault
ERROR: Model failed to run to completion

russfiedler (Collaborator) commented

@Jete90 This bug originates from using an old netCDF version, as documented in NOAA-GFDL/CM4#11 and NOAA-GFDL/icebergs#44.

You'll need to update to netCDF 4.7.3 or later.
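If it's unclear which netCDF the executable actually links against, a quick check is possible from Fortran itself. This is just a minimal sketch (not from this thread), assuming only the standard netcdf-fortran module is available:

    program check_netcdf_version
      ! Print the version of the netCDF-C library this binary is linked
      ! against, to confirm the update to >= 4.7.3 actually took effect.
      use netcdf, only: nf90_inq_libvers
      implicit none
      print *, 'linked netCDF library: ', trim(nf90_inq_libvers())
    end program check_netcdf_version

Compile it with the same modules loaded as for the model build (nf-config --flibs gives the link flags).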


wienkers commented Mar 8, 2023

As a follow-up to Jens' question: does this mean that many of the .res.nc restart files included in the provided ESM2M piControl test setup are corrupt? I have netCDF v4.7.4, and regardless of whether I compile with the netCDF4 flag on or off, I still receive the same error Jens ran into.
Thank you in advance for your help!
Aaron

russfiedler (Collaborator) commented

@wienkers The bug was specific to the iceberg restarts, as far as I remember. It's quite possible there are other problems with non-ocean restarts.


wienkers commented Mar 9, 2023

Thank you for the quick reply, @russfiedler.
After a bit more digging, this no longer seems to arise from the netCDF bug. The error:

Error from coupler_types_mod (CT_spawn_1d_3d): Disordered k-dimension index bound list    1    0   

points back to flux_exchange_init, where

    call mpp_get_compute_domain( Ice%domain, is, ie, js, je )
    kd = size(Ice%ice_mask,3)
    call coupler_type_copy(ex_gas_fields_ice, Ice%ocean_fields, is, ie, js, je, kd,     &
         'ice_flux', Ice%axes, Time, suffix = '_ice')

At run time, kd = 6 on the Ice/Atm processes (as it should be, for num_part = 6 in the input.nml), but kd = 0 on the Ocean processes, each of which then throws the error. This block of code is evaluated on all processes; however, the call to subroutine ice_model_init in coupler_init, which allocates Ice%ice_mask, appears to occur only on the Ice processes. On the Ocean processes Ice%ice_mask is never allocated, so size(Ice%ice_mask,3) simply comes back as 0.
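For illustration, a standalone sketch of that failure mode (hypothetical, not code from MOM5): querying size() on an array that was never allocated is undefined behaviour in Fortran, but built without runtime checks it commonly evaluates to 0 with ifort, matching the kd = 0 seen on the Ocean PEs.

    program unalloc_size_demo
      implicit none
      real, allocatable :: ice_mask(:,:,:)   ! stand-in for Ice%ice_mask
      ! On the Ocean PEs, ice_model_init never runs, so the array stays
      ! unallocated and its k-extent is meaningless.
      if (allocated(ice_mask)) then
        print *, 'kd =', size(ice_mask, 3)
      else
        print *, 'ice_mask never allocated; size() would be undefined (often 0)'
      end if
    end program unalloc_size_demo

An allocated() guard like the one above would also be a cheap way to confirm this diagnosis inside flux_exchange_init.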

russfiedler (Collaborator) commented

@wienkers Ah, yes, I vaguely remember that this was a possibility and that the block should only be evaluated on Ice processors. I can't remember whether it's sufficient to wrap the code in an if (Ice%pe) then ... endif block, but it should be.
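A minimal sketch of that guard around the excerpt quoted above (untested; it assumes Ice%pe is .true. only on PEs that run the ice model):

    if (Ice%pe) then
       ! Only the ice PEs have Ice%ice_mask allocated by ice_model_init,
       ! so only they can take a meaningful k-extent here.
       call mpp_get_compute_domain( Ice%domain, is, ie, js, je )
       kd = size(Ice%ice_mask,3)
       call coupler_type_copy(ex_gas_fields_ice, Ice%ocean_fields, is, ie, js, je, kd,     &
            'ice_flux', Ice%axes, Time, suffix = '_ice')
    endif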
