
Running ngen using mpirun generates a NetCDF: HDF error #749

Closed
stcui007 opened this issue Feb 29, 2024 · 11 comments · Fixed by #819
Labels: bug Something isn't working

@stcui007
Contributor

The current ngen code runs correctly in serial mode with a NetCDF forcing file, but produces an HDF5 file-close error after finishing the time steps in MPI parallel mode.

Current behavior

The mpirun finishes the 720 time steps but outputs an HDF5 file-close error afterward, as follows:
.....
Finished 720 timesteps.
NGen top-level timings:
NGen::init: 0.651571
NGen::simulation: 0.416964
NGen::routing: 7.173e-06
HDF5-DIAG: Error detected in HDF5 (1.10.1) thread 0:
#000: H5D.c line 332 in H5Dclose(): not a dataset
major: Invalid arguments to routine
minor: Inappropriate type
NetCDF: HDF error
file: ncFile.cpp line:33
HDF5-DIAG: Error detected in HDF5 (1.10.1) thread 0:
#000: H5D.c line 332 in H5Dclose(): not a dataset
major: Invalid arguments to routine
minor: Inappropriate type
NetCDF: HDF error
file: ncFile.cpp line:33
HDF5-DIAG: Error detected in HDF5 (1.10.1) thread 0:
#000: H5D.c line 332 in H5Dclose(): not a dataset
major: Invalid arguments to routine
minor: Inappropriate type
NetCDF: HDF error
file: ncFile.cpp line:33

Expected behavior

......
Finished 720 timesteps.
NGen top-level timings:
NGen::init: 0.511167
NGen::simulation: 0.605978
NGen::routing: 1.281e-06

The run should finish cleanly like this, as it does in serial mode.

Steps to replicate behavior (include URLs)

Build the codes with:

cmake \
    -DNetCDF_ROOT=/local/lib \
    -DNGEN_WITH_MPI:BOOL=ON         \
    -DNGEN_WITH_NETCDF:BOOL=ON       \
    -DNGEN_WITH_SQLITE:BOOL=ON      \
    -DNGEN_WITH_UDUNITS:BOOL=ON      \
    -DNGEN_WITH_BMI_FORTRAN:BOOL=ON  \
    -DNGEN_WITH_BMI_C:BOOL=ON        \
    -DNGEN_WITH_PYTHON:BOOL=ON       \
    -DNGEN_WITH_ROUTING:BOOL=ON      \
    -DNGEN_WITH_TESTS:BOOL=ON        \
    -DNGEN_QUIET:BOOL=OFF            \
    -B cmake_build                   \
    -S .
cmake --build cmake_build --target ngen -j 8

Run ngen as follows, where test_partition_cats3.json is a partition file generated from the catchment_data.geojson and nexus_data.geojson hydrofabric files that come with the master branch.

mpirun -n 3 ./cmake_build/ngen data/catchment_data.geojson '' data/nexus_data.geojson '' data/example_bmi_multi_realization_config_w_netcdf.json test_partition_cats3.json

The run finishes the 720 time steps but outputs an HDF5 file-close error afterward.

@program-- program-- added the bug Something isn't working label Feb 29, 2024
@stcui007
Contributor Author

stcui007 commented Mar 4, 2024

There is some older code at https://github.com/stcui007/ngen/tree/I481_netcdf that appears to run the MPI job to completion without error; it may provide a starting point for a fix.

@stcui007
Contributor Author

stcui007 commented Mar 5, 2024

Here is the netCDF-related info:
$ nc-config --all

This netCDF 4.7.4 has been built with the following features:

  --cc            -> mpicc
  --cflags        -> -I/local/lib/include -I/local/lib/include
  --libs          -> -L/local/lib/lib -lnetcdf
  --static        -> -lhdf5_hl -lhdf5 -lm -ldl -lz -lcurl

  --has-c++       -> no
  --cxx           ->

  --has-c++4      -> yes
  --cxx4          -> mpicxx
  --cxx4flags     -> -I/local/lib/include -I/local/lib/include
  --cxx4libs      -> -L/local/lib/lib -lnetcdf_c++4 -lnetcdf

  --has-fortran   -> yes
  --fc            -> mpif90
  --fflags        -> -I/local/lib/include -I/local/lib/include
  --flibs         -> -L/local/lib/lib -lnetcdff -L/local/lib/lib -L/local/lib/lib64 -lnetcdf -lnetcdf -ldl -lm
  --has-f90       ->
  --has-f03       -> yes

  --has-dap       -> yes
  --has-dap2      -> yes
  --has-dap4      -> yes
  --has-nc2       -> yes
  --has-nc4       -> yes
  --has-hdf5      -> yes
  --has-hdf4      -> no
  --has-logging   -> no
  --has-pnetcdf   -> no
  --has-szlib     -> no
  --has-cdf5      -> yes
  --has-parallel4 -> yes
  --has-parallel  -> yes

  --prefix        -> /local/lib
  --includedir    -> /local/lib/include
  --libdir        -> /local/lib/lib
  --version       -> netCDF 4.7.4

@PhilMiller
Contributor

Which machine is that on?

@stcui007
Contributor Author

stcui007 commented Mar 5, 2024

It is on UCS3, and probably some other UCS-series machines as well.

@stcui007
Contributor Author

stcui007 commented Mar 5, 2024 via email

@stcui007
Contributor Author

stcui007 commented Mar 5, 2024

Just realized there is more info in /local/lib/lib/pkgconfig.

@stcui007
Contributor Author

stcui007 commented Mar 5, 2024

$ h5c++ -showconfig

            SUMMARY OF THE HDF5 CONFIGURATION
            =================================

General Information:
-------------------
                   HDF5 Version: 1.8.12
                  Configured on: Thu Sep 16 02:22:24 UTC 2021
                  Configured by: mockbuild@buildhw-x86-14.iad2.fedoraproject.org
                 Configure mode: production
                    Host system: x86_64-redhat-linux-gnu
              Uname information: Linux buildhw-x86-14.iad2.fedoraproject.org 5.12.14-300.fc34.x86_64 #1 SMP Wed Jun 30 18:30:21 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
                       Byte sex: little-endian
                      Libraries: static, shared
             Installation point: /usr

Compiling Options:
------------------
               Compilation Mode: production
                     C Compiler: /usr/bin/gcc ( gcc (GCC) 4.8.5 20150623 )
                         CFLAGS: -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic
                      H5_CFLAGS: -std=c99 -pedantic -Wall -Wextra -Wundef -Wshadow -Wpointer-arith -Wbad-function-cast -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -Wnested-externs -Winline -Wfloat-equal -Wmissing-format-attribute -Wmissing-noreturn -Wpacked -Wdisabled-optimization -Wformat=2 -Wunreachable-code -Wendif-labels -Wdeclaration-after-statement -Wold-style-definition -Winvalid-pch -Wvariadic-macros -Winit-self -Wmissing-include-dirs -Wswitch-default -Wswitch-enum -Wunused-macros -Wunsafe-loop-optimizations -Wc++-compat -Wstrict-overflow -Wlogical-op -Wlarger-than=2048 -Wvla -Wsync-nand -Wframe-larger-than=16384 -Wpacked-bitfield-compat -Wstrict-overflow=5 -Wjump-misses-init -Wunsuffixed-float-constants -Wdouble-promotion -Wsuggest-attribute=const -Wtrampolines -Wstack-usage=8192 -Wvector-operation-performance -Wsuggest-attribute=pure -Wsuggest-attribute=noreturn -Wsuggest-attribute=format -O3 -fomit-frame-pointer -finline-functions
                      AM_CFLAGS:
                       CPPFLAGS:
                    H5_CPPFLAGS: -D_POSIX_C_SOURCE=199506L   -DNDEBUG -UH5_DEBUG_API
                    AM_CPPFLAGS: -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_BSD_SOURCE
               Shared C Library: yes
               Static C Library: yes
  Statically Linked Executables: no
                        LDFLAGS: -Wl,-z,relro
                     H5_LDFLAGS:
                     AM_LDFLAGS:
                Extra libraries:  -lsz -lz -ldl -lm
                       Archiver: ar
                         Ranlib: ranlib
              Debugged Packages:
                    API Tracing: no

Languages:
----------
                        Fortran: yes
               Fortran Compiler: /usr/bin/gfortran ( GNU Fortran (GCC) 4.8.5 20150623 )
          Fortran 2003 Compiler: yes
                  Fortran Flags: -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic -I/usr/lib64/gfortran/modules
               H5 Fortran Flags:
               AM Fortran Flags:
         Shared Fortran Library: yes
         Static Fortran Library: yes

                            C++: yes
                   C++ Compiler: /usr/bin/g++ ( g++ (GCC) 4.8.5 20150623 )
                      C++ Flags: -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic
                   H5 C++ Flags:
                   AM C++ Flags:
             Shared C++ Library: yes
             Static C++ Library: yes

Features:
---------
                  Parallel HDF5: no
             High Level library: yes
                   Threadsafety: no
            Default API Mapping: v18
 With Deprecated Public Symbols: yes
         I/O filters (external): deflate(zlib),szip(encoder)
         I/O filters (internal): shuffle,fletcher32,nbit,scaleoffset
                            MPE: no
                     Direct VFD: no
                        dmalloc: no
Clear file buffers before write: yes
           Using memory checker: no
         Function Stack Tracing: no
                           GPFS: no
      Strict File Format Checks: no
   Optimization Instrumentation: no
       Large File Support (LFS): yes

@program--
Contributor

program-- commented Mar 5, 2024

I suspect that since NetCDF is built with parallel support and HDF5 is not, that is (at least part of) the issue.

For reference, the NetCDF file in question based on the realization config used:

$ ncinfo data/forcing/cats-27_52_67-2015_12_01-2015_12_30.nc
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes): time(720), catchment-id(3), str_dim(1)
    variables(dimensions): <class 'str'> ids(catchment-id), float64 Time(catchment-id, time), float32 APCP_surface(catchment-id, time), float32 DLWRF_surface(catchment-id, time), float32 DSWRF_surface(catchment-id, time), float32 PRES_surface(catchment-id, time), float32 SPFH_2maboveground(catchment-id, time), float32 TMP_2maboveground(catchment-id, time), float32 UGRD_10maboveground(catchment-id, time), float32 VGRD_10maboveground(catchment-id, time), float32 precip_rate(catchment-id, time)
    groups:

$ ncdump -h data/forcing/cats-27_52_67-2015_12_01-2015_12_30.nc
netcdf cats-27_52_67-2015_12_01-2015_12_30 {
dimensions:
        time = UNLIMITED ; // (720 currently)
        catchment-id = 3 ;
        str_dim = 1 ;
variables:
        string ids(catchment-id) ;
        double Time(catchment-id, time) ;
                Time:units = "ns" ;
        float APCP_surface(catchment-id, time) ;
        float DLWRF_surface(catchment-id, time) ;
        float DSWRF_surface(catchment-id, time) ;
        float PRES_surface(catchment-id, time) ;
        float SPFH_2maboveground(catchment-id, time) ;
        float TMP_2maboveground(catchment-id, time) ;
        float UGRD_10maboveground(catchment-id, time) ;
        float VGRD_10maboveground(catchment-id, time) ;
        float precip_rate(catchment-id, time) ;
}

@stcui007
Contributor Author

stcui007 commented Mar 5, 2024

The partition file:
{
"partitions":[
{"id":0,
"cat-ids":["cat-67"],
"nex-ids":["nex-68"],
"remote-connections":[]},
{"id":1,
"cat-ids":["cat-52"],
"nex-ids":["nex-34"],
"remote-connections":[]},
{"id":2,
"cat-ids":["cat-27"],
"nex-ids":["nex-26"],
"remote-connections":[]} ]
}

@program--
Contributor

We identified a potential cause for this issue: the NetCDF files are not closed before MPI_Finalize() is called in the main NGen application code, because the formulation objects still hold shared pointers to the NetCDF data providers.

In particular, clearing data_access::NetCDFPerFeatureDataProvider::shared_providers, a static std::map holding a reference to every NetCDF data provider instance, does not release the providers, because the formulations also hold a shared pointer to their respective providers.

This issue (not exactly, but close enough) is reproduced by the following minimal example:

#include <mpi.h>
#include <iostream>
#include <memory>
#include <netcdf>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    auto x = std::make_shared<netCDF::NcFile>(argv[1], netCDF::NcFile::read);
    for (const auto& kv : x->getVars()) {
        std::cout << kv.first << std::endl;
    }
    // x is still alive here, so the NcFile destructor (the file close)
    // runs after MPI_Finalize(), triggering the HDF5 diagnostics below.
    MPI_Finalize();
    return 0;
}

where compiling and running this results in:

HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 0:
  #000: H5D.c line 481 in H5Dclose(): can't decrement count on dataset ID
    major: Dataset
    minor: Unable to decrement reference count
  #001: H5Iint.c line 1227 in H5I_dec_app_ref_always_close(): can't decrement ID ref count
    major: Object ID
    minor: Unable to decrement reference count
  #002: H5Iint.c line 1197 in H5I__dec_app_ref_always_close(): can't decrement ID ref count
    major: Object ID
    minor: Unable to decrement reference count
  #003: H5Iint.c line 946 in H5I_remove(): can't remove ID node
    major: Object ID
    minor: Can't delete message
  #004: H5Iint.c line 895 in H5I__remove_common(): can't remove ID node from hash table
    major: Object ID
    minor: Can't delete message
  #005: H5Iint.c line 1077 in H5I__dec_app_ref(): can't decrement ID ref count
    major: Object ID
    minor: Unable to decrement reference count
  #006: H5Iint.c line 981 in H5I__dec_ref(): can't locate ID
    major: Object ID
    minor: Unable to find ID information (already closed?)
NetCDF: HDF error
file: ncFile.cpp  line:33
