OpenMPI: Update to 3.0.5+|3.1.5+|4.0.1+ or Use ROMIO for IO #446

Open
3 of 5 tasks
ax3l opened this issue Jan 21, 2019 · 1 comment

@ax3l
Member

ax3l commented Jan 21, 2019

A note on using openPMD-api, especially the parallel HDF5 backend, with OpenMPI:

OpenMPI's default for its IO backend is OMPIO, starting with 2.x.

The issues below are fixed in OpenMPI versions:

  • v2.0: affected, not fixed (end-of-life)
  • v2.x: affected, not fixed (end-of-life)
  • v3.0.4 or newer
  • v3.1.4 or newer
  • v4.0.1 or newer

Unfortunately, that backend contains severe bugs leading to data corruption and sporadic crashes as of the latest releases (affected: 2.X to 3.1.3 and 4.0.0). So far we have seen those issues with parallel HDF5, but since other MPI-IO-parallel methods such as ADIOS use the same MPI-IO API, they are potentially affected as well. Please see open-mpi/ompi#6285 for details.
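To make the version list above easier to check in a script, here is a minimal sketch in Python. The function name `ompio_affected` and the table `FIXED_IN` are hypothetical helpers, not part of OpenMPI or openPMD-api. The sketch covers only the data corruption and crash issues described here, and it assumes that release series newer than 4.0 already contain the fixes, which this issue does not state explicitly.

```python
import re

# First fixed patch release per series, taken from the list above.
FIXED_IN = {(3, 0): 4, (3, 1): 4, (4, 0): 1}


def ompio_affected(version: str) -> bool:
    """Return True if this OpenMPI version defaults to the buggy OMPIO backend."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse OpenMPI version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if major < 2:
        # OMPIO only became the default IO backend with 2.x.
        return False
    if major == 2:
        # End-of-life series: affected and not fixed.
        return True
    fixed_patch = FIXED_IN.get((major, minor))
    if fixed_patch is None:
        # Assumption: series newer than 4.0 include the fixes.
        return False
    return patch < fixed_patch
```

A version string such as the one printed by `mpirun --version` can be passed in after extracting the `X.Y.Z` part.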

As a work-around for all systems that rely on OpenMPI (and its derivatives, such as BullMPI), disable the "OMPIO" default IO backend and fall back to the existing ROMIO backend for MPI-I/O until fixed versions are available.

Available runtime switches:

export OMPI_MCA_io=^ompio
mpirun ...

or

mpirun --mca io ^ompio ...
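For Python scripts, the same environment variable can in principle be set from within the process, as long as this happens before MPI is initialized. The following is a sketch only: `disable_ompio` is a hypothetical helper name, and the comment assumes `mpi4py` as the MPI binding, which by default initializes MPI when `mpi4py.MPI` is first imported.

```python
import os


def disable_ompio(env=None):
    """Exclude the OMPIO component so that OpenMPI falls back to ROMIO.

    Equivalent to `export OMPI_MCA_io=^ompio` in the shell.
    """
    env = os.environ if env is None else env
    env["OMPI_MCA_io"] = "^ompio"
    return env


# Usage (assumption: mpi4py is installed and initializes MPI on import):
#   disable_ompio()
#   from mpi4py import MPI  # must come after the call above
```

Setting the variable in the job script or on the `mpirun` command line remains the more robust choice, since it does not depend on import order.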

Other MPI implementations such as MPICH, and MPICH-based flavors such as IntelMPI, use ROMIO by default (they develop ROMIO) and are not affected.

@ax3l ax3l added the labels bug (affects latest release), third party (third party libraries that are shipped and/or linked), backend: HDF5, backend: ADIOS1 Jan 21, 2019
@ax3l ax3l pinned this issue Feb 5, 2019
@ax3l ax3l changed the title OpenMPI: Use ROMIO for IO OpenMPI: Update to Latest or Use ROMIO for IO Jul 25, 2019
@ax3l ax3l changed the title OpenMPI: Update to Latest or Use ROMIO for IO OpenMPI: Update to 4.0.1 or Use ROMIO for IO Sep 27, 2019
@ax3l ax3l changed the title OpenMPI: Update to 4.0.1 or Use ROMIO for IO OpenMPI: Update to 3.0.4+|3.1.4+|4.0.1+ or Use ROMIO for IO Sep 27, 2019
@ax3l
Member Author

ax3l commented Nov 26, 2019

Another OpenMPI issue limited file writes to 2GByte (per rank).
Update to 3.0.5+, 3.1.5+. Status in 4.0.* release series unknown. -> 4.0.3+ it seems

@ax3l ax3l changed the title OpenMPI: Update to 3.0.4+|3.1.4+|4.0.1+ or Use ROMIO for IO OpenMPI: Update to 3.0.5+|3.1.5+|4.0.1+ or Use ROMIO for IO Nov 26, 2019
ax3l added a commit to ax3l/openPMD-api that referenced this issue Sep 22, 2021
Document OpenMPI MPI-I/O backend control.

We have documented this long in openPMD#446.
franzpoeschel pushed a commit that referenced this issue Sep 24, 2021
Document OpenMPI MPI-I/O backend control.

We have documented this long in #446.
ax3l added a commit to ax3l/openPMD-api that referenced this issue Nov 3, 2021
Document OpenMPI MPI-I/O backend control.

We have documented this long in openPMD#446.
ax3l added a commit that referenced this issue Nov 4, 2021
* Read: time/dt also in long double (#1096)

* Python: time/dt round-trip

Test writing and reading time and dt on an iteration via properties.

* Fix: Iteration read of long double time

Support reading of `dt` and `time` attributes if they are of type
`long double`. (openPMD standard: all `floatX` supported)

* Executables: CXX_STANDARD/EXTENSIONS (#1102)

Set `CXX_EXTENSIONS OFF` and `CXX_STANDARD_REQUIRED ON` for created
executables.

This mitigates issues with NVCC 11.0 and C++17 builds seen as added
`-std=gnu++17` flags that lead to
```
nvcc fatal   : Value 'gnu++17' is not defined for option 'std'
```
when using `nvcc` as CXX compiler directly.

* Doc: More Locations -DPython_EXECUTABLE (#1104)

Mention the `-DPython_EXECUTABLE` twice more in build examples.

* NVCC + C++17 (#1103)

* NVCC + C++17

Work-around a build issue with NVCC in C++17 builds.
```
include/openPMD/backend/Attributable.hpp(437):
error #289: no instance of constructor "openPMD::Attribute::Attribute" matches the argument list
            argument types are: (std::__cxx11::string)
          detected during instantiation of "__nv_bool openPMD::AttributableInterface::setAttribute(const std::__cxx11::string &, T) [with T=std::__cxx11::string]"
```
from
```
inline bool
AttributableInterface::setAttribute( std::string const & key, char const value[] )
{
    return this->setAttribute(key, std::string(value));
}
```

Seen with:
- NVCC 11.0.2 + GCC 8.3.0
- NVCC 11.0.2 + GCC 7.5.0

* NVCC 11.0.2 C++17 work-around: Add Comment

* Lazy parsing: Make findable in docs and use in openpmd-ls (#1111)

* Use deferred iteration parsing in openpmd-ls

* Make lazy/deferred parsing searchable

* Add a way to search for usesteps key

* HDF5: Document HDF5_USE_FILE_LOCKING (#1106)

Document a HDF5 read work-around that we currently need on OLCF
Jupyter (https://jupyter.olcf.ornl.gov), due to a mounting issue
of GPFS in the Jupyter service (OLCFHELP-3685).

From the HDF5 1.10.1 Release Notes:
```
Other New Features and Enhancements
===================================

    Library
    -------
    - Added a mechanism for disabling the SWMR file locking scheme.

      The file locking calls used in HDF5 1.10.0 (including patch1)
      will fail when the underlying file system does not support file
      locking or where locks have been disabled. To disable all file
      locking operations, an environment variable named
      HDF5_USE_FILE_LOCKING can be set to the five-character string
      'FALSE'. This does not fundamentally change HDF5 library
      operation (aside from initial file open/create, SWMR is lock-free),
      but users will have to be more careful about opening files
      to avoid problematic access patterns (i.e.: multiple writers)
      that the file locking was designed to prevent.

      Additionally, the error message that is emitted when file lock
      operations set errno to ENOSYS (typical when file locking has been
      disabled) has been updated to describe the problem and potential
      resolution better.

      (DER, 2016/10/26, HDFFV-9918)
```

This also exists as a compilation option for HDF5 in CMake, where it
defaults to ``TRUE``, which is also what distributions/
package managers ship.

Disabling from Bash:
```bash
export HDF5_USE_FILE_LOCKING=FALSE
```

Disabling from Python:
```py
import os
os.environ['HDF5_USE_FILE_LOCKING'] = "FALSE"
```

* Avoid object slicing when deriving from Series class (#1107)

* Make Series class final

* Use private constructor to avoid object slicing

* Doc: OMPI_MCA_io Control (#1114)

Document OpenMPI MPI-I/O backend control.

We have documented this long in #446.

* openPMD.hpp: Include auxiliary StringManip (#1124)

Include this, handy functions.

* CXX Std: Remember <variant> Impl. (#1128)

We use `<variant>` or `<mpark/variant.hpp>` in our public API
interface for datatypes, depending on the C++ standard.

This pull request makes sure that the same implementation is used
in downstream code, even if the C++ standard is switched. This avoids
ABI issues when, e.g., using a C++14 built openPMD-api in a C++17
downstream code.

* Spack: No More `load -r` (#1125)

The `-r` argument was removed from `spack load` and is now implied.

* Fix AppVeyor: Python Executable (#1127)

* GH Action: Add MSVC & ClangCL on Win

* Fix AppVeyor: Python Executable

* Avoid mismatching system Python and Conda Python
* Conda: Fix Numpy

* CMake: Skip Pipe Test

Written in a too special way, we cannot assume SH is always present

* Test 8b (Bench Read Parallel): Support Variable encoding, Fix Bugs (#1131)

* added support to read variable encoding, plus fixed some bugs

* fixed style

* Update examples/8b_benchmark_read_parallel.cpp

remove commented out code

Co-authored-by: Axel Huebl <axel.huebl@plasma.ninja>

* removed commented line

* updated 8b env option

Co-authored-by: Axel Huebl <axel.huebl@plasma.ninja>

* HDF5 I/O optimizations (#1129)

* Include HDF5 optimization options

* Fix code style check

* Fix validations and include checks

* Fix style check

* Remove unnecessary strict check

* Update documentation with HDF5 tuning options

* Update contributions

* Fix Guards for H5Pset_all_coll_metadata*

* MPI Guard: H5Pset_all_coll_metadata*

* Remove duplicated variable

Co-authored-by: Axel Huebl <axel.huebl@plasma.ninja>

* Include known issues section for HDF5 (#1132)

* Update known issues with HDF5 and collective metadata operations

* Fix rst link and tiny typo

* Add targeted bugfix releases.

Co-authored-by: Axel Huebl <axel.huebl@plasma.ninja>

* Include check for paged allocation (#1133)

* Include check for paged allocation

* Update ParallelHDF5IOHandler.cpp

* libfabric 1.6+: Document SST Work-Arounds (#1134)

* libfabric 1.6+: Document SST Work-Arounds

Document work-arounds for libfabric 1.6+ on Cray systems when using
data staging / streaming with ADIOS2 SST.

Co-authored-by: Franz Pöschel <franz.poeschel@gmail.com>

* Fix: Read Inconsistent Zero Pads (#1118)

* [Draft] Fix: Read Inconsistent Zero Pads

Some codes mess up the zero-padding in `fileBased` encoding, e.g.,
when specifying padding to 5 digits but creating >100'000 output
steps.

Files like those cannot yet be parsed and fell back to no padding,
which fails to open the file:
```
openpmd_00000.h5
openpmd_02000.h5
openpmd_101000.h5
openpmd_01000.h5
openpmd_100000.h5
openpmd_104000.h5
```

Error:
```
RuntimeError: [HDF5] Failed to open HDF5 file diags/diag1/openpmd_0.h5
```

* Revert previous changes except for test

Parse iteration numbers that are longer than their padding

Read inconsistent zero padding

* Overflow Padding: Read Test

* Warn if the prefix does end in a digit

* Fix: Don't let oversize numbers accidentally bump the padding

* Update test

* Issue warnings on misleading patterns also when writing

* Minor Style Update

Co-authored-by: Franz Pöschel <franz.poeschel@gmail.com>

* Release: 0.14.3

Co-authored-by: Franz Pöschel <franz.poeschel@gmail.com>
Co-authored-by: guj <guj@users.noreply.github.com>
Co-authored-by: Jean Luca Bez <jeanlucabez@gmail.com>
Co-authored-by: Jean Luca Bez <jlbez@lbl.gov>