MPI_Type, MPI_Alltoallw, mpp_global_field update #5

Open

aidanheerdegen wants to merge 1 commit into master

Conversation

aidanheerdegen

This is the work of @marshallward

git cherry-pick d216cfd4506be1f369c8a265cb40e9f9b34321c2


This patch contains three new features for FMS: Support for MPI datatypes, an
MPI_Alltoallw interface, and modifications to mpp_global_field to use these
changes for select operations.

These changes were primarily made to improve stability of large (>4000
rank) MPI jobs under OpenMPI at NCI.

There are differences in the performance of mpp_global_field,
occasionally even very large differences, but there is no consistency
across various MPI libraries.  One method will be faster in one library,
and slower in another, even across MPI versions.  Generally, the
MPI_Alltoallw method showed improved performance on our system, but this
is not a universal result.  We therefore introduce a flag to control
this feature.

The inclusion of MPI_Type support may also be seen as an opportunity to
introduce other new MPI features for other operations, e.g. halo
exchange.

Detailed changes are summarised below.

- MPI data transfer type ("MPI_Type") support has been added to FMS.  This is
  done with the following features:

  - An `mpp_type` derived type has been added, which manages the type details
    and hides the MPI internals from the model developer.  Types are managed
    inside an internal linked list, `datatypes` (a sketch of such a wrapper
    appears after this list).

    Note: The name `mpp_type` is very similar to the preprocessor variable
    `MPP_TYPE_` and should possibly be renamed to something else, e.g.
    `mpp_datatype`.

  - `mpp_type_create` and `mpp_type_free` are used to create and release these
    types within the MPI library.  These append and remove mpp_types from the
    internal linked list, and include reference counters to manage duplicates.

  - An `mpp_byte` type is created as a module-level variable for default
    operations.

    NOTE: As the first element of the list, it also inadvertently provides
    access to the rest of `datatypes`, which is private, but there are probably
    ways to address this.

- An MPI_Alltoallw wrapper, using the new MPI type support, has been added to
  the `mpp_alltoall` interface.

- An implementation of `mpp_global_field` using MPI_Alltoallw and `mpp_type`s
  has been added.  In addition to replacing the point-to-point operations with
  a collective, it also eliminates the need to use the internal MPP stack.

  Since MPI_Alltoallw requires that the input field be contiguous, it is only
  enabled for data domains (i.e. compute + halo).  This limitation could be
  overcome, either by copying or by paying more careful attention to layout,
  but that is left for a future patch.

  This method is enabled via the `mpp_domains_nml` namelist group, by setting
  the `use_alltoallw` flag to `.true.` (see the example namelist and MPI sketch
  after this list).
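
As a concrete illustration of the wrapper described above, here is a minimal
sketch of the kind of reference-counted derived type and linked list that can
hide an MPI datatype handle from the model developer.  This is illustrative
only, not the FMS declarations; all type and field names are assumptions.

```fortran
! Illustrative sketch only (not the FMS source): a reference-counted wrapper
! around an MPI datatype handle, stored in a doubly linked list such as
! `datatypes`.  All names here are hypothetical.
type :: mpp_type
  integer :: id      = -1   ! underlying MPI datatype handle (e.g. MPI_DATATYPE_NULL)
  integer :: counter = 0    ! reference count; the MPI type is freed when it reaches 0
  integer :: ndims   = 0
  integer, allocatable :: sizes(:), subsizes(:), starts(:)
  type(mpp_type), pointer :: prev => null()
  type(mpp_type), pointer :: next => null()
end type mpp_type

type :: mpp_type_list
  type(mpp_type), pointer :: head => null()   ! e.g. mpp_byte, the default type
  type(mpp_type), pointer :: tail => null()
  integer :: length = 0
end type mpp_type_list
```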
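
The sketch below shows the mechanism the new `mpp_global_field` path relies on:
per-rank subarray datatypes are created and committed, a single MPI_Alltoallw
call places each rank's contiguous data domain directly into the global field,
and the types are freed afterwards.  This is not the FMS code: it assumes, for
simplicity, that every rank holds a tile of the same shape, and the routine and
argument names are hypothetical.

```fortran
! Illustrative sketch only, not the FMS implementation: gathering a global
! field with MPI_Alltoallw and subarray datatypes, assuming every rank holds
! a tile of the same shape (isize x jsize).
subroutine gather_global_field(local, global, isize, jsize, gisize, gjsize, &
                               istart, jstart, npes, comm)
  use mpi
  implicit none
  integer, intent(in) :: isize, jsize          ! local (data domain) extents
  integer, intent(in) :: gisize, gjsize        ! global field extents
  integer, intent(in) :: npes, comm
  integer, intent(in) :: istart(npes), jstart(npes)  ! global start of each tile
  real, intent(in)  :: local(isize, jsize)
  real, intent(out) :: global(gisize, gjsize)

  integer :: p, ierr, loc_type
  integer :: sendtype(npes), recvtype(npes)
  integer :: sendcount(npes), recvcount(npes), sdispls(npes), rdispls(npes)

  ! Every rank sends its whole (contiguous) local field to every other rank.
  call MPI_Type_contiguous(isize*jsize, MPI_REAL, loc_type, ierr)
  call MPI_Type_commit(loc_type, ierr)
  sendtype(:) = loc_type

  ! From rank p we receive one subarray that lands directly in the right part
  ! of the global field, so no intermediate (MPP stack) buffer is needed.
  do p = 1, npes
    call MPI_Type_create_subarray(2, [gisize, gjsize], [isize, jsize],       &
                                  [istart(p) - 1, jstart(p) - 1],            &
                                  MPI_ORDER_FORTRAN, MPI_REAL, recvtype(p), ierr)
    call MPI_Type_commit(recvtype(p), ierr)
  end do

  sendcount(:) = 1; recvcount(:) = 1
  sdispls(:) = 0;   rdispls(:) = 0   ! byte displacements; offsets live in the types

  call MPI_Alltoallw(local,  sendcount, sdispls, sendtype, &
                     global, recvcount, rdispls, recvtype, comm, ierr)

  do p = 1, npes
    call MPI_Type_free(recvtype(p), ierr)
  end do
  call MPI_Type_free(loc_type, ierr)
end subroutine gather_global_field
```

With the feature built in, the collective path is then switched on through the
namelist, e.g.:

```fortran
&mpp_domains_nml
    use_alltoallw = .true.
/
```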

Provisional interfaces for SHMEM and serial ("nocomm") builds have been added,
although they are untested and are primarily placeholders for now.

This patch also includes the following changes to support this work.

- In `get_peset`, the method used to generate MPI subcommunicators has been
  changed; specifically `MPI_Comm_create` has been replaced with
  `MPI_Comm_create_group`.  The former is blocking over all ranks, while the
  latter is only blocking over ranks in the subgroup.

  This was done to accommodate IO domains of a single rank, usually due to
  masking, which would result in no communication and cause a model hang.

  It seems that more recent changes in FMS related to handling single-rank
  communicators were made to prevent this particular scenario, but I still
  think that it is more correct to use `MPI_Comm_create_group`, so I have left
  the change in.

  This is an MPI 3.0 feature, so it may be an issue for older MPI libraries
  (a sketch of the call follows this list).

- Logical interfaces have been added to `mpp_alltoall` and `mpp_alltoallv`.

- Single-rank PE checks in mpp_alltoall were removed to prevent model hangs
  with the subcommunicators.

- NULL_PE checks have been added to the original point-to-point implementation
  of mpp_global_field, although these may not be required anymore due to
  changes in the subcommunicator implementation.

  This work was by Nic Hannah, and may actually be part of an existing pull
  request.  (TODO: Check this!)

- Timer events have been added to mpp_type_create and mpp_type_free, although
  they are not yet initialized anywhere.

- The diagnostic field count was increased from 150 to 250, to support the
  current needs of researchers.
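
For reference, here is a minimal sketch of building a subcommunicator with
`MPI_Comm_create_group` instead of `MPI_Comm_create`.  It is not the actual
`get_peset` code; the routine and variable names are hypothetical.

```fortran
! Illustrative sketch (not the FMS get_peset code).  MPI_Comm_create is
! collective over *all* ranks of the parent communicator, whereas
! MPI_Comm_create_group (MPI 3.0) is collective only over the ranks in the
! group, so a single-rank IO domain cannot stall everyone else.
subroutine make_subcomm(parent_comm, pelist, npes, new_comm)
  use mpi
  implicit none
  integer, intent(in)  :: parent_comm, npes
  integer, intent(in)  :: pelist(npes)   ! ranks (within parent_comm) in the subset
  integer, intent(out) :: new_comm
  integer :: parent_group, sub_group, ierr

  call MPI_Comm_group(parent_comm, parent_group, ierr)
  call MPI_Group_incl(parent_group, npes, pelist, sub_group, ierr)

  ! Old approach (blocks until every rank of parent_comm makes the call):
  !   call MPI_Comm_create(parent_comm, sub_group, new_comm, ierr)
  ! New approach: only the ranks in sub_group participate; the tag (0 here)
  ! distinguishes concurrent calls.
  call MPI_Comm_create_group(parent_comm, sub_group, 0, new_comm, ierr)

  call MPI_Group_free(sub_group, ierr)
  call MPI_Group_free(parent_group, ierr)
end subroutine make_subcomm
```
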
@aidanheerdegen (Author)

Hey @marshallward, I am just merging this work you did into our FMS fork. Can you confirm that this is OK? I assume the code is fine; it is what was accepted into upstream FMS, but I guess we had to branch before it was merged into master.

@marshallward (Collaborator)

Not really in a good position to test it out, but it looks OK to me. If it's not breaking your runs then I suspect it's fine to merge.
