Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix collective modules initialization and finalization #12429

Merged
merged 1 commit into from May 1, 2024

Conversation

bosilca
Copy link
Member

@bosilca bosilca commented Mar 21, 2024

This PR looks much bigger than it actually is. It unfortunately touches the initialization/finalization of all collective modules, but the changes are minimal and cleaner and the API does not change. Below a more detailed description.

Instead of allowing each collective module to present a list of functions it provide, let them register the functions they provide and save the context of the previous collective if they choose to.

There are two major benefits to this approach:

  • tighter memory management in the collective module themselves. Each collective enable and disable is called exactly once per communicator, to chain or unchain themselves from the collective function pointers struct. The disable is called in the reverse order of the enable, allowing for proper chaining of collectives.
  • they only install the functions they want. So instead of checking in the coll_select all the functions for all modules, each module can now selectively iterate over only the functions it provides.

What is still broken is the ability of a particular collective module to unchain itself in the middle of the execution. Instead, a properly implemented module will have an enable/disable flag, and it should act as a passthrough if it chooses to deactivate.

@wenduwan
Copy link
Contributor

@bosilca Is this a bugfix or enhancement? Asking because the title says 'fix' so I want to understand what is exactly broken.

@bosilca
Copy link
Member Author

bosilca commented Mar 22, 2024

It's both a fix of a lingering issue around collective module init/fini and an enhancement to provide a clean way to do it.

Currently, there is no proper way to release the communicator collective module (not the module itself, but the part it attaches to each communicator). We do have some refcount, but half of the modules are not implementing it. We also added the enable/disable callback but it has two issues: can only be triggered from the upper layer, and it must be called once per collective per module (basically a lot of unnecessary calls).

Some collective modules (hcoll/ucc) have tried to address the issue by adding an attribute to communicators hoping the delete_fn will be called, but that is a requirement only for MPI_COMM_WORLD (except when MPI_COMM_FREE is called).

@bosilca bosilca force-pushed the topic/fix_collective_init_fini branch 2 times, most recently from 047d43a to c03b06c Compare March 22, 2024 16:13
@wenduwan
Copy link
Contributor

FWIW AWS CI failed on the 1st revision yesterday with error

...
--------------------------------------------------------------------------

Although some coll components are available on your system, none of

them said that they could be used for allgatherv on a new communicator.


This is extremely unusual -- either the "basic", "libnbc" or "self" components

should be able to be chosen for any communicator.  As such, this

likely means that something else is wrong (although you should double

check that the "basic", "libnbc" and "self" coll components are available on

your system -- check the output of the "ompi_info" command).

A coll module failed to finalize properly when a communicator that was

using it was destroyed.


This is somewhat unusual: the module itself may be at fault, or this

may be a symptom of another issue (e.g., a memory problem).

--------------------------------------------------------------------------

�--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort.  There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems.  This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):


  mca_coll_base_comm_select(MPI_COMM_WORLD) failed

  --> Returned "Not found" (-13) instead of "Success" (0)

...

I'm running it again

@bosilca
Copy link
Member Author

bosilca commented Mar 22, 2024

I'm using --mca coll_base_verbose 50 to see what collective module is or not used. There is a lot of output, so this approach is mostly for small scale jobs.

@bosilca bosilca self-assigned this Mar 25, 2024
@wenduwan wenduwan force-pushed the topic/fix_collective_init_fini branch from c03b06c to 8335f06 Compare March 25, 2024 14:50
@wenduwan
Copy link
Contributor

@bosilca I made the change to the new HAN gatherv/scatterv files and force pushed to your branch

@bosilca
Copy link
Member Author

bosilca commented Mar 25, 2024

Thanks. I need to repush, I forgot to add my copyright info.

@bosilca bosilca force-pushed the topic/fix_collective_init_fini branch 2 times, most recently from 200379e to ebcc4e0 Compare March 25, 2024 15:57
@wenduwan
Copy link
Contributor

I'm seeing the same errors complaining none of the coll components can be used.

@bosilca
Copy link
Member Author

bosilca commented Mar 27, 2024

Can you run with --mca coll_base_verbose 50 and send me the output please.

@wenduwan
Copy link
Contributor

[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: registering framework coll components
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component accelerator
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component accelerator register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component tuned
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component tuned register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component monitoring
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component monitoring register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component self
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component self register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component basic
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component basic register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component adapt
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component adapt register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component ftagree
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component ftagree register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component inter
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component inter register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component libnbc
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component libnbc register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component sm
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component sm register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component sync
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component sync register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: found loaded component han
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_register: component han register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: opening coll components
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component accelerator
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component tuned
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: component tuned open function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component monitoring
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: component monitoring open function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component self
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component basic
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component adapt
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: component adapt open function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component ftagree
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component inter
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component libnbc
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: component libnbc open function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component sm
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component sync
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: found loaded component han
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: components_open: component han open function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: registering framework coll components
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component accelerator
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component accelerator register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component tuned
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component tuned register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component monitoring
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component monitoring register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component self
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component self register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component basic
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component basic register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component adapt
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component adapt register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component ftagree
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component ftagree register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component inter
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component inter register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component libnbc
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component libnbc register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component sm
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component sm register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component sync
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component sync register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: found loaded component han
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_register: component han register function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: opening coll components
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component accelerator
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component tuned
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: component tuned open function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component monitoring
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: component monitoring open function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component self
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component basic
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component adapt
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: component adapt open function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component ftagree
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component inter
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component libnbc
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: component libnbc open function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component sm
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component sync
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: found loaded component han
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: components_open: component han open function successful
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component accelerator
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component accelerator is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component tuned
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component tuned is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component monitoring
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component monitoring is not available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: close: component monitoring closed
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] mca: base: close: unloading component monitoring
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component accelerator
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component accelerator is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component tuned
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component tuned is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component monitoring
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component monitoring is not available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: close: component monitoring closed
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] mca: base: close: unloading component monitoring
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component self
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component self is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component basic
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component basic is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component adapt
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component adapt is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component ftagree
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component ftagree is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component inter
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component inter is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component libnbc
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component libnbc is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component sm
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:sm:init_query: pick me! pick me!
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component sm is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component sync
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component sync is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: querying coll component han
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:han:init_query: pick me! pick me!
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:find_available: coll component han is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component self
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component self is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component basic
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component basic is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component adapt
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component adapt is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component ftagree
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component ftagree is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component inter
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component inter is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component libnbc
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component libnbc is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component sm
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:sm:init_query: pick me! pick me!
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component sm is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component sync
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component sync is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: querying coll component han
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:han:init_query: pick me! pick me!
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:find_available: coll component han is available
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: new communicator: MPI_COMM_WORLD (cid 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: Checking all available modules
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component available: accelerator, priority: 78
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component available: tuned, priority: 30
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: new communicator: MPI_COMM_WORLD (cid 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: Checking all available modules
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component not available: self
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component disqualified: self (priority -1 < 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component available: basic, priority: 10
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:adapt:comm_query (0/MPI_COMM_WORLD): pick me! pick me!
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component available: adapt, priority: 0
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component not available: ftagree
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component disqualified: ftagree (priority -1 < 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component available: accelerator, priority: 78
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component available: tuned, priority: 30
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component not available: self
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component disqualified: self (priority -1 < 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component not available: inter
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component disqualified: inter (priority -1 < 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component available: libnbc, priority: 10
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component available: basic, priority: 10
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:adapt:comm_query (0/MPI_COMM_WORLD): pick me! pick me!
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component available: adapt, priority: 0
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component not available: ftagree
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component disqualified: ftagree (priority -1 < 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component not available: inter
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component disqualified: inter (priority -1 < 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:sm:comm_query (0/MPI_COMM_WORLD): intercomm, comm is too small, or not all peers local; disqualifying myself
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component not available: sm
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component disqualified: sm (priority -1 < 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component not available: sync
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component disqualified: sync (priority -1 < 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:han:comm_query (0/MPI_COMM_WORLD): pick me! pick me!
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component available: libnbc, priority: 10
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:sm:comm_query (0/MPI_COMM_WORLD): intercomm, comm is too small, or not all peers local; disqualifying myself
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component not available: sm
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component disqualified: sm (priority -1 < 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: component available: han, priority: 35
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: selecting       adapt, priority   0, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: selecting       basic, priority  10, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: selecting      libnbc, priority  10, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component not available: sync
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component disqualified: sync (priority -1 < 0)
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:han:comm_query (0/MPI_COMM_WORLD): pick me! pick me!
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: component available: han, priority: 35
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: selecting       adapt, priority   0, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: selecting       basic, priority  10, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: selecting      libnbc, priority  10, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: selecting       tuned, priority  30, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: selecting       tuned, priority  30, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: selecting         han, priority  35, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-2:147174] coll:base:comm_select: selecting  accelerator, priority  78, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:han:reduce_reproducible: fallback on tuned
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:han:allreduce_reproducible: fallback on tuned
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: selecting         han, priority  35, Enabled
[queue-hpc7g16xlarge-dy-hpc7g16xlarge-1:226399] coll:base:comm_select: selecting  accelerator, priority  78, Enabled
--------------------------------------------------------------------------
Although some coll components are available on your system, none of
them said that they could be used for allgatherv on a new communicator.

This is extremely unusual -- either the "basic", "libnbc" or "self" components
should be able to be chosen for any communicator.  As such, this
likely means that something else is wrong (although you should double
check that the "basic", "libnbc" and "self" coll components are available on
your system -- check the output of the "ompi_info" command).
A coll module failed to finalize properly when a communicator that was
using it was destroyed.

This is somewhat unusual: the module itself may be at fault, or this
may be a symptom of another issue (e.g., a memory problem).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_coll_base_comm_select(MPI_COMM_WORLD) failed
  --> Returned "Not found" (-13) instead of "Success" (0)

@bosilca
Copy link
Member Author

bosilca commented Mar 28, 2024

I don't think this output is with verbose set to 50 as it does not contain any information about the collective replacement.

@wenduwan
Copy link
Contributor

I don't think this output is with verbose set to 50 as it does not contain any information about the collective replacement.

I should mention that I got this error from osu_latency_mt(and many others). I see the mca is correct from log

mpirun --wdir . -n 2 --hostfile hostfile --map-by ppr:1:node --timeout 1800 --mca coll_base_verbose 50 -x LD_LIBRARY_PATH=/opt/amazon/efa/lib -x PATH --bind-to package osu-micro-benchmarks/mpi/pt2pt/osu_latency_mt

@bosilca bosilca force-pushed the topic/fix_collective_init_fini branch 4 times, most recently from 3153988 to 3333e1f Compare April 20, 2024 18:16
@bosilca
Copy link
Member Author

bosilca commented Apr 20, 2024

I found the issue, HAN was overwriting the allgatherv in all cases, even when HAN did not support the allgatherv (in which case it was overwriting it with NULL). @wenduwan please retest again.

@gkatev
Copy link
Contributor

gkatev commented Apr 20, 2024

From what I understand this PR, passes on the task of setting and unsetting (installing/uninstalling) the pointers in c_coll, from coll/base to the various components. I imagine this is a step in a direction of giving more control to individual components so they can do more advanced stuff?

Could you elaborate a little bit on the extra possibilities that this offers? (at the moment or in the future)

Currently, there is no proper way to release the communicator collective module (not the module itself, but the part it attaches to each communicator).

Does this refer to changing the pointers in c_coll to no longer point to a component/module?

We do have some refcount, but half of the modules are not implementing it.

Which refcount is that? (asking for encyclopedic reasons too). Is it something other than the object refcount (obj_reference_count)?

We also added the enable/disable callback but it has two issues: can only be triggered from the upper layer, and it must be called once per collective per module (basically a lot of unnecessary calls).

Do you refer to the fact that currently module->coll_module_disable is called multiple times for each module, once for each collective function that it provides, instead of just once per whole module?

Some collective modules (hcoll/ucc) have tried to address the issue by adding an attribute to communicators hoping the delete_fn will be called, but that is a requirement only for MPI_COMM_WORLD (except when MPI_COMM_FREE is called).

What does this refer to? I.e. what have they tried to address with this method? Properly freeing resources when the module/comm is destroyed?

@bosilca
Copy link
Member Author

bosilca commented Apr 20, 2024

Yes, this PR addresses the selection of individual collective function pointers for a communicator (for each communicator in fact).

In the original code, base was setting the function pointers, by iterating over all modules ordered in the inverse of their priority (which was obtained at a previous step when the query member was called for each component) and calling the enable member (once per module). At the end of enable all functions set on the module will then be transferred into the communicator and will replace the selection from the prior steps. During this stage the refcount for each module was updated accordingly (incremented when the module provided a function, and decremented when the module selection was replaced with a new one). Thus the complexity of the selection was significant with an upper bound of number of modules * total number of collective functions of setting pointers and twice that number of atomic updates. Because the functions were copies, at every step the communicator had a full selection of valid collective functions. As we interleaved this with call to enable for the next module, a module could have saved the prior state, allowing them to internally chain to each other (such that if a module was not providing support for some collective capability such as IN_PLACE it had a way to fall back onto some other module). For the unselecting step the operations were reversed, disable was called for each set functions and the refcount updated. Because of the fallback mechanism described above the complexity of this step was higher than the selection step, number of modules * total number of collective functions of calls to disable, set and unset of function pointers and refcount updates. Moreover, the enable and disable calls were not symmetric, enable was called once per module while disable was called once per function provided for each module.

The new code make enable and disable symmetric by transferring the responsibility of setting the individual collective function pointers to the module itself. Thus the select process become: the base will iterate over all module, ordered by the inverse of their priority obtained during the query step (as before), and then their enable will be called. During the enable call, the module will replace and/or chain to the previous module for whatever collective it wants. No need for refcount manipulation, as each module will remain valid for the entire existence of the communicator. Et voila, less calls, less refcount updates, less mess, smaller complexity. For unselect, we iterate over all module this time in the order of their priority, and call disable once per module, allowing the module to unchain itself for all collectives of interest. enable / disable are now symmetric, and will only be called once for each module (per communicator).

For your last point, because the refcount updates were not correctly done by many of the collective components it was impossible to properly release the collective modules data (even after the communicator was released). Thus, the resources used by the collective components were monotonically increasing with the total number of communicators in the application, and these resources were never released. To cope with this visible memory leakage, some collective components use the delete_fn of an attribute attached to MPI_COMM_WORLD, which is guaranteed by the MPI standard to be called early in MPI_Finalize (a little like a atexit but for MPI), and release all the component internal resources (including all allocated modules data). This will provide a clean slate for tools looking for memory leaks for the entire application, however it will not address the memory consumption during the execution of that application. With the new approach, each collective module can release all it's data at the end of the disable call, minimizing the memory use.

@gkatev
Copy link
Contributor

gkatev commented Apr 21, 2024

Thanks for the detailed response!

I concur on making disable be called only once per module. (some time back I opened #11126 for the same thing, which kind of got forgotten though)

Yes indeed as I recall from experiences with XHC, ref counting in its current form can be a bit slippery. I like the improvement of not retaining a module every time a function of it is installed in c_coll, or every time its kept as a fallback.

About releasing resources, I ask potentially ignorantly: can't those just be freed inside disable? For example in XHC the module destructor and disable do the same thing.

Finally, regarding installing the pointers, if you were looking for opinions here, I find the current method of coll/base doing it rather neat, and am not sure I see the immediate benefit from having the components do it instead. Is the rationale to give more control to the components? Of course yes it's not a big problem doing it in the component either..

Slightly off topic, but what has kinda always bugged me vis-a-vis handling c_coll pointers are fallbacks. Deep down I was/am hoping something could be implemented in coll/base to intuitively handle falling back to a previous module, instead of having to do all the bookkeeping and handling in the component. Anyway it's not like I have any plan in mind for something like this.. Just mentioning it in passing as it somewhat relates to base vs components having the responsibility of adjusting c_coll.

@bosilca
Copy link
Member Author

bosilca commented Apr 22, 2024

  1. can't [resources] just be freed inside disable: yes, but only if the refcounting of the module object is done properly. This should be fixed with this patch, each collective module gets a single refcount, and thus only need to be disabled once. This was the issue with coll/base: call coll_module_disable only once per module #11126.
  2. I also like the coll/base doing the function replacement, but it is expensive (detailed in an earlier comment) and breaks the flow. As a result, a module must assume that all the functions it provided have been selected and will be used. With the new approach, the module installs its own functions, so it knows exactly what will be in use. It also allows to unify the fallback and install mechanism as both can now be done in a single place, instead of having to be split between enable and coll/base.
  3. fallback in base could be neat but it would require each module to have twice the table of collective function pointers, once for what it provides and once for the fallback. This will almost double the memory needed for each collective module, thus double the memory for the collective management for each communicator. A significant memory use for very little benefit, especially as most of the collective modules only provide support for few collectives.

@gkatev
Copy link
Contributor

gkatev commented Apr 22, 2024

Thanks, ack!

I've also developed the necessary changes for this in XHC. Is there a known plan/timeframe for merging XHC and this? If XHC is merged first (it's good to go AFAIK), I can send over the patch to have it included in this PR.

@bosilca bosilca force-pushed the topic/fix_collective_init_fini branch 3 times, most recently from b3c22d9 to 91ddbb0 Compare April 29, 2024 22:19
Instead of allowing each collective module to present a list of
functions it provide, let them register the functions they provide and
save the context of the previous collective if they choose to.

There are two major benefits to this approach:
- tighter memory management in the collective module themselves. Each
  collective enable and disable is called exactly once per communicator,
  to chain or unchain themselves from the collective function pointers
  struct. The disable is called in the reverse order of the enable,
  allowing for proper chaining of collectives.
- they only install the functions they want. So instead of checking in
  the coll_select all the functions for all modules, each module can now
  selectively iterate over only the functions it provides.

What is still broken is the ability of a particular collective module to
unchain itself in the middle of the execution. Instead, a properly
implemented module will have an enable/disable flag, and it should act
as a passthrough if it chooses to desactivate.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@bosilca bosilca force-pushed the topic/fix_collective_init_fini branch from 91ddbb0 to f2dfbba Compare April 30, 2024 15:24
@bosilca bosilca requested a review from wenduwan April 30, 2024 15:33
@wenduwan
Copy link
Contributor

Passed AWS CI

@bosilca bosilca merged commit bf2068a into open-mpi:main May 1, 2024
13 checks passed
@bosilca bosilca deleted the topic/fix_collective_init_fini branch May 1, 2024 13:05
@janjust
Copy link
Contributor

janjust commented May 1, 2024

@bosilca can you cherry pick to v5.0?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants