New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix collective modules initialization and finalization #12429
Conversation
@bosilca Is this a bugfix or enhancement? Asking because the title says 'fix' so I want to understand what is exactly broken. |
It's both a fix of a lingering issue around collective module init/fini and an enhancement to provide a clean way to do it. Currently, there is no proper way to release the communicator collective module (not the module itself, but the part it attaches to each communicator). We do have some refcount, but half of the modules are not implementing it. We also added the enable/disable callback but it has two issues: can only be triggered from the upper layer, and it must be called once per collective per module (basically a lot of unnecessary calls). Some collective modules (hcoll/ucc) have tried to address the issue by adding an attribute to communicators hoping the |
047d43a
to
c03b06c
Compare
FWIW AWS CI failed on the 1st revision yesterday with error
I'm running it again |
I'm using |
c03b06c
to
8335f06
Compare
@bosilca I made the change to the new HAN gatherv/scatterv files and force pushed to your branch |
Thanks. I need to repush, I forgot to add my copyright info. |
200379e
to
ebcc4e0
Compare
I'm seeing the same errors complaining none of the coll components can be used. |
Can you run with |
|
I don't think this output is with verbose set to 50 as it does not contain any information about the collective replacement. |
I should mention that I got this error from osu_latency_mt(and many others). I see the mca is correct from log
|
3153988
to
3333e1f
Compare
I found the issue, HAN was overwriting the allgatherv in all cases, even when HAN did not support the allgatherv (in which case it was overwriting it with NULL). @wenduwan please retest again. |
From what I understand this PR, passes on the task of setting and unsetting (installing/uninstalling) the pointers in Could you elaborate a little bit on the extra possibilities that this offers? (at the moment or in the future)
Does this refer to changing the pointers in c_coll to no longer point to a component/module?
Which refcount is that? (asking for encyclopedic reasons too). Is it something other than the object refcount (
Do you refer to the fact that currently
What does this refer to? I.e. what have they tried to address with this method? Properly freeing resources when the module/comm is destroyed? |
Yes, this PR addresses the selection of individual collective function pointers for a communicator (for each communicator in fact). In the original code, base was setting the function pointers, by iterating over all modules ordered in the inverse of their priority (which was obtained at a previous step when the The new code make For your last point, because the refcount updates were not correctly done by many of the collective components it was impossible to properly release the collective modules data (even after the communicator was released). Thus, the resources used by the collective components were monotonically increasing with the total number of communicators in the application, and these resources were never released. To cope with this visible memory leakage, some collective components use the delete_fn of an attribute attached to |
Thanks for the detailed response! I concur on making Yes indeed as I recall from experiences with XHC, ref counting in its current form can be a bit slippery. I like the improvement of not retaining a module every time a function of it is installed in c_coll, or every time its kept as a fallback. About releasing resources, I ask potentially ignorantly: can't those just be freed inside Finally, regarding installing the pointers, if you were looking for opinions here, I find the current method of coll/base doing it rather neat, and am not sure I see the immediate benefit from having the components do it instead. Is the rationale to give more control to the components? Of course yes it's not a big problem doing it in the component either.. Slightly off topic, but what has kinda always bugged me vis-a-vis handling |
|
Thanks, ack! I've also developed the necessary changes for this in XHC. Is there a known plan/timeframe for merging XHC and this? If XHC is merged first (it's good to go AFAIK), I can send over the patch to have it included in this PR. |
b3c22d9
to
91ddbb0
Compare
Instead of allowing each collective module to present a list of functions it provide, let them register the functions they provide and save the context of the previous collective if they choose to. There are two major benefits to this approach: - tighter memory management in the collective module themselves. Each collective enable and disable is called exactly once per communicator, to chain or unchain themselves from the collective function pointers struct. The disable is called in the reverse order of the enable, allowing for proper chaining of collectives. - they only install the functions they want. So instead of checking in the coll_select all the functions for all modules, each module can now selectively iterate over only the functions it provides. What is still broken is the ability of a particular collective module to unchain itself in the middle of the execution. Instead, a properly implemented module will have an enable/disable flag, and it should act as a passthrough if it chooses to desactivate. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
91ddbb0
to
f2dfbba
Compare
Passed AWS CI |
@bosilca can you cherry pick to v5.0? |
This PR looks much bigger than it actually is. It unfortunately touches the initialization/finalization of all collective modules, but the changes are minimal and cleaner and the API does not change. Below a more detailed description.
Instead of allowing each collective module to present a list of functions it provide, let them register the functions they provide and save the context of the previous collective if they choose to.
There are two major benefits to this approach:
What is still broken is the ability of a particular collective module to unchain itself in the middle of the execution. Instead, a properly implemented module will have an enable/disable flag, and it should act as a passthrough if it chooses to deactivate.