New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ch4: enablement on platforms with 8+ nics #6689
base: main
Are you sure you want to change the base?
Conversation
src/mpid/ch4/netmod/ofi/ofi_nic.c
Outdated
/* Initially sort the NICs by name. This way all intranode ranks have a consistent view. */ | ||
qsort(MPIDI_OFI_global.prov_use, MPIDI_OFI_global.num_nics, sizeof(struct fi_info *), | ||
compare_nic_names); | ||
|
||
if (nic_count > MPIDI_OFI_MAX_NICS) { | ||
fprintf(stderr, "Warning: Detected %d NICs, but only %d(MPIDI_OFI_MAX_NICS) are considered.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please advise a better way to print warnings.
919ff2c
to
942f58e
Compare
src/mpid/ch4/netmod/ofi/ofi_pre.h
Outdated
@@ -284,7 +284,7 @@ typedef struct { | |||
} MPIDI_OFI_win_t; | |||
|
|||
/* Maximum number of network interfaces CH4 can support. */ | |||
#define MPIDI_OFI_MAX_NICS 8 | |||
#define MPIDI_OFI_MAX_NICS 128 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this macro and rely on MPIR_CVAR_CH4_OFI_MAX_NICS
to control the number of active NICs to use. I think we should reset MPIR_CVAR_CH4_OFI_MAX_NICS
default to 8
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hzhou Cool this is better. I can give it a go by next week.
I would also like second opinions from Aurora folks since they introduced the limit. Do you know a good contact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhenggb72 @sagarth Some background on this PR. The 8 NIC limit is a blocker for AWS customers to use newer platforms with e.g. 32 NICs. We are exploring the possibility to raise or preferably remove the limit.
Do you see any downside of doing so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do like Hui's suggestion. We also went to keep the ability to change the number of NICs with a CVAR, in case we only want to use the NICs on first socket. Its an easier control.
@wenduwan Maybe you could set a machine specific value of this CVAR on your platforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hzhou After reading the usage of MPIDI_OFI_MAX_NICS
closer, I think it requires a larger change to eliminate it.
I'm especially curious about these static counters
mpich/src/mpid/ch4/netmod/ofi/globals.c
Lines 10 to 19 in 69522c2
unsigned long long PVAR_COUNTER_nic_sent_bytes_count[MPIDI_OFI_MAX_NICS] ATTRIBUTE((unused)); | |
unsigned long long PVAR_COUNTER_nic_recvd_bytes_count[MPIDI_OFI_MAX_NICS] ATTRIBUTE((unused)); | |
unsigned long long PVAR_COUNTER_striped_nic_sent_bytes_count[MPIDI_OFI_MAX_NICS] | |
ATTRIBUTE((unused)); | |
unsigned long long PVAR_COUNTER_striped_nic_recvd_bytes_count[MPIDI_OFI_MAX_NICS] | |
ATTRIBUTE((unused)); | |
unsigned long long PVAR_COUNTER_rma_pref_phy_nic_put_bytes_count[MPIDI_OFI_MAX_NICS] | |
ATTRIBUTE((unused)); | |
unsigned long long PVAR_COUNTER_rma_pref_phy_nic_get_bytes_count[MPIDI_OFI_MAX_NICS] | |
ATTRIBUTE((unused)); |
Do you think we can/should make them dynamically sized instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think they can be dynamically allocated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tarudoodi Apologies I missed your comment last week. I have retained the CVAR after the change. Please let me know if that is sufficient for your purpose.
tag @nitbhat @tarudoodi for review |
Converted change to draft for testing. |
struct fi_info *first_prov = NULL, *prev_prov = NULL, *cur_prov = NULL; | ||
for (cur_prov = prov; cur_prov; cur_prov = cur_prov->next) { | ||
/* additional filtering */ | ||
if (MPIR_CVAR_OFI_SKIP_IPV6 && p->addr_format == FI_SOCKADDR_IN6) { | ||
if (MPIR_CVAR_OFI_SKIP_IPV6 && cur_prov->addr_format == FI_SOCKADDR_IN6) { | ||
continue; | ||
} | ||
if (!MPIDI_OFI_nic_is_up(p)) { | ||
if (!MPIDI_OFI_nic_is_up(cur_prov)) { | ||
continue; | ||
} | ||
if (!first_prov) { | ||
first_prov = p; | ||
} | ||
#ifdef HAVE_LIBFABRIC_NIC | ||
/* check the nic */ | ||
struct fid_nic *nic = p->nic; | ||
struct fid_nic *nic = cur_prov->nic; | ||
if (nic && nic->bus_attr->bus_type == FI_BUS_PCI && | ||
!MPIDI_OFI_nic_already_used(p, MPIDI_OFI_global.prov_use, nic_count)) { | ||
MPIDI_OFI_global.prov_use[nic_count] = fi_dupinfo(p); | ||
MPIR_Assert(MPIDI_OFI_global.prov_use[nic_count]); | ||
nic_count++; | ||
if (nic_count == max_nics) { | ||
break; | ||
!MPIDI_OFI_nic_already_used(cur_prov, first_prov)) { | ||
if (prev_prov) { | ||
prev_prov->next = fi_dupinfo(cur_prov); | ||
prev_prov = prev_prov->next; | ||
} else { | ||
first_prov = fi_dupinfo(cur_prov); | ||
prev_prov = first_prov; | ||
} | ||
prev_prov->next = NULL; | ||
++nic_count; | ||
} | ||
#endif | ||
#else | ||
/* Always pick the first provider if NIC info is not available */ | ||
if (!first_prov) { | ||
first_prov = fi_dupinfo(cur_prov); | ||
first_prov->next = NULL; | ||
++nic_count; | ||
break; | ||
} | ||
#endif /* HAVE_LIBFABRIC_NIC */ | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change can be extracted to a separate commit if you like. I kept it in the big commit since git add -p
does not allow me to split it out.
#ifdef MPIDI_OFI_VNI_USE_DOMAIN | ||
fi_addr_t dest[MPIDI_OFI_MAX_NICS][MPIDI_CH4_MAX_VCIS]; /* [nic][vci] */ | ||
#else | ||
fi_addr_t dest[MPIDI_OFI_MAX_NICS][1]; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my update I chose to ignore MPIDI_OFI_VNI_USE_DOMAIN
- instead I made dest
to be always nics
x MPIDI_CH4_MAX_VCIS
. I believe MPIDI_CH4_MAX_VCIS
by default is 1.
@hzhou Please take a look at the update. I followed your suggestion to deprecate For my testing, I ran the osu microbenchmarks on 2 different platforms with
I added prints and verified that the code selects the right nic for each rank, and Please let me know your thoughts. |
@@ -399,7 +399,7 @@ typedef struct { | |||
/* Process management and PMI globals */ | |||
int pname_set; | |||
int pname_len; | |||
char addrname[MPIDI_OFI_MAX_NICS][FI_NAME_MAX]; | |||
char addrname[FI_NAME_MAX]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to account for the new num_nics here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tarudoodi I don't think so. This change is in its own commit. addrname
is currently used as a singular char array/string, and I think it was defined as a 2-d array by mistake.
@wenduwan The PR looks ok to me. It keeps the usage that we had intended too. Please run the tests to make sure the testsuite passes. |
Move MPIDI_av_table decl to mpidpre.h so that it has the same visibility as MPIDI_av_entry Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
MPIDI_OFI_global_t.addrname is used as a single char array, no need to make it a 2-D matrix. Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
This patch removes the MPIDI_OFI_MAX_NICS variable which restricts the max number of NICs that can be used by MPICH. This change requires changing usages of MPIDI_OFI_MAX_NICS and converting statically allocated arrays to dynamic arrays. As a result, malloc and free procredures are added to init and finalize hooks. This patch retains the current behavior of MPIR_CVAR_CH4_OFI_MAX_NICS, which by default uses all NICs available from the OFI provider; the application can restrict the NIC count by setting this CVAR. Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
b4fbfcf
to
22ba834
Compare
@tarudoodi Thank you for looking! I have applied the whitespace diff and updated the PR. |
Sorry the delay in getting back on this. Reviewed, the changes look good. Do the CI tests exercise multi-nic code? Otherwise, it'll be good to manually test on some major multi-nic platforms to ensure that nic assignment looks good. (Platforms could be: aws instances, Aurora, Infiniband systems with multiple nics). |
Pull Request Description
Remove the 8-nic limit and enable MPICH to support high nic-count platforms
Ref: #6688
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.