Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ch4: enablement on platforms with 8+ nics #6689

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

wenduwan
Copy link

@wenduwan wenduwan commented Sep 22, 2023

Pull Request Description

Remove the 8-nic limit and enable MPICH to support high nic-count platforms

Ref: #6688

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your companies PR approval manager.

/* Initially sort the NICs by name. This way all intranode ranks have a consistent view. */
qsort(MPIDI_OFI_global.prov_use, MPIDI_OFI_global.num_nics, sizeof(struct fi_info *),
compare_nic_names);

if (nic_count > MPIDI_OFI_MAX_NICS) {
fprintf(stderr, "Warning: Detected %d NICs, but only %d(MPIDI_OFI_MAX_NICS) are considered.\n",
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please advise a better way to print warnings.

@wenduwan
Copy link
Author

@hzhou @raffenet Could you kindly review the change?

@@ -284,7 +284,7 @@ typedef struct {
} MPIDI_OFI_win_t;

/* Maximum number of network interfaces CH4 can support. */
#define MPIDI_OFI_MAX_NICS 8
#define MPIDI_OFI_MAX_NICS 128
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this macro and rely on MPIR_CVAR_CH4_OFI_MAX_NICS to control the number of active NICs to use. I think we should reset MPIR_CVAR_CH4_OFI_MAX_NICS default to 8.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hzhou Cool this is better. I can give it a go by next week.

I would also like second opinions from Aurora folks since they introduced the limit. Do you know a good contact?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhenggb72 @sagarth Some background on this PR. The 8 NIC limit is a blocker for AWS customers to use newer platforms with e.g. 32 NICs. We are exploring the possibility to raise or preferably remove the limit.

Do you see any downside of doing so?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like Hui's suggestion. We also went to keep the ability to change the number of NICs with a CVAR, in case we only want to use the NICs on first socket. Its an easier control.

@wenduwan Maybe you could set a machine specific value of this CVAR on your platforms?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hzhou After reading the usage of MPIDI_OFI_MAX_NICS closer, I think it requires a larger change to eliminate it.

I'm especially curious about these static counters

unsigned long long PVAR_COUNTER_nic_sent_bytes_count[MPIDI_OFI_MAX_NICS] ATTRIBUTE((unused));
unsigned long long PVAR_COUNTER_nic_recvd_bytes_count[MPIDI_OFI_MAX_NICS] ATTRIBUTE((unused));
unsigned long long PVAR_COUNTER_striped_nic_sent_bytes_count[MPIDI_OFI_MAX_NICS]
ATTRIBUTE((unused));
unsigned long long PVAR_COUNTER_striped_nic_recvd_bytes_count[MPIDI_OFI_MAX_NICS]
ATTRIBUTE((unused));
unsigned long long PVAR_COUNTER_rma_pref_phy_nic_put_bytes_count[MPIDI_OFI_MAX_NICS]
ATTRIBUTE((unused));
unsigned long long PVAR_COUNTER_rma_pref_phy_nic_get_bytes_count[MPIDI_OFI_MAX_NICS]
ATTRIBUTE((unused));

Do you think we can/should make them dynamically sized instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think they can be dynamically allocated

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tarudoodi Apologies I missed your comment last week. I have retained the CVAR after the change. Please let me know if that is sufficient for your purpose.

@hzhou
Copy link
Contributor

hzhou commented Oct 4, 2023

tag @nitbhat @tarudoodi for review

@wenduwan wenduwan marked this pull request as draft October 4, 2023 20:07
@wenduwan
Copy link
Author

wenduwan commented Oct 4, 2023

Converted change to draft for testing.

Comment on lines +90 to +122
struct fi_info *first_prov = NULL, *prev_prov = NULL, *cur_prov = NULL;
for (cur_prov = prov; cur_prov; cur_prov = cur_prov->next) {
/* additional filtering */
if (MPIR_CVAR_OFI_SKIP_IPV6 && p->addr_format == FI_SOCKADDR_IN6) {
if (MPIR_CVAR_OFI_SKIP_IPV6 && cur_prov->addr_format == FI_SOCKADDR_IN6) {
continue;
}
if (!MPIDI_OFI_nic_is_up(p)) {
if (!MPIDI_OFI_nic_is_up(cur_prov)) {
continue;
}
if (!first_prov) {
first_prov = p;
}
#ifdef HAVE_LIBFABRIC_NIC
/* check the nic */
struct fid_nic *nic = p->nic;
struct fid_nic *nic = cur_prov->nic;
if (nic && nic->bus_attr->bus_type == FI_BUS_PCI &&
!MPIDI_OFI_nic_already_used(p, MPIDI_OFI_global.prov_use, nic_count)) {
MPIDI_OFI_global.prov_use[nic_count] = fi_dupinfo(p);
MPIR_Assert(MPIDI_OFI_global.prov_use[nic_count]);
nic_count++;
if (nic_count == max_nics) {
break;
!MPIDI_OFI_nic_already_used(cur_prov, first_prov)) {
if (prev_prov) {
prev_prov->next = fi_dupinfo(cur_prov);
prev_prov = prev_prov->next;
} else {
first_prov = fi_dupinfo(cur_prov);
prev_prov = first_prov;
}
prev_prov->next = NULL;
++nic_count;
}
#endif
#else
/* Always pick the first provider if NIC info is not available */
if (!first_prov) {
first_prov = fi_dupinfo(cur_prov);
first_prov->next = NULL;
++nic_count;
break;
}
#endif /* HAVE_LIBFABRIC_NIC */
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change can be extracted to a separate commit if you like. I kept it in the big commit since git add -p does not allow me to split it out.

@wenduwan wenduwan marked this pull request as ready for review October 9, 2023 14:22
Comment on lines -290 to -294
#ifdef MPIDI_OFI_VNI_USE_DOMAIN
fi_addr_t dest[MPIDI_OFI_MAX_NICS][MPIDI_CH4_MAX_VCIS]; /* [nic][vci] */
#else
fi_addr_t dest[MPIDI_OFI_MAX_NICS][1];
#endif
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my update I chose to ignore MPIDI_OFI_VNI_USE_DOMAIN - instead I made dest to be always nics x MPIDI_CH4_MAX_VCIS. I believe MPIDI_CH4_MAX_VCIS by default is 1.

@wenduwan
Copy link
Author

wenduwan commented Oct 9, 2023

@hzhou Please take a look at the update. I followed your suggestion to deprecate MPIDI_OFI_MAX_NICS. As a result I had to change a few static variables and data structures to make their size and memory dynamic. I tried to make sure the memory is alloced and freed properly.

For my testing, I ran the osu microbenchmarks on 2 different platforms with ./configure ... --enable-g=all

  1. hpc6a.48xlarge with 1 NIC
  2. p5.48xlarge with 32 NICs

I added prints and verified that the code selects the right nic for each rank, and MPIR_CVAR_CH4_OFI_MAX_NICS can be used to restrict NIC visibility.

Please let me know your thoughts.

@@ -399,7 +399,7 @@ typedef struct {
/* Process management and PMI globals */
int pname_set;
int pname_len;
char addrname[MPIDI_OFI_MAX_NICS][FI_NAME_MAX];
char addrname[FI_NAME_MAX];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to account for the new num_nics here too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tarudoodi I don't think so. This change is in its own commit. addrname is currently used as a singular char array/string, and I think it was defined as a 2-d array by mistake.

@tarudoodi
Copy link
Contributor

@wenduwan The PR looks ok to me. It keeps the usage that we had intended too. Please run the tests to make sure the testsuite passes.

Move MPIDI_av_table decl to mpidpre.h so that it has the same visibility
as MPIDI_av_entry

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
MPIDI_OFI_global_t.addrname is used as a single char array, no need to
make it a 2-D matrix.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
This patch removes the MPIDI_OFI_MAX_NICS variable which restricts the
max number of NICs that can be used by MPICH.

This change requires changing usages of MPIDI_OFI_MAX_NICS and
converting statically allocated arrays to dynamic arrays. As a result,
malloc and free procredures are added to init and finalize hooks.

This patch retains the current behavior of MPIR_CVAR_CH4_OFI_MAX_NICS,
which by default uses all NICs available from the OFI provider; the
application can restrict the NIC count by setting this CVAR.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan
Copy link
Author

@wenduwan The PR looks ok to me. It keeps the usage that we had intended too. Please run the tests to make sure the testsuite passes.

@tarudoodi Thank you for looking! I have applied the whitespace diff and updated the PR.

@nitbhat
Copy link
Contributor

nitbhat commented Oct 17, 2023

tag @nitbhat @tarudoodi for review

Sorry the delay in getting back on this. Reviewed, the changes look good.

Do the CI tests exercise multi-nic code? Otherwise, it'll be good to manually test on some major multi-nic platforms to ensure that nic assignment looks good. (Platforms could be: aws instances, Aurora, Infiniband systems with multiple nics).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants