Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fi_av_set/coll_av_set: error when trying to create an empty set #9738

Open
rdfriese opened this issue Jan 17, 2024 · 1 comment
Open

fi_av_set/coll_av_set: error when trying to create an empty set #9738

rdfriese opened this issue Jan 17, 2024 · 1 comment

Comments

@rdfriese
Copy link

I am trying to create an empty av set of collective operations, where I will later add remote addresses (which are more complex than (start_addr..end_addr).step_by(stride) )

I am setting an av_set_attr and trying to create the av set as follows:

struct fi_av_set_attr av_set_attr = {0};
av_set_attr.count = num_remote_nodes;
av_set_attr.start_addr = FI_ADDR_NOTAVAIL;
av_set_attr.end_addr = FI_ADDR_NOTAVAIL;
av_set_attr.stride = 0;
av_set_attr.flags = 0;

struct fid_av_set *av_set;
int ret = fi_av_set(av, &av_set_attr, &av_set, NULL); //av initialized else where
if (ret) {
  printf("error: %s\n",fi_strerror(ret));
}

for (int i=0;i < num_nodes; i++){
  ret = fi_av_set_insert(av_set,remote_addrs[node_map[i]]); //node_maps takes a "collective" node id and maps it to its world id
  ...
}

This returns the following error: fi_av_st failed: Cannot allocate memory (-12)

Digging into this a little deeper it appears that in

for (i = attr->start_addr; i <= attr->end_addr; i += attr->stride) {

The for loop is incorrectly entered because start_addr = end_addr = FI_ADDR_NOTAVAIL , and then never exits the loop because stride=0

A simple check before the loop:

if (attr->start_addr != FI_ADDR_NOTAVAIL && attr->end_addr != FI_ADDR_NOTAVAIL) {
        for (i = attr->start_addr; i <= attr->end_addr; i += attr->stride) {
            if (av_set->fi_addr_count >= av_set->max_array_size)
                goto err3;

            av_set->fi_addr_array[av_set->fi_addr_count++] = i;
        }
    }

fixed this for me.

Alternatively for anyone looking for a workaround at the application layer, you can simple create the set with the address of the root node then add in the other addresses after

struct fi_av_set_attr av_set_attr = {0};
av_set_attr.count = num_remote_nodes;
av_set_attr.start_addr = remote_addrs[node_map[0]];
av_set_attr.end_addr = remote_addrs[node_map[0]];
av_set_attr.stride = 1;
av_set_attr.flags = 0;

struct fid_av_set *av_set;
int ret = fi_av_set(av, &av_set_attr, &av_set, NULL); //av initialized else where
if (ret) {
  printf("error: %s\n",fi_strerror(ret));
}

for (int i=1;i < num_nodes; i++){ //Notice we now start at 1 instead of 0
  ret = fi_av_set_insert(av_set,remote_addrs[node_map[i]]); //node_maps takes a "collective" node id and maps it to its world id
  ...
}

While this isn't necessarily a blocking bug, it does seem to be at odds with what the documentation says for fi_av_set (or I am misinterpreting the documentation) namely:

fi_av_set
start_addr / end_addr
The starting and ending addresses, inclusive, to include as part of the AV set. The use of start and end address require that the associated AV have been created as type FI_AV_TABLE. Valid addresses in the AV which fall within the specified range and which meet other requirements (such as stride) will be added as initial members to the AV set. The start_addr and end_addr must be set to FI_ADDR_NOTAVAIL if creating an empty AV set, a communication key is being provided, or the AV is of type FI_AV_MAP.
The number of addresses between start_addr and end_addr must be less than or equal to the specified count value.

stride
The number of entries between successive addresses included in the AV set. The AV set will include all addresses from start_addr + stride x i, for increasing, non-negative, integer values of i, up to end_addr. A stride of 1 indicates that all addresses between start_addr and end_addr should be added to the AV set. Stride should be set to 0 unless the start_addr and end_addr fields are valid

Based on the emphasized portion of the docs above is what lead to my initial attempt.

Thanks!
-Ryan

Environment:
Linux, verbs;ofi_rxm, v1.20

@rdfriese rdfriese added the bug label Jan 17, 2024
@j-xiong
Copy link
Contributor

j-xiong commented Jan 20, 2024

Thanks for reporting this. I confirm this is a bug. Some checking for the parameters is needed in order to properly handle FI_ADDR_NOTAVAIL, including the case for empty set and cases for invalid combinations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants