You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am trying to create an empty av set of collective operations, where I will later add remote addresses (which are more complex than (start_addr..end_addr).step_by(stride) )
I am setting an av_set_attr and trying to create the av set as follows:
struct fi_av_set_attr av_set_attr = {0};
av_set_attr.count = num_remote_nodes;
av_set_attr.start_addr = FI_ADDR_NOTAVAIL;
av_set_attr.end_addr = FI_ADDR_NOTAVAIL;
av_set_attr.stride = 0;
av_set_attr.flags = 0;
struct fid_av_set *av_set;
int ret = fi_av_set(av, &av_set_attr, &av_set, NULL); //av initialized else where
if (ret) {
printf("error: %s\n",fi_strerror(ret));
}
for (int i=0;i < num_nodes; i++){
ret = fi_av_set_insert(av_set,remote_addrs[node_map[i]]); //node_maps takes a "collective" node id and maps it to its world id
...
}
This returns the following error: fi_av_st failed: Cannot allocate memory (-12)
Digging into this a little deeper it appears that in
for (i=attr->start_addr; i <= attr->end_addr; i+=attr->stride) {
The for loop is incorrectly entered because start_addr = end_addr = FI_ADDR_NOTAVAIL , and then never exits the loop because stride=0
A simple check before the loop:
if (attr->start_addr != FI_ADDR_NOTAVAIL && attr->end_addr != FI_ADDR_NOTAVAIL) {
for (i = attr->start_addr; i <= attr->end_addr; i += attr->stride) {
if (av_set->fi_addr_count >= av_set->max_array_size)
goto err3;
av_set->fi_addr_array[av_set->fi_addr_count++] = i;
}
}
fixed this for me.
Alternatively for anyone looking for a workaround at the application layer, you can simple create the set with the address of the root node then add in the other addresses after
struct fi_av_set_attr av_set_attr = {0};
av_set_attr.count = num_remote_nodes;
av_set_attr.start_addr = remote_addrs[node_map[0]];
av_set_attr.end_addr = remote_addrs[node_map[0]];
av_set_attr.stride = 1;
av_set_attr.flags = 0;
struct fid_av_set *av_set;
int ret = fi_av_set(av, &av_set_attr, &av_set, NULL); //av initialized else where
if (ret) {
printf("error: %s\n",fi_strerror(ret));
}
for (int i=1;i < num_nodes; i++){ //Notice we now start at 1 instead of 0
ret = fi_av_set_insert(av_set,remote_addrs[node_map[i]]); //node_maps takes a "collective" node id and maps it to its world id
...
}
While this isn't necessarily a blocking bug, it does seem to be at odds with what the documentation says for fi_av_set (or I am misinterpreting the documentation) namely:
fi_av_set
start_addr / end_addr
The starting and ending addresses, inclusive, to include as part of the AV set. The use of start and end address require that the associated AV have been created as type FI_AV_TABLE. Valid addresses in the AV which fall within the specified range and which meet other requirements (such as stride) will be added as initial members to the AV set. The start_addr and end_addr must be set to FI_ADDR_NOTAVAIL if creating an empty AV set, a communication key is being provided, or the AV is of type FI_AV_MAP.
The number of addresses between start_addr and end_addr must be less than or equal to the specified count value.
stride
The number of entries between successive addresses included in the AV set. The AV set will include all addresses from start_addr + stride x i, for increasing, non-negative, integer values of i, up to end_addr. A stride of 1 indicates that all addresses between start_addr and end_addr should be added to the AV set. Stride should be set to 0 unless the start_addr and end_addr fields are valid
Based on the emphasized portion of the docs above is what lead to my initial attempt.
Thanks!
-Ryan
Environment:
Linux, verbs;ofi_rxm, v1.20
The text was updated successfully, but these errors were encountered:
Thanks for reporting this. I confirm this is a bug. Some checking for the parameters is needed in order to properly handle FI_ADDR_NOTAVAIL, including the case for empty set and cases for invalid combinations.
I am trying to create an empty av set of collective operations, where I will later add remote addresses (which are more complex than
(start_addr..end_addr).step_by(stride)
)I am setting an av_set_attr and trying to create the av set as follows:
This returns the following error:
fi_av_st failed: Cannot allocate memory (-12)
Digging into this a little deeper it appears that in
libfabric/prov/coll/src/coll_av_set.c
Line 238 in 39e35c0
The for loop is incorrectly entered because
start_addr = end_addr = FI_ADDR_NOTAVAIL
, and then never exits the loop becausestride=0
A simple check before the loop:
fixed this for me.
Alternatively for anyone looking for a workaround at the application layer, you can simple create the set with the address of the root node then add in the other addresses after
While this isn't necessarily a blocking bug, it does seem to be at odds with what the documentation says for
fi_av_set
(or I am misinterpreting the documentation) namely:fi_av_set
start_addr / end_addr
The starting and ending addresses, inclusive, to include as part of the AV set. The use of start and end address require that the associated AV have been created as type FI_AV_TABLE. Valid addresses in the AV which fall within the specified range and which meet other requirements (such as stride) will be added as initial members to the AV set. The start_addr and end_addr must be set to FI_ADDR_NOTAVAIL if creating an empty AV set, a communication key is being provided, or the AV is of type FI_AV_MAP.
The number of addresses between start_addr and end_addr must be less than or equal to the specified count value.
stride
The number of entries between successive addresses included in the AV set. The AV set will include all addresses from start_addr + stride x i, for increasing, non-negative, integer values of i, up to end_addr. A stride of 1 indicates that all addresses between start_addr and end_addr should be added to the AV set. Stride should be set to 0 unless the start_addr and end_addr fields are valid
Based on the emphasized portion of the docs above is what lead to my initial attempt.
Thanks!
-Ryan
Environment:
Linux, verbs;ofi_rxm, v1.20
The text was updated successfully, but these errors were encountered: