
UCP/WIREUP: Increase UCP_MAX_LANES to 64 #9814

Open
wants to merge 2 commits into base: master
Conversation

ivankochin
Contributor

What

Increase the maximum number of lanes per EP to 64.

Why?

To allow collecting information about all lanes on systems with many transports/devices.

@ivankochin ivankochin self-assigned this Apr 11, 2024
@edgargabriel
Contributor

Would it make sense to have this parameter as a configure option?

@yosefe
Contributor

yosefe commented Apr 15, 2024

Would it make sense to have this parameter as a configure option?

I think not; we should be able to find a solution that sets the upper limit to 64 without extra overhead when the actual number of lanes is small.

src/ucs/sys/compiler_def.h (outdated, resolved)
@ivankochin ivankochin marked this pull request as draft April 15, 2024 10:14
Comment on lines 103 to 101

ucs_assertv(rpriv->spriv.super.lane != UCP_NULL_LANE, "lane=%d",
rpriv->spriv.super.lane);
attr->lane_map = UCS_BIT(rpriv->spriv.super.lane);
if (rpriv->ack.lane != UCP_NULL_LANE) {
attr->lane_map |= UCS_BIT(rpriv->ack.lane);
}
Contributor Author

This change is required because ack.lane can be equal to 255 (UCP_NULL_LANE), so after increasing the lane_map size to 64 bits it started setting the 63rd bit in lane_map.
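A minimal sketch of the guard (UCS_BIT and UCP_NULL_LANE follow the UCX headers; lane_map_set is a hypothetical helper for illustration, not UCX API):

```c
#include <assert.h>
#include <stdint.h>

#define UCS_BIT(_i)   (1ull << (_i))
#define UCP_NULL_LANE ((uint8_t)255)

/* Set a lane's bit only when the lane is valid. UCS_BIT(UCP_NULL_LANE)
 * would shift by 255 >= 64, which is undefined behavior; on x86 the
 * shift count happens to be taken modulo 64 (255 % 64 == 63), which is
 * how the 63rd bit got set before the guard was added. */
static uint64_t lane_map_set(uint64_t map, uint8_t lane)
{
    if (lane != UCP_NULL_LANE) {
        map |= UCS_BIT(lane);
    }
    return map;
}
```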

Contributor

why did it set the 63rd bit?

Contributor

I think (1ul << 255) is undefined behavior anyway, since 255 >= 64.

Contributor Author

why did it set the 63rd bit?

To be honest, I don't know. I just observed this behavior, and when I applied this change it went away. Do you want me to check it?

Contributor

I was experimenting with a similar case and got an unpredictable result; I guess this should be OK.

@ivankochin ivankochin marked this pull request as ready for review April 19, 2024 06:32
src/ucp/wireup/select.c (outdated, resolved)
src/ucp/core/ucp_ep.inl (resolved)
src/ucs/sys/compiler_def.h (outdated, resolved)
src/ucp/core/ucp_ep.inl (resolved)
src/ucp/proto/proto.h (outdated, resolved)
src/ucp/rndv/rndv_rkey_ptr.c (outdated, resolved)
@@ -57,7 +57,7 @@ UCS_TEST_F(test_obj_size, size) {
EXPECTED_SIZE(ucp_rkey_t, 32);
#endif
/* TODO reduce request size to 240 or less after removing old protocols state */
-EXPECTED_SIZE(ucp_request_t, 264);
+EXPECTED_SIZE(ucp_request_t, 280);
Contributor

can we avoid that?

Contributor Author

I think we could consider increasing the number of lanes to 32 instead of 64, or allowing creation of 64 lanes only for wireup while restricting the maximum number of lanes to 32 or 16 for RNDV protocols. Which sounds better to you?

Contributor

Both sound OK. Maybe it is better to restrict protocols to use up to 16 lanes? Then there is no need to keep large maps in the request.

Contributor

PR #9862 should address it.

Contributor Author

#9862 is not enough, also #9864 is required

Contributor Author

request size was rolled back.

Comment on lines 1463 to 1464
ASSERT_LE(16, (int)ucp_ep_num_lanes(sender().ep()));
ASSERT_LE(16, (int)ucp_ep_num_lanes(receiver().ep()));
Contributor

why?

Contributor Author

Because some platforms can create more than 16 lanes in that particular case.

Contributor

I'd suggest changing this test or writing a new one which tests the creation of 64 lanes.

Contributor Author

Done.

@@ -1247,19 +1247,19 @@ UCS_TEST_P(multi_rail_max, max_lanes, "IB_NUM_PATHS?=16", "TM_SW_RNDV=y",

ucp_lane_index_t num_lanes = ucp_ep_num_lanes(sender().ep());
ASSERT_EQ(ucp_ep_num_lanes(receiver().ep()), num_lanes);
-    ASSERT_EQ(num_lanes, max_lanes);
+    ASSERT_GE(num_lanes, max_lanes);
Contributor

why?

Contributor Author

Because some platforms can create more than 16 lanes in that particular case.

Contributor

Yep, we should try creating 64 lanes.

Contributor Author

Done.

Contributor

need to revert now?

Contributor Author

Done.

@@ -239,7 +239,7 @@ static inline ucp_rsc_index_t
ucp_ep_config_get_dst_md_cmpt(const ucp_ep_config_key_t *key,
ucp_md_index_t dst_md_index)
{
-    unsigned idx = ucs_popcount(key->reachable_md_map & UCS_MASK(dst_md_index));
+    unsigned idx = ucs_bitmap2idx(key->reachable_md_map, dst_md_index);
Contributor

unrelated fix?

Contributor Author

Yes, a little :)

I just discovered that there is a helper function for that logic, so I decided to include it in the patch.
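For reference, the replaced expression and the helper compute the same thing. A sketch of that logic (bitmap2idx_sketch is illustrative; it uses a GCC/Clang builtin in place of ucs_popcount, and the real ucs_bitmap2idx signature is assumed from the diff):

```c
#include <assert.h>
#include <stdint.h>

/* Index of a bit among the set bits of a bitmap: count the set bits
 * strictly below bit_index, i.e. popcount(bitmap & MASK(bit_index)). */
static unsigned bitmap2idx_sketch(uint64_t bitmap, unsigned bit_index)
{
    uint64_t mask = (bit_index >= 64) ? ~0ull : ((1ull << bit_index) - 1);
    return (unsigned)__builtin_popcountll(bitmap & mask);
}
```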


tvegas1 previously approved these changes May 1, 2024
@@ -129,6 +130,7 @@ class mem_buffer {
size_t length, size_t offset,
const void *buffer, const void *orig_ptr)
{
ucs_assertv(length <= 8, "length=%zu", length);
Contributor

why?

Contributor Author

Because the new implementation of UCS_MASK still triggers undefined behavior if the input value is greater than 64. In this particular case, length is the size of one element that should be compared.
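The relation between length and the mask width can be sketched like this (byte_mask is a hypothetical helper for illustration, not the actual gtest code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Building a mask that covers `length` bytes needs a shift by
 * length * 8 bits; for length > 8 that shift reaches 64 or more,
 * which is undefined behavior on a 64-bit operand, hence the assert
 * on length. */
static uint64_t byte_mask(size_t length)
{
    assert(length <= 8);
    return (length == 8) ? ~0ull : ((1ull << (length * 8)) - 1);
}
```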

Contributor

maybe we can define UCS_MASK the same way UCS_MASK_SAFE is defined, and then there is no need for this..

Contributor Author

If we want to use a safe mask implementation for all type sizes, we need to define a separate macro for each type (like `UCS_MASK8`, `UCS_MASK16`, `UCS_MASK32`, etc.), and each of those macros would be capped by the full mask for its type size.

Otherwise, with one common UCS_MASK macro there would always be places where a too-large value can be passed as the input parameter, causing undefined behavior. The old UCS_MASK_SAFE definition can work only with 64-bit output values, since it issues a compilation error for smaller ones.

Contributor

can we make UCS_MASK always 64-bit, so that if the result is cast to a smaller type it will still be OK?

Contributor

can we make UCS_MASK always 64-bit, so that if the result is cast to a smaller type it will still be OK?

Can you show an example of the compilation error?

Contributor Author

Here is an example of the compilation error on Ubuntu 20.04 for the macro implementation with UINT64_MAX:

/__w/2/s/contrib/../src/uct/ib/mlx5/ib_mlx5.c:218:26: error: implicit conversion from 'unsigned long' to 'unsigned int' changes value from 18446744073709551615 to 4294967295 [-Werror,-Wconstant-conversion]
    cq->cq_length_mask = UCS_MASK(cq->cq_length_log);
                       ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/__w/2/s/contrib/../src/ucs/sys/compiler_def.h:83:38: note: expanded from macro 'UCS_MASK'
#define UCS_MASK(_i) (((_i) >= 64) ? 0xFFFFFFFFFFFFFFFF : (UCS_BIT(_i) - 1))
                                    ^~~~~~~~~~~~~~~~~~

I will try to implement it through an inline function and see whether the error still happens.

But anyway, from my point of view we are walking on very thin ice here: even if it works in our CI, it could break something on some tricky platform. So I would prefer to use separate implementations for UCS_MASK and UCS_MASK64.

Contributor

would using `#define UCS_MASK(_i) (((_i) >= 64) ? ~0 : (UCS_BIT(_i) - 1))` fix all cases?

Contributor Author

@tvegas1 thanks for the suggestion!

I checked the C standard, and here is what it says about the conditional operator:

If both the second and third operands have arithmetic type, the result type that would be
determined by the usual arithmetic conversions, were they applied to those two operands,
is the type of the result. If both the operands have structure or union type, the result has
that type. If both operands have void type, the result has void type.

So if I understand that correctly, it means that ~0 will be converted to the type of the left operand of the assignment.
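A quick sanity check of the suggested definition (a sketch, not the final UCX macro; it assumes LP64, i.e. a 64-bit unsigned long, and the explicit casts in the wrappers make the narrowing visible, while the thread's open question is about the implicit case):

```c
#include <assert.h>
#include <stdint.h>

#define UCS_BIT(_i)  (1ul << (_i))
/* tvegas1's suggestion: ~0 has type int (all bits set); by the usual
 * arithmetic conversions the conditional's type is unsigned long, and
 * converting the all-ones value to a narrower unsigned type yields the
 * full mask for that type. */
#define UCS_MASK(_i) (((_i) >= 64) ? ~0 : (UCS_BIT(_i) - 1))

static uint32_t mask32(unsigned i) { return (uint32_t)UCS_MASK(i); }
static uint64_t mask64(unsigned i) { return (uint64_t)UCS_MASK(i); }
```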

src/ucp/rma/flush.c (outdated, resolved)
src/ucp/rma/flush.c (outdated, resolved)
src/ucp/rma/flush.c (outdated, resolved)
src/ucs/sys/compiler_def.h (outdated, resolved)



@@ -25,12 +25,12 @@ ucp_ep_flush_request_update_uct_comp(ucp_request_t *req, int diff,
&req->send.state.uct_comp, req->send.state.uct_comp.count,
diff);
ucs_assertv(!(req->send.flush.started_lanes & new_started_lanes),
-            "req=%p started_lanes=0x%zx new_started_lanes=0x%zx", req,
-            req->send.flush.started_lanes, new_started_lanes);
+            "req=%p started_lanes=0x%"PRIx64" new_started_lanes=0x%"PRIx64,
Contributor

Please check clang-format.

Contributor Author

Done.

@@ -40,7 +40,7 @@ UCP_UINT_TYPE(UCP_MD_INDEX_BITS) ucp_md_map_t;


/* Lanes */
-#define UCP_MAX_LANES 16
+#define UCP_MAX_LANES 64
Contributor

we should probably change code that uses ucs_ffs32 on the lane map to use ucs_ffs64, for example in protov2 rndv/put/zcopy

Contributor Author

I changed all the entries that I found, but I haven't found one in rndv/put/zcopy. Maybe they were removed in #9862
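A sketch of a 64-bit find-first-set with the same contract as the 32-bit version mentioned above (ffs64_sketch is illustrative, using a GCC/Clang builtin; the input must be non-zero):

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit. __builtin_ffsll returns a 1-based
 * index (and 0 for a zero input), so subtract one for non-zero
 * inputs to get the bit position. */
static unsigned ffs64_sketch(uint64_t value)
{
    assert(value != 0);
    return (unsigned)(__builtin_ffsll((long long)value) - 1);
}
```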

src/ucp/proto/proto_init.c (outdated, resolved)
src/ucp/proto/proto_multi.inl (outdated, resolved)
src/ucs/sys/compiler_def.h (outdated, resolved)
src/ucs/sys/compiler_def.h (outdated, resolved)
yosefe previously approved these changes May 15, 2024
@ivankochin ivankochin force-pushed the ucp/wireup/increase-max-lanes-to-64 branch from 2346a5b to ffcca4a Compare May 15, 2024 09:46
@ivankochin ivankochin enabled auto-merge May 15, 2024 09:46
brminich previously approved these changes May 15, 2024
@yosefe
Contributor

yosefe commented May 15, 2024

test failure seems relevant (with ASAN):

2024-05-15T11:43:42.5740013Z [ RUN      ] rc/multi_rail_max.max_lanes/7 <rc/proto_v1>
2024-05-15T11:43:43.2815747Z [     INFO ] lane[0] : sender 115 receiver 17
2024-05-15T11:43:43.2817074Z [     INFO ] lane[1] : sender 0 receiver 4608
2024-05-15T11:43:43.2817776Z [     INFO ] lane[2] : sender 0 receiver 4608
2024-05-15T11:43:43.2818286Z [     INFO ] lane[3] : sender 0 receiver 4608
2024-05-15T11:43:43.2818622Z [     INFO ] lane[4] : sender 0 receiver 4608
2024-05-15T11:43:43.2818900Z [     INFO ] lane[5] : sender 0 receiver 4608
2024-05-15T11:43:43.2819178Z [     INFO ] lane[6] : sender 0 receiver 4608
2024-05-15T11:43:43.2819441Z [     INFO ] lane[7] : sender 0 receiver 4608
2024-05-15T11:43:43.2819712Z [     INFO ] lane[8] : sender 0 receiver 4608
2024-05-15T11:43:43.2819985Z [     INFO ] lane[9] : sender 0 receiver 4608
2024-05-15T11:43:43.2820261Z [     INFO ] lane[10] : sender 0 receiver 4608
2024-05-15T11:43:43.2820527Z [     INFO ] lane[11] : sender 0 receiver 4608
2024-05-15T11:43:43.2820803Z [     INFO ] lane[12] : sender 0 receiver 4608
2024-05-15T11:43:43.2821089Z [     INFO ] lane[13] : sender 0 receiver 4608
2024-05-15T11:43:43.2821364Z [     INFO ] lane[14] : sender 0 receiver 4608
2024-05-15T11:43:43.2821619Z [     INFO ] lane[15] : sender 0 receiver 4608
2024-05-15T11:43:43.2821891Z [     INFO ] lane[16] : sender 0 receiver 4608
2024-05-15T11:43:43.2822163Z [     INFO ] lane[17] : sender 0 receiver 4608
2024-05-15T11:43:43.2822434Z [     INFO ] lane[18] : sender 0 receiver 4608
2024-05-15T11:43:43.2822707Z [     INFO ] lane[19] : sender 0 receiver 4608
2024-05-15T11:43:43.2822967Z [     INFO ] lane[20] : sender 0 receiver 4608
2024-05-15T11:43:43.2823339Z [     INFO ] lane[21] : sender 0 receiver 4608
2024-05-15T11:43:43.2823692Z [     INFO ] lane[22] : sender 0 receiver 4608
2024-05-15T11:43:43.2823979Z [     INFO ] lane[23] : sender 0 receiver 4608
2024-05-15T11:43:43.2824239Z [     INFO ] lane[24] : sender 0 receiver 4608
2024-05-15T11:43:43.2824508Z [     INFO ] lane[25] : sender 0 receiver 4608
2024-05-15T11:43:43.2825385Z [     INFO ] lane[26] : sender 0 receiver 4608
2024-05-15T11:43:43.2825933Z [     INFO ] lane[27] : sender 0 receiver 4608
2024-05-15T11:43:43.2826546Z [     INFO ] lane[28] : sender 0 receiver 4608
2024-05-15T11:43:43.2827195Z [     INFO ] lane[29] : sender 0 receiver 4608
2024-05-15T11:43:43.2827726Z [     INFO ] lane[30] : sender 0 receiver 4608
2024-05-15T11:43:43.2828259Z [     INFO ] lane[31] : sender 0 receiver 4608
2024-05-15T11:43:43.2828766Z [     INFO ] lane[32] : sender 0 receiver 4608
2024-05-15T11:43:43.2829287Z [     INFO ] lane[33] : sender 0 receiver 4608
2024-05-15T11:43:43.2829817Z [     INFO ] lane[34] : sender 0 receiver 4608
2024-05-15T11:43:43.2830376Z [     INFO ] lane[35] : sender 0 receiver 4608
2024-05-15T11:43:43.2830911Z [     INFO ] lane[36] : sender 0 receiver 4608
2024-05-15T11:43:43.2831429Z [     INFO ] lane[37] : sender 0 receiver 4608
2024-05-15T11:43:43.2831966Z [     INFO ] lane[38] : sender 0 receiver 4608
2024-05-15T11:43:43.2832500Z [     INFO ] lane[39] : sender 0 receiver 4608
2024-05-15T11:43:43.2833047Z [     INFO ] lane[40] : sender 0 receiver 4608
2024-05-15T11:43:43.2833579Z [     INFO ] lane[41] : sender 0 receiver 4608
2024-05-15T11:43:43.2834116Z [     INFO ] lane[42] : sender 0 receiver 4608
2024-05-15T11:43:43.2834656Z [     INFO ] lane[43] : sender 0 receiver 4608
2024-05-15T11:43:43.2835244Z [     INFO ] lane[44] : sender 0 receiver 4608
2024-05-15T11:43:43.2835780Z [     INFO ] lane[45] : sender 0 receiver 4608
2024-05-15T11:43:43.2836315Z [     INFO ] lane[46] : sender 0 receiver 4608
2024-05-15T11:43:43.2836852Z [     INFO ] lane[47] : sender 0 receiver 4608
2024-05-15T11:43:43.2837385Z [     INFO ] lane[48] : sender 0 receiver 4608
2024-05-15T11:43:43.2837919Z [     INFO ] lane[49] : sender 0 receiver 4608
2024-05-15T11:43:43.2838443Z [     INFO ] lane[50] : sender 0 receiver 4608
2024-05-15T11:43:43.2838981Z [     INFO ] lane[51] : sender 0 receiver 4608
2024-05-15T11:43:43.2839630Z [     INFO ] lane[52] : sender 0 receiver 4608
2024-05-15T11:43:43.2840165Z [     INFO ] lane[53] : sender 0 receiver 4608
2024-05-15T11:43:43.2840683Z [     INFO ] lane[54] : sender 0 receiver 4608
2024-05-15T11:43:43.2841231Z [     INFO ] lane[55] : sender 0 receiver 4608
2024-05-15T11:43:43.2841762Z [     INFO ] lane[56] : sender 0 receiver 4608
2024-05-15T11:43:43.2842291Z [     INFO ] lane[57] : sender 0 receiver 4096
2024-05-15T11:43:43.2842807Z [     INFO ] lane[58] : sender 0 receiver 0
2024-05-15T11:43:43.2843428Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2844103Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2844668Z [     INFO ] lane[59] : sender 0 receiver 0
2024-05-15T11:43:43.2845286Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2845874Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2846440Z [     INFO ] lane[60] : sender 0 receiver 0
2024-05-15T11:43:43.2847009Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2847603Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2848146Z [     INFO ] lane[61] : sender 0 receiver 0
2024-05-15T11:43:43.2848716Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2849306Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2849864Z [     INFO ] lane[62] : sender 0 receiver 0
2024-05-15T11:43:43.2850427Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1265: Failure
2024-05-15T11:43:43.2851004Z Expected: (sender_tx + receiver_tx) >= (chunk_size), actual: 0 vs 4096
2024-05-15T11:43:43.2851566Z [     INFO ] lane[63] : sender 0 receiver 0
2024-05-15T11:43:43.2852131Z /__w/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1263: Failure
2024-05-15T11:43:43.2852709Z Expected: (sender_tx + receiver_tx) > (0), actual: 0 vs 0
2024-05-15T11:43:43.6603087Z [  FAILED  ] rc/multi_rail_max.max_lanes/7, where GetParam() = rc/proto_v1 (1096 ms)

@ivankochin ivankochin disabled auto-merge May 16, 2024 03:42
@ivankochin
Contributor Author

ivankochin commented May 16, 2024

test failure seems relevant (with ASAN): (full log quoted above)

@yosefe It seems related, but I tried to reproduce it 100 times and the test always ended successfully. I also reran the CI and it passed too. That's very strange, but I think it could be some configuration issue. What do you think, can we merge it now or should we try to reproduce it more?

@yosefe
Contributor

yosefe commented May 16, 2024

@yosefe It seems related, but I tried to reproduce it 100 times and the test always ended successfully. I also reran the CI and it passed too. That's very strange, but I think it could be some configuration issue. What do you think, can we merge it now or should we try to reproduce it more?

One weird thing I see here is that we expect 64 lanes in a protov1 test. Maybe it's wrong?

@ivankochin
Contributor Author

One weird thing I see here is that we expect 64 lanes in a protov1 test. Maybe it's wrong?

We set MAX_RNDV_LANES=64 in that test case, so if I understand the logic correctly, it is OK to expect 64 lanes.

@yosefe
Contributor

yosefe commented May 16, 2024

I think we should limit protov1 tests to 16 lanes, since there may be places in protov1 flows that we are not updating to support more lanes through AM.

- Enable this test for protov2
- Set max_lanes to 16 for protov1
- Skip testing unsupported case
- Relax requirement for splitting message above lanes
- Enable am bw lanes creation to support multi-lane RNDV through AM
@ivankochin ivankochin dismissed stale reviews from brminich and yosefe via 2f16fe7 May 16, 2024 13:33
@yosefe
Contributor

yosefe commented May 17, 2024

/azp run


Azure Pipelines successfully started running 4 pipeline(s).

@yosefe
Contributor

yosefe commented May 19, 2024

/azp run


Azure Pipelines successfully started running 4 pipeline(s).


5 participants