Could you please add more details why the current model doesn't work for comm ops?
Decouple output_bytes_accessed and bytes_accessed: bytes_accessed should be the operand size in bytes, not the two lumped together.
There is also operand_bytes_accessed. And bytes_accessed is just sum(operand_bytes_accessed) + output_bytes_accessed.
We don't use bytes_accessed in gpu_performance_model.cc, but I see uses in other parts of the codebase and I can't predict all the implications if we change semantics of the field.
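For illustration, here is a minimal self-contained sketch of that relationship. This is plain C++ with assumed names that mirror the fields discussed above, not the actual HloCostAnalysis implementation:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical per-instruction byte counts mirroring the fields discussed
// above; the real values live inside XLA's cost analysis, this only shows
// how they relate to each other.
struct InstructionByteCounts {
  std::vector<int64_t> operand_bytes_accessed;  // one entry per operand
  int64_t output_bytes_accessed = 0;

  // bytes_accessed is simply the sum over all operands plus the output.
  int64_t bytes_accessed() const {
    return std::accumulate(operand_bytes_accessed.begin(),
                           operand_bytes_accessed.end(), int64_t{0}) +
           output_bytes_accessed;
  }
};
```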
@olegshyshkov OK, I understand.
So for the second feature, do we need more op support, like allgather/reducescatter/alltoall cost analysis? I have finished a PR that does this.
Yes, having more support for those ops will be good if you can add it.
Such as allgather/reducescatter/alltoall, which are commonly used in FSDP/MoE models.

My suggestions:
- Decouple output_bytes_accessed and bytes_accessed; bytes_accessed should be the operand size in bytes, not the two lumped together.
- In gpu_collective_performance_model.cc, we can make a table showing how many bytes are sent inter-node and intra-node (see the sketch after this list).
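To make the inter-node vs. intra-node split concrete, here is a rough, self-contained sketch. The names and the ring-style formula are illustrative assumptions, not what gpu_collective_performance_model.cc actually does:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper: rough ring-algorithm byte counts for an all-gather,
// split into intra-node and inter-node traffic. Assumes ranks are placed
// contiguously per node; this is only an illustrative estimate.
struct CollectiveBytes {
  int64_t intra_node_bytes;
  int64_t inter_node_bytes;
};

CollectiveBytes EstimateAllGatherBytes(int64_t shard_bytes, int num_ranks,
                                       int ranks_per_node) {
  const int num_nodes = num_ranks / ranks_per_node;
  // In a ring over num_ranks ranks, every link carries (num_ranks - 1) shards.
  const int64_t per_link_bytes = shard_bytes * (num_ranks - 1);
  // With contiguous placement, num_nodes links cross a node boundary and the
  // remaining (num_ranks - num_nodes) links stay inside a node.
  return CollectiveBytes{
      /*intra_node_bytes=*/per_link_bytes * (num_ranks - num_nodes),
      /*inter_node_bytes=*/per_link_bytes * num_nodes,
  };
}

int main() {
  // Example: 16 ranks, 8 GPUs per node, 64 MiB shard per rank.
  const CollectiveBytes b = EstimateAllGatherBytes(64LL << 20, 16, 8);
  std::printf("intra-node: %lld bytes, inter-node: %lld bytes\n",
              static_cast<long long>(b.intra_node_bytes),
              static_cast<long long>(b.inter_node_bytes));
  return 0;
}
```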
I have prepared a PR for this. Is this feature request wanted?
Additional question:
How can I find out which GPUs are on the same node (intra-node)? Should I use the NVML API?
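One observation: NVML only enumerates the GPUs attached to the local node, so anything it lists is by definition intra-node, and anything reached over the network is inter-node. A minimal sketch of that enumeration (error handling kept to a bare check):

```cpp
#include <cstdio>
#include <nvml.h>  // link with -lnvidia-ml

int main() {
  if (nvmlInit() != NVML_SUCCESS) return 1;

  // Every device NVML can see is on this node, i.e. intra-node.
  unsigned int count = 0;
  if (nvmlDeviceGetCount(&count) == NVML_SUCCESS) {
    for (unsigned int i = 0; i < count; ++i) {
      nvmlDevice_t dev;
      char uuid[96] = {0};
      if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
          nvmlDeviceGetUUID(dev, uuid, sizeof(uuid)) == NVML_SUCCESS) {
        std::printf("local (intra-node) GPU %u: %s\n", i, uuid);
      }
    }
  }

  nvmlShutdown();
  return 0;
}
```

Comparing these locally visible devices against the participants of a collective would tell you which ranks share a node; whether NVML is the right mechanism to use inside the compiler is something the maintainers would need to confirm.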