Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]Add more comm op support on gpu_hlo_cost_analysis #11934

Open
zjjott opened this issue Apr 28, 2024 · 4 comments
Open

[Feature Request]Add more comm op support on gpu_hlo_cost_analysis #11934

zjjott opened this issue Apr 28, 2024 · 4 comments

Comments

@zjjott
Copy link
Contributor

zjjott commented Apr 28, 2024

such as allgather/reducescatter/alltoall, which is commonly usely on fsdp/moe model
my suggestion:

  1. decoupling output_bytes_accessed and bytes_accessed,which bytes_accessed shoud be operand size bytes. not put them together
  2. add allgather/reducescatter/alltoall support
  3. support upper op on gpu_collective_performance_model.cc;

we can make a table to show how many bytes will send inter nodes and inner nodes

I have prepared a PR for them, is this Feature Request needed?

additional question:
Can I get which gpu is inner node gpu? use nvml api?

@zjjott
Copy link
Contributor Author

zjjott commented May 6, 2024

@Tixxx @olegshyshkov

@olegshyshkov
Copy link
Contributor

Could you please add more details why the current model doesn't work for comm ops?

decoupling output_bytes_accessed and bytes_accessed,which bytes_accessed shoud be operand size bytes. not put them together

There is also operand_bytes_accessed. And bytes_accessed is just sum(operand_bytes_accessed) + output_bytes_accessed.

We don't use bytes_accessed in gpu_performance_model.cc, but I see uses in other parts of the codebase and I can't predict all the implications if we change semantics of the field.

@zjjott
Copy link
Contributor Author

zjjott commented May 7, 2024

@olegshyshkov OK,I understand~
so the second feature, do we need more op support like allgather/reducescatter/alltoall cost analysis? I have finish a PR to do this

@Tixxx
Copy link
Contributor

Tixxx commented May 7, 2024

@olegshyshkov OK,I understand~ so the second feature, do we need more op support like allgather/reducescatter/alltoall cost analysis? I have finish a PR to do this

Yes, having more support for those ops will be good if you can add it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants