Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GraphBolt] ItemSampler CPU usage too high, especially hetero case. #7315

Open
mfbalin opened this issue Apr 18, 2024 · 20 comments
Open

[GraphBolt] ItemSampler CPU usage too high, especially hetero case. #7315

mfbalin opened this issue Apr 18, 2024 · 20 comments
Assignees
Labels
Work Item Work items tracked in project tracker

Comments

@mfbalin
Copy link
Collaborator

mfbalin commented Apr 18, 2024

馃敤Work Item

IMPORTANT:

  • This template is only for dev team to track project progress. For feature request or bug report, please use the corresponding issue templates.
  • DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

When running the hetero graphbolt example in the pureGPU mode, the CPU utilization is very high. (4000%)

image
image

Depending work items or issues

@mfbalin mfbalin added the Work Item Work items tracked in project tracker label Apr 18, 2024
@Rhett-Ying
Copy link
Collaborator

As @mfbalin mentioned, specific logic for ItemsetDict could be the culprit.

@mfbalin mfbalin changed the title [GraphBolt] ItemSampler hetero example uses too much CPU [GraphBolt] MiniBatch.blocks uses the CPU when GPU sampling. Apr 18, 2024
@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 18, 2024

It now looks like dgl.create_block is the culprit.

@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 18, 2024

dgl/heterograph.py:6407 make_canonical_edges uses numpy for some ops.

@Rhett-Ying
Copy link
Collaborator

num_nodes_per_type = utils.toindex(

@Rhett-Ying
Copy link
Collaborator

@peizhou001
can dgl.create_block() purely run on GPU or it can only run purely on CPU instead? I remember you looked into it previously.

@mfbalin mfbalin changed the title [GraphBolt] MiniBatch.blocks uses the CPU when GPU sampling. [GraphBolt][CUDA] CPU usage too high for hetero case. Apr 18, 2024
@Rhett-Ying
Copy link
Collaborator

@mfbalin tried to bypass the whole forward including data.blocks and CPU usage is still high. So the create_blocks() is probably not the culprit.

@frozenbugs frozenbugs added this to the 2024 Graphbolt Misc milestone Apr 18, 2024
@mfbalin mfbalin changed the title [GraphBolt][CUDA] CPU usage too high for hetero case. [GraphBolt][CUDA] CPU usage too high Apr 19, 2024
@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 19, 2024

Update: CPU usage high even for the homo examples. Some recent change might have caused us to utilize the CPU even in the pureGPU mode. @frozenbugs do you think it could be the logic to move MiniBatch to device?

@mfbalin mfbalin changed the title [GraphBolt][CUDA] CPU usage too high [GraphBolt][CUDA][Regression] CPU usage too high Apr 19, 2024
@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 19, 2024

Or could it possibly be one of my recent changes, such as #7312?

Oh the code in #7312 does not run in the homo case.

I am going to bisect to see if I can identify a commit that causes this issue.

@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 19, 2024

git checkout 78df81015a9a6cdaa4843167b1d000f4ca377ca9

This commit does not have the issue. Somewhere between current master and the reported commit above, there was a change that cause CPU util on the GPU code path.

@mfbalin mfbalin added the Release Blocker Issues that blocks release label Apr 19, 2024
@Rhett-Ying
Copy link
Collaborator

Rhett-Ying commented Apr 19, 2024

git checkout 78df81015a9a6cdaa4843167b1d000f4ca377ca9

This commit does not have the issue. Somewhere between current master and the reported commit above, there was a change that cause CPU util on the GPU code path.

could be #7309
@yxy235 could you help look into it? Reproduce and confirm it?

@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 19, 2024

Easiest way to test is to run python examples/sampling/graphbolt/pyg/node_classification_advanced.py --torch-compile --mode=cuda-cuda-cuda. There is upto %30 regression.

@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 19, 2024

Transfered attr list:

['blocks', 'compacted_negative_dsts', 'compacted_negative_srcs', 'compacted_node_pairs', 'compacted_seeds', 'edge_features', 'indexes', 'input_nodes', 'labels', 'negative_dsts', 'negative_node_pairs', 'negative_srcs', 'node_features', 'node_pairs', 'node_pairs_with_labels', 'positive_node_pairs', 'sampled_subgraphs', 'seed_nodes', 'seeds']

compacted_negative_dsts
compacted_negative_srcs
compacted_node_pairs
compacted_seeds
edge_features
indexes
input_nodes
labels
negative_dsts
negative_srcs
node_features
node_pairs
sampled_subgraphs
seed_nodes
seeds

Actually transfered by calling .to:

input_nodes
labels
seeds

@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 19, 2024

Looks like blocks is called inside MiniBatch.to() even for the pyg example.

@yxy235
Copy link
Collaborator

yxy235 commented Apr 19, 2024

Transfered attr list:

['blocks', 'compacted_negative_dsts', 'compacted_negative_srcs', 'compacted_node_pairs', 'compacted_seeds', 'edge_features', 'indexes', 'input_nodes', 'labels', 'negative_dsts', 'negative_node_pairs', 'negative_srcs', 'node_features', 'node_pairs', 'node_pairs_with_labels', 'positive_node_pairs', 'sampled_subgraphs', 'seed_nodes', 'seeds']

compacted_negative_dsts
compacted_negative_srcs
compacted_node_pairs
compacted_seeds
edge_features
indexes
input_nodes
labels
negative_dsts
negative_srcs
node_features
node_pairs
sampled_subgraphs
seed_nodes
seeds

Actually transfered by calling .to:

input_nodes
labels
seeds

I see. Do you think we need a check when call Minibatch.to()?

@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 19, 2024

I figured it out. When we filter which attributes to transfer, we end up calling blocks property. Making a quick patch now.

@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 19, 2024

CPU usage still higher than 100% though, so I am not sure if I resolved the whole issue.

@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 19, 2024

Even with #7330, we need to investigate where the high CPU usage comes from. CPU usage is 800% for our main pure-gpu (--mode=cuda-cuda) node classification example.

@mfbalin mfbalin removed the Release Blocker Issues that blocks release label Apr 19, 2024
@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 19, 2024

hetero example CPU usage is still 4000%

@mfbalin mfbalin changed the title [GraphBolt][CUDA][Regression] CPU usage too high [GraphBolt][CUDA][Regression] CPU usage too high, especially hetero case. Apr 19, 2024
@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 20, 2024

@Rhett-Ying Here, we can find see the last iterations of training dataloader for the hetero example. Since we have a prefetcher thread with a buffer size 2, the last 2 iterations don't have excessive CPU utilization as the computation for the last 2 iterations has already finished. This indicates that the high CPU utilization is due to the ItemSampler.

        # (4) Cut datapipe at CopyTo and wrap with prefetcher. This enables the
        # data pipeline up to the CopyTo operation to run in a separate thread.
        datapipe_graph = _find_and_wrap_parent(
            datapipe_graph,
            CopyTo,
            dp.iter.Prefetcher,
            buffer_size=2,
        )

image

@mfbalin mfbalin changed the title [GraphBolt][CUDA][Regression] CPU usage too high, especially hetero case. [GraphBolt][CUDA][Regression] ItemSampler CPU usage too high, especially hetero case. Apr 20, 2024
@mfbalin mfbalin changed the title [GraphBolt][CUDA][Regression] ItemSampler CPU usage too high, especially hetero case. [GraphBolt] ItemSampler CPU usage too high, especially hetero case. Apr 20, 2024
@mfbalin
Copy link
Collaborator Author

mfbalin commented Apr 20, 2024

Users with multiple GPUs may not be able to utilize the GPUs effectively due to potential CPU bottleneck.

@mfbalin mfbalin removed their assignment Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Work Item Work items tracked in project tracker
Projects
Status: 馃彔 Backlog
Development

No branches or pull requests

4 participants