[GraphBolt] Add experimental ItemSet/Dict4 and ItemSampler4
#7371
base: master
Conversation
To trigger regression tests:
@Rhett-Ying Benchmark shows that the variation in performance is acceptable. I'm trying to find a way to let all replicas obtain a random seed from the main process instead of requiring users to set it manually, but that is a separate topic. For now, I think we can merge this PR first.
Benchmark on ogbn-products
Old:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3
Training with 4 gpus.
The dataset is already preprocessed.
Training...
48it [00:02, 16.06it/s]
Validating...
10it [00:00, 21.67it/s]
Epoch 00000 | Average Loss 2.3267 | Accuracy 0.7917 | Time 3.5637
48it [00:02, 21.37it/s]
Validating...
10it [00:00, 24.19it/s]
Epoch 00001 | Average Loss 0.9559 | Accuracy 0.8437 | Time 2.7528
48it [00:02, 21.33it/s]
Validating...
10it [00:00, 24.10it/s]
Epoch 00002 | Average Loss 0.7238 | Accuracy 0.8602 | Time 2.7597
48it [00:02, 21.33it/s]
Validating...
10it [00:00, 24.51it/s]
Epoch 00003 | Average Loss 0.6163 | Accuracy 0.8706 | Time 2.7502
48it [00:02, 21.45it/s]
Validating...
10it [00:00, 24.45it/s]
Epoch 00004 | Average Loss 0.5578 | Accuracy 0.8762 | Time 2.7404
48it [00:02, 20.19it/s]
Validating...
10it [00:00, 24.57it/s]
Epoch 00005 | Average Loss 0.5176 | Accuracy 0.8819 | Time 2.8776
48it [00:02, 21.50it/s]
Validating...
10it [00:00, 24.13it/s]
Epoch 00006 | Average Loss 0.4883 | Accuracy 0.8855 | Time 2.7396
48it [00:02, 21.42it/s]
Validating...
10it [00:00, 24.41it/s]
Epoch 00007 | Average Loss 0.4667 | Accuracy 0.8881 | Time 2.7437
48it [00:02, 21.31it/s]
Validating...
10it [00:00, 24.19it/s]
Epoch 00008 | Average Loss 0.4477 | Accuracy 0.8889 | Time 2.7596
48it [00:02, 21.46it/s]
Validating...
10it [00:00, 24.29it/s]
Epoch 00009 | Average Loss 0.4343 | Accuracy 0.8920 | Time 2.7416
Testing...
541it [00:19, 27.95it/s]
Test Accuracy 0.7348
New:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3
Training with 4 gpus.
The dataset is already preprocessed.
Training...
48it [00:03, 15.84it/s]
Validating...
10it [00:00, 22.02it/s]
Epoch 00000 | Average Loss 2.3048 | Accuracy 0.7777 | Time 3.5975
48it [00:02, 21.28it/s]
Validating...
10it [00:00, 25.05it/s]
Epoch 00001 | Average Loss 0.9804 | Accuracy 0.8388 | Time 2.7448
48it [00:02, 21.31it/s]
Validating...
10it [00:00, 24.98it/s]
Epoch 00002 | Average Loss 0.7427 | Accuracy 0.8587 | Time 2.7464
48it [00:02, 21.43it/s]
Validating...
10it [00:00, 25.03it/s]
Epoch 00003 | Average Loss 0.6308 | Accuracy 0.8696 | Time 2.7333
48it [00:02, 21.40it/s]
Validating...
10it [00:00, 25.19it/s]
Epoch 00004 | Average Loss 0.5623 | Accuracy 0.8785 | Time 2.7332
48it [00:02, 20.29it/s]
Validating...
10it [00:00, 24.69it/s]
Epoch 00005 | Average Loss 0.5228 | Accuracy 0.8815 | Time 2.8657
48it [00:02, 21.37it/s]
Validating...
10it [00:00, 24.89it/s]
Epoch 00006 | Average Loss 0.4937 | Accuracy 0.8850 | Time 2.7418
48it [00:02, 21.41it/s]
Validating...
10it [00:00, 25.01it/s]
Epoch 00007 | Average Loss 0.4696 | Accuracy 0.8879 | Time 2.7378
48it [00:02, 21.36it/s]
Validating...
10it [00:00, 25.03it/s]
Epoch 00008 | Average Loss 0.4537 | Accuracy 0.8909 | Time 2.7409
48it [00:02, 21.40it/s]
Validating...
10it [00:00, 24.88it/s]
Epoch 00009 | Average Loss 0.4388 | Accuracy 0.8932 | Time 2.7407
Testing...
541it [00:19, 27.96it/s]
Test Accuracy 0.7393
ogbn-arxiv
Old:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-arxiv
Training with 4 gpus.
The dataset is already preprocessed.
Training...
22it [00:01, 21.57it/s]
Validating...
8it [00:00, 52.40it/s]
Epoch 00000 | Average Loss 3.2543 | Accuracy 0.3002 | Time 1.2109
22it [00:00, 54.33it/s]
Validating...
8it [00:00, 70.41it/s]
Epoch 00001 | Average Loss 2.5287 | Accuracy 0.4404 | Time 0.5230
22it [00:00, 59.90it/s]
Validating...
8it [00:00, 71.66it/s]
Epoch 00002 | Average Loss 2.1985 | Accuracy 0.5054 | Time 0.4818
22it [00:00, 54.64it/s]
Validating...
8it [00:00, 86.39it/s]
Epoch 00003 | Average Loss 1.9795 | Accuracy 0.5349 | Time 0.4978
22it [00:00, 57.34it/s]
Validating...
8it [00:00, 78.11it/s]
Epoch 00004 | Average Loss 1.8419 | Accuracy 0.5529 | Time 0.4944
22it [00:00, 42.99it/s]
Validating...
8it [00:00, 73.39it/s]
Epoch 00005 | Average Loss 1.7533 | Accuracy 0.5649 | Time 0.6252
22it [00:00, 56.13it/s]
Validating...
8it [00:00, 76.69it/s]
Epoch 00006 | Average Loss 1.6852 | Accuracy 0.5713 | Time 0.5014
22it [00:00, 52.51it/s]
Validating...
8it [00:00, 79.52it/s]
Epoch 00007 | Average Loss 1.6405 | Accuracy 0.5766 | Time 0.5221
22it [00:00, 59.19it/s]
Validating...
8it [00:00, 67.85it/s]
Epoch 00008 | Average Loss 1.6055 | Accuracy 0.5814 | Time 0.4923
22it [00:00, 60.42it/s]
Validating...
8it [00:00, 71.80it/s]
Epoch 00009 | Average Loss 1.5681 | Accuracy 0.5878 | Time 0.4783
Testing...
12it [00:00, 82.86it/s]
Test Accuracy 0.5271
New:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-arxiv
Training with 4 gpus.
The dataset is already preprocessed.
Training...
22it [00:01, 18.31it/s]
Validating...
8it [00:00, 54.37it/s]
Epoch 00000 | Average Loss 3.1735 | Accuracy 0.2941 | Time 1.3790
22it [00:00, 58.89it/s]
Validating...
8it [00:00, 78.07it/s]
Epoch 00001 | Average Loss 2.4895 | Accuracy 0.4520 | Time 0.4908
22it [00:00, 56.94it/s]
Validating...
8it [00:00, 73.67it/s]
Epoch 00002 | Average Loss 2.1515 | Accuracy 0.5135 | Time 0.5007
22it [00:00, 54.02it/s]
Validating...
8it [00:00, 69.11it/s]
Epoch 00003 | Average Loss 1.9372 | Accuracy 0.5381 | Time 0.5256
22it [00:00, 56.69it/s]
Validating...
8it [00:00, 70.72it/s]
Epoch 00004 | Average Loss 1.8119 | Accuracy 0.5560 | Time 0.5067
22it [00:00, 39.94it/s]
Validating...
8it [00:00, 74.97it/s]
Epoch 00005 | Average Loss 1.7279 | Accuracy 0.5639 | Time 0.6646
22it [00:00, 56.77it/s]
Validating...
8it [00:00, 79.99it/s]
Epoch 00006 | Average Loss 1.6723 | Accuracy 0.5734 | Time 0.4928
22it [00:00, 60.43it/s]
Validating...
8it [00:00, 71.34it/s]
Epoch 00007 | Average Loss 1.6253 | Accuracy 0.5817 | Time 0.4789
22it [00:00, 58.53it/s]
Validating...
8it [00:00, 91.09it/s]
Epoch 00008 | Average Loss 1.5881 | Accuracy 0.5844 | Time 0.4690
22it [00:00, 56.57it/s]
Validating...
8it [00:00, 77.58it/s]
Epoch 00009 | Average Loss 1.5577 | Accuracy 0.5878 | Time 0.4972
Testing...
12it [00:00, 88.09it/s]
Test Accuracy 0.5279
ogbn-papers100M
Old:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-papers100M
Training with 4 gpus.
The dataset is already preprocessed.
Training...
294it [00:22, 13.15it/s]
Validating...
31it [00:02, 14.12it/s]
Epoch 00000 | Average Loss 1.9491 | Accuracy 0.5924 | Time 24.7810
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.54it/s]
Epoch 00001 | Average Loss 1.3033 | Accuracy 0.6245 | Time 23.8770
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.58it/s]
Epoch 00002 | Average Loss 1.2215 | Accuracy 0.6469 | Time 23.8830
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.56it/s]
Epoch 00003 | Average Loss 1.1796 | Accuracy 0.6448 | Time 23.8804
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.58it/s]
Epoch 00004 | Average Loss 1.1523 | Accuracy 0.6533 | Time 23.8787
294it [00:21, 13.58it/s]
Validating...
31it [00:02, 14.54it/s]
Epoch 00005 | Average Loss 1.1338 | Accuracy 0.6464 | Time 23.9888
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.55it/s]
Epoch 00006 | Average Loss 1.1200 | Accuracy 0.6503 | Time 23.8843
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.52it/s]
Epoch 00007 | Average Loss 1.1080 | Accuracy 0.6569 | Time 23.8870
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.53it/s]
Epoch 00008 | Average Loss 1.0979 | Accuracy 0.6615 | Time 23.8950
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.53it/s]
Epoch 00009 | Average Loss 1.0894 | Accuracy 0.6603 | Time 23.8899
Testing...
53it [00:03, 14.50it/s]
Test Accuracy 0.6318
New:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-papers100M
Training with 4 gpus.
The dataset is already preprocessed.
Training...
294it [00:21, 13.69it/s]
Validating...
31it [00:02, 14.19it/s]
Epoch 00000 | Average Loss 1.9418 | Accuracy 0.5957 | Time 23.8790
294it [00:20, 14.18it/s]
Validating...
31it [00:02, 14.65it/s]
Epoch 00001 | Average Loss 1.3039 | Accuracy 0.6233 | Time 23.0518
294it [00:20, 14.19it/s]
Validating...
31it [00:02, 14.57it/s]
Epoch 00002 | Average Loss 1.2206 | Accuracy 0.6458 | Time 23.0501
294it [00:20, 14.18it/s]
Validating...
31it [00:02, 14.62it/s]
Epoch 00003 | Average Loss 1.1800 | Accuracy 0.6493 | Time 23.0555
294it [00:20, 14.17it/s]
Validating...
31it [00:02, 14.54it/s]
Epoch 00004 | Average Loss 1.1533 | Accuracy 0.6571 | Time 23.0787
294it [00:20, 14.11it/s]
Validating...
31it [00:02, 14.58it/s]
Epoch 00005 | Average Loss 1.1354 | Accuracy 0.6563 | Time 23.1551
294it [00:20, 14.19it/s]
Validating...
31it [00:02, 14.56it/s]
Epoch 00006 | Average Loss 1.1197 | Accuracy 0.6585 | Time 23.0504
294it [00:20, 14.18it/s]
Validating...
31it [00:02, 14.57it/s]
Epoch 00007 | Average Loss 1.1088 | Accuracy 0.6571 | Time 23.0587
294it [00:20, 14.21it/s]
Validating...
31it [00:02, 14.53it/s]
Epoch 00008 | Average Loss 1.0991 | Accuracy 0.6616 | Time 23.0182
294it [00:20, 14.20it/s]
Validating...
31it [00:02, 14.57it/s]
Epoch 00009 | Average Loss 1.0909 | Accuracy 0.6632 | Time 23.0365
Testing...
53it [00:03, 14.53it/s]
Test Accuracy 0.6337

Tested on g4dn.metal.
@Rhett-Ying The issue of the random seed has been resolved. What a relief that torch.distributed has convenient communication APIs.
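For illustration, here is a minimal sketch of how a random seed can be shared across replicas with torch.distributed, as described above. The function name `sync_random_seed` and its exact shape are assumptions for this example, not the PR's actual code.

```python
import torch
import torch.distributed as dist


def sync_random_seed(device="cpu"):
    """Draw a random seed on rank 0 and broadcast it to every replica.

    All ranks must call this collectively; after the broadcast, each
    rank holds the same value and can seed its sampler identically.
    """
    seed = torch.empty((), dtype=torch.int64, device=device)
    if dist.get_rank() == 0:
        seed.random_(0, 2**31)  # only the main process draws the seed
    dist.broadcast(seed, src=0)  # all ranks now hold rank 0's value
    return int(seed.item())
```

This avoids asking users to set the seed manually: each training process calls the function once after `init_process_group`, and all replicas end up shuffling in the same order.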
This POC works well in terms of both correctness and performance. Now it's time to finalize the code change.
- Is it possible to update the existing ItemSampler instead of creating a new class? It seems the major part is fixing the seed.
- Is it possible to split the change on ItemSampler and ItemSet/Dict to make each change as small as possible for quick review?
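The reviewer's first suggestion can be sketched as follows. This is an illustrative stand-in only: the class name `SeededItemSampler`, its constructor signature, and the epoch-offset trick are assumptions for the example, not GraphBolt's real API.

```python
import random


class SeededItemSampler:
    """Toy sampler showing how a `seed` argument could extend an
    existing sampler instead of introducing a new class.

    When every replica is constructed with the same seed, all replicas
    shuffle items into the same order, which keeps minibatch assignment
    consistent across GPUs.
    """

    def __init__(self, items, batch_size, shuffle=False, seed=None):
        self._items = list(items)
        self._batch_size = batch_size
        self._shuffle = shuffle
        self._seed = seed
        self._epoch = 0

    def __iter__(self):
        order = list(range(len(self._items)))
        if self._shuffle:
            # Same seed on every replica -> same order on every replica.
            # Adding the epoch varies the order across epochs while
            # staying deterministic for a given seed.
            base = 0 if self._seed is None else self._seed
            random.Random(base + self._epoch).shuffle(order)
        self._epoch += 1
        for i in range(0, len(order), self._batch_size):
            yield [self._items[j] for j in order[i : i + self._batch_size]]
```

With this shape, the seed handling is the only new moving part, so the ItemSampler change and the ItemSet/Dict change could indeed be reviewed separately.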
@@ -36,6 +36,7 @@
│
└───> Test set evaluation
"""
Does it work well with --num-workers 2 for multiple GPUs?
Both the old and new implementations encounter the same error with --num-workers 2:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/torch/utils/data/datapipes/datapipe.py", line 359, in __setstate__
self._datapipe = dill.loads(value)
File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/dill/_dill.py", line 303, in loads
return load(file, ignore, **kwds)
File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/dill/_dill.py", line 289, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/dill/_dill.py", line 444, in load
obj = StockUnpickler.load(self)
AttributeError: 'PyCapsule' object has no attribute 'cudaHostUnregister'
Is this a long-standing problem? Or is there something wrong with my package version?
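The traceback above fails inside dill while the spawned worker deserializes the datapipe. A quick way to check whether a given pipeline object would survive this handoff is to round-trip it through the pickler before handing it to the DataLoader. The helper name `survives_spawn` is hypothetical, and this sketch uses the stdlib `pickle` as an approximation of the dill round-trip in the traceback.

```python
import pickle


def survives_spawn(obj):
    """Return True if obj round-trips through pickle, as spawn-based
    DataLoader workers require; print the failure reason otherwise."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception as exc:
        print(f"not picklable: {exc!r}")
        return False
```

Running this on the datapipe before training would localize which stage holds the unpicklable PyCapsule, which should help when filing the issue.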
I'm afraid no one has run the multi-GPU example with multiple num_workers before. Please file an issue and look into it.
I'm afraid the change on
Sounds good to me.
Description
benchmark:
Checklist
Please feel free to remove inapplicable items for your PR.
Changes