
[GraphBolt] Add experimental ItemSet/Dict4 and ItemSampler4 #7371

Open
wants to merge 12 commits into base: master

Conversation


@Skeleton003 (Collaborator) commented Apr 29, 2024

Description

benchmark:

num_ids: 24, num_workers: 0, drop_last: False, drop_uneven_inputs: False
Old: 5.26561164855957
New: 4.075196266174316

num_ids: 24, num_workers: 0, drop_last: False, drop_uneven_inputs: True
Old: 5.038467884063721
New: 5.061769247055054

num_ids: 24, num_workers: 0, drop_last: True, drop_uneven_inputs: False
Old: 5.08016300201416
New: 5.055493116378784

num_ids: 24, num_workers: 0, drop_last: True, drop_uneven_inputs: True
Old: 5.044290542602539
New: 5.022970676422119

num_ids: 24, num_workers: 2, drop_last: False, drop_uneven_inputs: False
Old: 7.418801546096802
New: 6.484843492507935

num_ids: 24, num_workers: 2, drop_last: False, drop_uneven_inputs: True
Old: 7.407760143280029
New: 7.527584791183472

num_ids: 24, num_workers: 2, drop_last: True, drop_uneven_inputs: False
Old: 6.492152690887451
New: 6.431138277053833

num_ids: 24, num_workers: 2, drop_last: True, drop_uneven_inputs: True
Old: 6.491805791854858
New: 7.4210569858551025

num_ids: 30, num_workers: 0, drop_last: False, drop_uneven_inputs: False
Old: 4.09150767326355
New: 5.011434316635132

num_ids: 30, num_workers: 0, drop_last: False, drop_uneven_inputs: True
Old: 5.040276288986206
New: 4.068592071533203

num_ids: 30, num_workers: 0, drop_last: True, drop_uneven_inputs: False
Old: 4.038927793502808
New: 4.038530349731445

num_ids: 30, num_workers: 0, drop_last: True, drop_uneven_inputs: True
Old: 5.019740343093872
New: 5.0285563468933105

num_ids: 30, num_workers: 2, drop_last: False, drop_uneven_inputs: False
Old: 7.428295612335205
New: 6.409729242324829

num_ids: 30, num_workers: 2, drop_last: False, drop_uneven_inputs: True
Old: 7.421130657196045
New: 7.533393383026123

num_ids: 30, num_workers: 2, drop_last: True, drop_uneven_inputs: False
Old: 7.41476035118103
New: 6.400209188461304

num_ids: 30, num_workers: 2, drop_last: True, drop_uneven_inputs: True
Old: 6.40072774887085
New: 6.447648048400879

num_ids: 32, num_workers: 0, drop_last: False, drop_uneven_inputs: False
Old: 4.057007789611816
New: 5.063795328140259

num_ids: 32, num_workers: 0, drop_last: False, drop_uneven_inputs: True
Old: 5.035150051116943
New: 5.006322145462036

num_ids: 32, num_workers: 0, drop_last: True, drop_uneven_inputs: False
Old: 5.089540958404541
New: 5.047980546951294

num_ids: 32, num_workers: 0, drop_last: True, drop_uneven_inputs: True
Old: 4.040552854537964
New: 5.0497941970825195

num_ids: 32, num_workers: 2, drop_last: False, drop_uneven_inputs: False
Old: 7.43973970413208
New: 7.493116855621338

num_ids: 32, num_workers: 2, drop_last: False, drop_uneven_inputs: True
Old: 7.553787469863892
New: 7.6020872592926025

num_ids: 32, num_workers: 2, drop_last: True, drop_uneven_inputs: False
Old: 6.490302085876465
New: 7.487463474273682

num_ids: 32, num_workers: 2, drop_last: True, drop_uneven_inputs: True
Old: 7.364883661270142
New: 7.4597368240356445

num_ids: 34, num_workers: 0, drop_last: False, drop_uneven_inputs: False
Old: 4.082199811935425
New: 4.053929328918457

num_ids: 34, num_workers: 0, drop_last: False, drop_uneven_inputs: True
Old: 4.063207149505615
New: 5.091043710708618

num_ids: 34, num_workers: 0, drop_last: True, drop_uneven_inputs: False
Old: 4.999620676040649
New: 5.112699031829834

num_ids: 34, num_workers: 0, drop_last: True, drop_uneven_inputs: True
Old: 5.024035930633545
New: 4.051522493362427

num_ids: 34, num_workers: 2, drop_last: False, drop_uneven_inputs: False
Old: 7.471214771270752
New: 6.554701328277588

num_ids: 34, num_workers: 2, drop_last: False, drop_uneven_inputs: True
Old: 6.449496269226074
New: 6.529990196228027

num_ids: 34, num_workers: 2, drop_last: True, drop_uneven_inputs: False
Old: 7.431456804275513
New: 7.41823673248291

num_ids: 34, num_workers: 2, drop_last: True, drop_uneven_inputs: True
Old: 6.479130506515503
New: 6.368876695632935

num_ids: 36, num_workers: 0, drop_last: False, drop_uneven_inputs: False
Old: 5.009013652801514
New: 5.050375461578369

num_ids: 36, num_workers: 0, drop_last: False, drop_uneven_inputs: True
Old: 4.0677573680877686
New: 4.125107288360596

num_ids: 36, num_workers: 0, drop_last: True, drop_uneven_inputs: False
Old: 5.023468971252441
New: 5.105181455612183

num_ids: 36, num_workers: 0, drop_last: True, drop_uneven_inputs: True
Old: 5.063021421432495
New: 5.089923143386841

num_ids: 36, num_workers: 2, drop_last: False, drop_uneven_inputs: False
Old: 7.424851179122925
New: 7.432251453399658

num_ids: 36, num_workers: 2, drop_last: False, drop_uneven_inputs: True
Old: 7.543227672576904
New: 7.601431131362915

num_ids: 36, num_workers: 2, drop_last: True, drop_uneven_inputs: False
Old: 6.457719326019287
New: 6.451065540313721

num_ids: 36, num_workers: 2, drop_last: True, drop_uneven_inputs: True
Old: 6.568817377090454
New: 6.491897344589233
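For context on the configurations above, here is a simplified, hypothetical model of how drop_last and drop_uneven_inputs interact when ids are split across replicas. The batch_size=4 and world_size=4 values and the helper name are assumptions for illustration, not values taken from the benchmark, and the real partitioning logic lives in GraphBolt's ItemSampler:

```python
import math

def batches_per_replica(num_ids, world_size, batch_size,
                        drop_last, drop_uneven_inputs):
    """Simplified model of how many mini-batches each replica yields."""
    # Split ids as evenly as possible across replicas.
    shares = [num_ids // world_size + (1 if r < num_ids % world_size else 0)
              for r in range(world_size)]
    # drop_last: discard the trailing partial batch on each replica.
    counts = [n // batch_size if drop_last else math.ceil(n / batch_size)
              for n in shares]
    if drop_uneven_inputs:
        # drop_uneven_inputs: trim every replica to the minimum batch count
        # so no replica runs extra iterations the others never reach.
        counts = [min(counts)] * world_size
    return counts

print(batches_per_replica(34, 4, 4, drop_last=False, drop_uneven_inputs=True))
# → [2, 2, 2, 2]
```

With drop_uneven_inputs=True every replica runs the same number of iterations, which is what keeps distributed training from hanging in collective ops on an uneven trailing batch.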

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL is equivalent to a PR) to learn more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation may be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • A related issue is referenced in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes


dgl-bot commented Apr 29, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master


dgl-bot commented Apr 29, 2024

Commit ID: 78b9e90

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link


dgl-bot commented Apr 29, 2024

Commit ID: 6354418

Build ID: 2

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link


dgl-bot commented Apr 29, 2024

Commit ID: 5a5c786

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link


dgl-bot commented May 5, 2024

Commit ID: 22180a5053a84344aceea9926ea4e80b83ff7cfb

Build ID: 4

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link


dgl-bot commented May 5, 2024

Commit ID: a2ca65173ae263f16b8b2182b782e040cd08c080

Build ID: 5

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link


dgl-bot commented May 6, 2024

Commit ID: 9d2e81a4480acdb79c59233c2efef9893e857d96

Build ID: 6

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Skeleton003 (Collaborator, Author)

@Rhett-Ying The benchmark shows that the variation in performance is acceptable. I'm trying to find a way for all replicas to obtain a random seed from the main process instead of requiring the user to set it manually, but that is a separate topic. For now, I think we can merge this PR first.


Rhett-Ying commented May 6, 2024

num_ids: 36, num_workers: 2

Is num_ids the total size of the ItemSet or ItemSetDict? If so, it's too small to be persuasive.

@Skeleton003 (Collaborator, Author)

benchmark on /dgl/examples/multigpu/graphbolt/node_classification.py:

ogbn-products

Old:

$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3
Training with 4 gpus.
The dataset is already preprocessed.
Training...
48it [00:02, 16.06it/s]
Validating...
10it [00:00, 21.67it/s]
Epoch 00000 | Average Loss 2.3267 | Accuracy 0.7917 | Time 3.5637
48it [00:02, 21.37it/s]
Validating...
10it [00:00, 24.19it/s]
Epoch 00001 | Average Loss 0.9559 | Accuracy 0.8437 | Time 2.7528
48it [00:02, 21.33it/s]
Validating...
10it [00:00, 24.10it/s]
Epoch 00002 | Average Loss 0.7238 | Accuracy 0.8602 | Time 2.7597
48it [00:02, 21.33it/s]
Validating...
10it [00:00, 24.51it/s]
Epoch 00003 | Average Loss 0.6163 | Accuracy 0.8706 | Time 2.7502
48it [00:02, 21.45it/s]
Validating...
10it [00:00, 24.45it/s]
Epoch 00004 | Average Loss 0.5578 | Accuracy 0.8762 | Time 2.7404
48it [00:02, 20.19it/s]
Validating...
10it [00:00, 24.57it/s]
Epoch 00005 | Average Loss 0.5176 | Accuracy 0.8819 | Time 2.8776
48it [00:02, 21.50it/s]
Validating...
10it [00:00, 24.13it/s]
Epoch 00006 | Average Loss 0.4883 | Accuracy 0.8855 | Time 2.7396
48it [00:02, 21.42it/s]
Validating...
10it [00:00, 24.41it/s]
Epoch 00007 | Average Loss 0.4667 | Accuracy 0.8881 | Time 2.7437
48it [00:02, 21.31it/s]
Validating...
10it [00:00, 24.19it/s]
Epoch 00008 | Average Loss 0.4477 | Accuracy 0.8889 | Time 2.7596
48it [00:02, 21.46it/s]
Validating...
10it [00:00, 24.29it/s]
Epoch 00009 | Average Loss 0.4343 | Accuracy 0.8920 | Time 2.7416
Testing...
541it [00:19, 27.95it/s]
Test Accuracy 0.7348

New:

$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3
Training with 4 gpus.
The dataset is already preprocessed.
Training...
48it [00:03, 15.84it/s]
Validating...
10it [00:00, 22.02it/s]
Epoch 00000 | Average Loss 2.3048 | Accuracy 0.7777 | Time 3.5975
48it [00:02, 21.28it/s]
Validating...
10it [00:00, 25.05it/s]
Epoch 00001 | Average Loss 0.9804 | Accuracy 0.8388 | Time 2.7448
48it [00:02, 21.31it/s]
Validating...
10it [00:00, 24.98it/s]
Epoch 00002 | Average Loss 0.7427 | Accuracy 0.8587 | Time 2.7464
48it [00:02, 21.43it/s]
Validating...
10it [00:00, 25.03it/s]
Epoch 00003 | Average Loss 0.6308 | Accuracy 0.8696 | Time 2.7333
48it [00:02, 21.40it/s]
Validating...
10it [00:00, 25.19it/s]
Epoch 00004 | Average Loss 0.5623 | Accuracy 0.8785 | Time 2.7332
48it [00:02, 20.29it/s]
Validating...
10it [00:00, 24.69it/s]
Epoch 00005 | Average Loss 0.5228 | Accuracy 0.8815 | Time 2.8657
48it [00:02, 21.37it/s]
Validating...
10it [00:00, 24.89it/s]
Epoch 00006 | Average Loss 0.4937 | Accuracy 0.8850 | Time 2.7418
48it [00:02, 21.41it/s]
Validating...
10it [00:00, 25.01it/s]
Epoch 00007 | Average Loss 0.4696 | Accuracy 0.8879 | Time 2.7378
48it [00:02, 21.36it/s]
Validating...
10it [00:00, 25.03it/s]
Epoch 00008 | Average Loss 0.4537 | Accuracy 0.8909 | Time 2.7409
48it [00:02, 21.40it/s]
Validating...
10it [00:00, 24.88it/s]
Epoch 00009 | Average Loss 0.4388 | Accuracy 0.8932 | Time 2.7407
Testing...
541it [00:19, 27.96it/s]
Test Accuracy 0.7393

ogbn-arxiv

Old:

$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-arxiv
Training with 4 gpus.
The dataset is already preprocessed.
Training...
22it [00:01, 21.57it/s]
Validating...
8it [00:00, 52.40it/s]
Epoch 00000 | Average Loss 3.2543 | Accuracy 0.3002 | Time 1.2109
22it [00:00, 54.33it/s]
Validating...
8it [00:00, 70.41it/s]
Epoch 00001 | Average Loss 2.5287 | Accuracy 0.4404 | Time 0.5230
22it [00:00, 59.90it/s]
Validating...
8it [00:00, 71.66it/s]
Epoch 00002 | Average Loss 2.1985 | Accuracy 0.5054 | Time 0.4818
22it [00:00, 54.64it/s]
Validating...
8it [00:00, 86.39it/s]
Epoch 00003 | Average Loss 1.9795 | Accuracy 0.5349 | Time 0.4978
22it [00:00, 57.34it/s]
Validating...
8it [00:00, 78.11it/s]
Epoch 00004 | Average Loss 1.8419 | Accuracy 0.5529 | Time 0.4944
22it [00:00, 42.99it/s]
Validating...
8it [00:00, 73.39it/s]
Epoch 00005 | Average Loss 1.7533 | Accuracy 0.5649 | Time 0.6252
22it [00:00, 56.13it/s]
Validating...
8it [00:00, 76.69it/s]
Epoch 00006 | Average Loss 1.6852 | Accuracy 0.5713 | Time 0.5014
22it [00:00, 52.51it/s]
Validating...
8it [00:00, 79.52it/s]
Epoch 00007 | Average Loss 1.6405 | Accuracy 0.5766 | Time 0.5221
22it [00:00, 59.19it/s]
Validating...
8it [00:00, 67.85it/s]
Epoch 00008 | Average Loss 1.6055 | Accuracy 0.5814 | Time 0.4923
22it [00:00, 60.42it/s]
Validating...
8it [00:00, 71.80it/s]
Epoch 00009 | Average Loss 1.5681 | Accuracy 0.5878 | Time 0.4783
Testing...
12it [00:00, 82.86it/s]
Test Accuracy 0.5271

New:

$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-arxiv
Training with 4 gpus.
The dataset is already preprocessed.
Training...
22it [00:01, 18.31it/s]
Validating...
8it [00:00, 54.37it/s]
Epoch 00000 | Average Loss 3.1735 | Accuracy 0.2941 | Time 1.3790
22it [00:00, 58.89it/s]
Validating...
8it [00:00, 78.07it/s]
Epoch 00001 | Average Loss 2.4895 | Accuracy 0.4520 | Time 0.4908
22it [00:00, 56.94it/s]
Validating...
8it [00:00, 73.67it/s]
Epoch 00002 | Average Loss 2.1515 | Accuracy 0.5135 | Time 0.5007
22it [00:00, 54.02it/s]
Validating...
8it [00:00, 69.11it/s]
Epoch 00003 | Average Loss 1.9372 | Accuracy 0.5381 | Time 0.5256
22it [00:00, 56.69it/s]
Validating...
8it [00:00, 70.72it/s]
Epoch 00004 | Average Loss 1.8119 | Accuracy 0.5560 | Time 0.5067
22it [00:00, 39.94it/s]
Validating...
8it [00:00, 74.97it/s]
Epoch 00005 | Average Loss 1.7279 | Accuracy 0.5639 | Time 0.6646
22it [00:00, 56.77it/s]
Validating...
8it [00:00, 79.99it/s]
Epoch 00006 | Average Loss 1.6723 | Accuracy 0.5734 | Time 0.4928
22it [00:00, 60.43it/s]
Validating...
8it [00:00, 71.34it/s]
Epoch 00007 | Average Loss 1.6253 | Accuracy 0.5817 | Time 0.4789
22it [00:00, 58.53it/s]
Validating...
8it [00:00, 91.09it/s]
Epoch 00008 | Average Loss 1.5881 | Accuracy 0.5844 | Time 0.4690
22it [00:00, 56.57it/s]
Validating...
8it [00:00, 77.58it/s]
Epoch 00009 | Average Loss 1.5577 | Accuracy 0.5878 | Time 0.4972
Testing...
12it [00:00, 88.09it/s]
Test Accuracy 0.5279

ogbn-papers100M

Old:

$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-papers100M
Training with 4 gpus.
The dataset is already preprocessed.
Training...
294it [00:22, 13.15it/s]
Validating...
31it [00:02, 14.12it/s]
Epoch 00000 | Average Loss 1.9491 | Accuracy 0.5924 | Time 24.7810
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.54it/s]
Epoch 00001 | Average Loss 1.3033 | Accuracy 0.6245 | Time 23.8770
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.58it/s]
Epoch 00002 | Average Loss 1.2215 | Accuracy 0.6469 | Time 23.8830
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.56it/s]
Epoch 00003 | Average Loss 1.1796 | Accuracy 0.6448 | Time 23.8804
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.58it/s]
Epoch 00004 | Average Loss 1.1523 | Accuracy 0.6533 | Time 23.8787
294it [00:21, 13.58it/s]
Validating...
31it [00:02, 14.54it/s]
Epoch 00005 | Average Loss 1.1338 | Accuracy 0.6464 | Time 23.9888
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.55it/s]
Epoch 00006 | Average Loss 1.1200 | Accuracy 0.6503 | Time 23.8843
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.52it/s]
Epoch 00007 | Average Loss 1.1080 | Accuracy 0.6569 | Time 23.8870
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.53it/s]
Epoch 00008 | Average Loss 1.0979 | Accuracy 0.6615 | Time 23.8950
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.53it/s]
Epoch 00009 | Average Loss 1.0894 | Accuracy 0.6603 | Time 23.8899
Testing...
53it [00:03, 14.50it/s]
Test Accuracy 0.6318

New:

$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-papers100M
Training with 4 gpus.
The dataset is already preprocessed.
Training...
294it [00:21, 13.69it/s]
Validating...
31it [00:02, 14.19it/s]
Epoch 00000 | Average Loss 1.9418 | Accuracy 0.5957 | Time 23.8790
294it [00:20, 14.18it/s]
Validating...
31it [00:02, 14.65it/s]
Epoch 00001 | Average Loss 1.3039 | Accuracy 0.6233 | Time 23.0518
294it [00:20, 14.19it/s]
Validating...
31it [00:02, 14.57it/s]
Epoch 00002 | Average Loss 1.2206 | Accuracy 0.6458 | Time 23.0501
294it [00:20, 14.18it/s]
Validating...
31it [00:02, 14.62it/s]
Epoch 00003 | Average Loss 1.1800 | Accuracy 0.6493 | Time 23.0555
294it [00:20, 14.17it/s]
Validating...
31it [00:02, 14.54it/s]
Epoch 00004 | Average Loss 1.1533 | Accuracy 0.6571 | Time 23.0787
294it [00:20, 14.11it/s]
Validating...
31it [00:02, 14.58it/s]
Epoch 00005 | Average Loss 1.1354 | Accuracy 0.6563 | Time 23.1551
294it [00:20, 14.19it/s]
Validating...
31it [00:02, 14.56it/s]
Epoch 00006 | Average Loss 1.1197 | Accuracy 0.6585 | Time 23.0504
294it [00:20, 14.18it/s]
Validating...
31it [00:02, 14.57it/s]
Epoch 00007 | Average Loss 1.1088 | Accuracy 0.6571 | Time 23.0587
294it [00:20, 14.21it/s]
Validating...
31it [00:02, 14.53it/s]
Epoch 00008 | Average Loss 1.0991 | Accuracy 0.6616 | Time 23.0182
294it [00:20, 14.20it/s]
Validating...
31it [00:02, 14.57it/s]
Epoch 00009 | Average Loss 1.0909 | Accuracy 0.6632 | Time 23.0365
Testing...
53it [00:03, 14.53it/s]
Test Accuracy 0.6337

@Skeleton003 (Collaborator, Author)

Tested on g4dn.metal.


dgl-bot commented May 6, 2024

Commit ID: 0091ccae666bf1915f2022dfd420afd049186a5e

Build ID: 7

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link


dgl-bot commented May 6, 2024

Commit ID: 0b00f16c08a5242b4a592cf9565b21eb69e80eb0

Build ID: 8

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@Skeleton003 (Collaborator, Author)

@Rhett-Ying The random seed issue has been resolved. What a relief that torch.distributed provides convenient communication APIs.
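The idea can be sketched without torch.distributed: the main process draws one seed, and every replica rebuilds the same shuffle from it. In the PR itself the seed would presumably be broadcast from rank 0 via torch.distributed's communication APIs; the helper below is hypothetical and only simulates that sharing in a single process:

```python
import random

def make_shuffled_order(seed, num_ids):
    # Every replica rebuilds the same global shuffle from the shared seed,
    # then reads only its own slice of the order.
    rng = random.Random(seed)
    order = list(range(num_ids))
    rng.shuffle(order)
    return order

# The main process draws one seed; in the real change it would be broadcast
# to all replicas instead of being set manually by the user.
shared_seed = random.Random(0).randrange(2**31)
replica_orders = [make_shuffled_order(shared_seed, num_ids=24) for _ in range(4)]

# All four replicas agree on the order, so their slices never overlap.
print(all(order == replica_orders[0] for order in replica_orders))
# → True
```

Because each replica derives the order deterministically from the one shared seed, no per-replica coordination is needed beyond the initial broadcast.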


dgl-bot commented May 6, 2024

Commit ID: 3ca84f39e44065cc93c484672d8639ddd152bf09

Build ID: 9

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link


dgl-bot commented May 7, 2024

Commit ID: 08ac1ebbba8514a5eeea4ffcbeac85204335468d

Build ID: 10

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Rhett-Ying (Collaborator) left a comment


This POC proves to work well on both correctness and performance. Now it's time to finalize the code change.

  1. Is it possible to update the existing ItemSampler instead of creating a new class? It seems the major part is fixing the seed.
  2. Is it possible to split the changes to ItemSampler and ItemSet/Dict to keep each change as small as possible for quick review?

@@ -36,6 +36,7 @@
└───> Test set evaluation
"""

Collaborator

Does it work well with --num-workers 2 for multiple GPUs?

Collaborator Author

Both the old and new implementations encounter the same error with --num-workers 2:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/torch/utils/data/datapipes/datapipe.py", line 359, in __setstate__
    self._datapipe = dill.loads(value)
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/dill/_dill.py", line 303, in loads
    return load(file, ignore, **kwds)
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/dill/_dill.py", line 289, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/home/ubuntu/miniconda3/envs/dgl/lib/python3.9/site-packages/dill/_dill.py", line 444, in load
    obj = StockUnpickler.load(self)
AttributeError: 'PyCapsule' object has no attribute 'cudaHostUnregister'

Is this a long-standing problem, or is there something wrong with my package versions?

Collaborator

I'm afraid no one has run the multi-GPU example with multiple num_workers before. Please file an issue and look into it.

@Skeleton003 (Collaborator, Author)

This POC proves to work well on both correctness and performance. Now it's time to finalize the code change.

  1. Is it possible to update the existing ItemSampler instead of creating a new class? It seems the major part is fixing the seed.
  2. Is it possible to split the changes to ItemSampler and ItemSet/Dict to keep each change as small as possible for quick review?

I'm afraid the change to ItemSet/Dict cannot be separated, because the new ItemSampler takes it as input; we have to modify them simultaneously. For the sake of code review, I think we can divide this PR into two: the first adds ItemSet/Dict4 while leaving the old ItemSetDict unchanged, and the second updates the existing ItemSampler and replaces the old ItemSetDict with the new one. If this is what you envision, I can get started on it right away.

@Rhett-Ying (Collaborator)

This POC proves to work well on both correctness and performance. Now it's time to finalize the code change.

  1. Is it possible to update the existing ItemSampler instead of creating a new class? It seems the major part is fixing the seed.
  2. Is it possible to split the changes to ItemSampler and ItemSet/Dict to keep each change as small as possible for quick review?

I'm afraid the change to ItemSet/Dict cannot be separated, because the new ItemSampler takes it as input; we have to modify them simultaneously. For the sake of code review, I think we can divide this PR into two: the first adds ItemSet/Dict4 while leaving the old ItemSetDict unchanged, and the second updates the existing ItemSampler and replaces the old ItemSetDict with the new one. If this is what you envision, I can get started on it right away.

Sounds good to me.

Labels: pr: Suspended PR status
4 participants