
[GraphBolt][Bug] SEGV when preprocessing OnDiskDataset #7364

Open
easypickings opened this issue Apr 27, 2024 · 13 comments
Assignees
Labels
bug:confirmed Something isn't working

Comments

@easypickings

🐛 Bug

To Reproduce

When trying to construct an OnDiskDataset from the UK-Union graph, I get a segmentation fault during preprocessing. The error message is either munmap_chunk(): invalid pointer or double free or corruption (out). I further traced the error to the following line:

indptr, indices, edge_ids = sparse_matrix.csc()

Steps to reproduce the behavior:

Execute the following code:

import dgl.graphbolt as gb
dataset = gb.OnDiskDataset("path/to/dataset")

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 2.1.0+cu121
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.1.2+cu121
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.11
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

@Rhett-Ying
Collaborator

Could you make sure the num_nodes specified exactly matches the node IDs read from the edge file?
https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L92C21-L97

@easypickings
Author

Could you make sure the num_nodes specified exactly matches the node IDs read from the edge file? https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L92C21-L97

Yes, the node IDs in the edge file are consecutive from 0 to num_nodes - 1. Also, I can construct the COO and CSC matrices using scipy.sparse.
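The two checks mentioned here can be sketched with plain NumPy (the tiny edge list below is a hypothetical stand-in for the real edges.npy; the COO-to-CSC construction mirrors what scipy.sparse.coo_matrix(...).tocsc() computes):

```python
import numpy as np

# Hypothetical small edge list standing in for edges.npy: row 0 = src, row 1 = dst.
edges = np.array([[0, 1, 2, 2], [1, 2, 0, 1]])
num_nodes = int(edges.max()) + 1

# Sanity check: node IDs are consecutive from 0 to num_nodes - 1.
assert np.array_equal(np.unique(edges), np.arange(num_nodes))

# Minimal COO -> CSC conversion: sort edges by destination column, then
# build the column-pointer array as a cumulative count of edges per column.
order = np.argsort(edges[1], kind="stable")
indices = edges[0][order]                      # row index of each nonzero
indptr = np.zeros(num_nodes + 1, dtype=np.int64)
np.cumsum(np.bincount(edges[1], minlength=num_nodes), out=indptr[1:])
print(indptr.tolist(), indices.tolist())
```

Note the int64 indptr: its last entry must hold num_edges, which is the quantity that later turns out to matter in this issue.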

@Rhett-Ying
Collaborator

How large is your dataset (num_nodes, num_edges)?

And could you try commenting out the line below?
https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L96C13-L96C23

@easypickings
Author

num_nodes = 131814559 and num_edges = 5507679822.
Commenting it out doesn't help.

@Rhett-Ying
Collaborator

Oh, it's a large graph with more than 5B edges. What instance are you running this on? How much RAM does it have?

@easypickings
Author

I'm running on an Aliyun server with over 700 GB of RAM.

@Rhett-Ying
Collaborator

@yxy235 could you try to reproduce this error on r6i.metal with a random graph?

@yxy235
Collaborator

yxy235 commented Apr 28, 2024

@yxy235 could you try to reproduce this error on r6i.metal with a random graph?

OK

@yxy235
Collaborator

yxy235 commented Apr 29, 2024

I have tried to reproduce this, but I didn't get any errors with a random graph of the same size.

@Rhett-Ying Rhett-Ying added the bug:unconfirmed May be a bug. Need further investigation. label Apr 29, 2024
@easypickings
Author

@yxy235 Could you try using this data? https://mega.nz/folder/OWBwEQQL#nfkbhC35N4aLavIpCS2Cig (the sha256 is of the decompressed edges.npy, which is about 42GB)

@yxy235
Collaborator

yxy235 commented Apr 30, 2024

@yxy235 Could you try using this data? https://mega.nz/folder/OWBwEQQL#nfkbhC35N4aLavIpCS2Cig (the sha256 is of the decompressed edges.npy, which is about 42GB)

OK. I have reproduced the error and am debugging it now.

@yxy235
Collaborator

yxy235 commented May 6, 2024

@yxy235 Could you try using this data? https://mega.nz/folder/OWBwEQQL#nfkbhC35N4aLavIpCS2Cig (the sha256 is of the decompressed edges.npy, which is about 42GB)

@easypickings Could you try changing the dtype of your edges.npy to int64? I think that resolves the problem. It is caused by the number of edges exceeding the int32 range, which breaks the COO-to-CSC conversion when constructing the SparseMatrix. The dtype change is a temporary workaround; FYI, it may double memory consumption.
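The suggested workaround can be sketched as follows (a minimal sketch: the tiny in-memory array stands in for loading the real edges.npy, and the save step is shown commented out since it would overwrite the file):

```python
import numpy as np

# Stand-in for the real edge list, e.g. edges = np.load("edges.npy").
edges = np.array([[0, 1], [1, 0]], dtype=np.int32)

# Workaround: cast the edge list to int64 before preprocessing.
edges64 = edges.astype(np.int64)
# np.save("edges.npy", edges64)  # overwrite (or write a new file) before preprocessing

# As noted above, the int64 copy doubles per-element storage.
print(edges.nbytes, edges64.nbytes)
```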

@yxy235
Collaborator

yxy235 commented May 16, 2024

TBD:
The functions dispatched from `switch (WhichCOOToCSR<IdType>(coo)) {` should be checked, especially `CSRMatrix UnSortedDenseCOOToCSR(const COOMatrix &coo)`.
We should determine the dtype of the CSR matrix from `coo.row->shape[0]` rather than `coo.row->dtype`: if the shape exceeds MAX_INT32, we should use int64 regardless of whether `coo.row->dtype` is int32 or int64.
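As a sanity check on that diagnosis, the numbers from this thread show how a cumulative edge counter overflows int32 even though every node ID fits comfortably (num_edges is taken from the report above; the wrapped value is plain modular arithmetic):

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max   # 2_147_483_647
num_edges = 5_507_679_822            # reported earlier in this issue

# indptr entries count edges cumulatively, so the last entry must reach
# num_edges; with int32 that count wraps even though node IDs fit in int32.
print(num_edges > INT32_MAX)                    # True
print(np.int64(num_edges).astype(np.int32))     # wraps modulo 2**32 to 1212712526
```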

@Rhett-Ying Rhett-Ying added bug:confirmed Something isn't working and removed bug:unconfirmed May be a bug. Need further investigation. labels May 20, 2024