
[GraphBolt][Bug] SEGV when preprocessing OnDiskDataset #7364

Open
easypickings opened this issue Apr 27, 2024 · 13 comments
Assignees
Labels
bug:confirmed Something isn't working

Comments

@easypickings

🐛 Bug

To Reproduce

When trying to construct an OnDiskDataset from the UK-Union graph, I get a segmentation fault during preprocessing. The error message is either munmap_chunk(): invalid pointer or double free or corruption (out). I further traced the error to the following line:

indptr, indices, edge_ids = sparse_matrix.csc()

Steps to reproduce the behavior:

Execute the following code:

import dgl.graphbolt as gb
dataset = gb.OnDiskDataset("path/to/dataset")

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 2.1.0+cu121
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.1.2+cu121
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.11
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

@Rhett-Ying
Collaborator

Could you make sure the num_nodes specified exactly matches the node IDs read from the edge file?
https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L92C21-L97

@easypickings
Author

Could you make sure the num_nodes specified exactly matches the node IDs read from the edge file? https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L92C21-L97

Yes, the node IDs in the edge file are consecutive from 0 to num_nodes - 1. Also, I can construct the COO and CSC matrices using scipy.sparse.
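The two checks mentioned here can be sketched with plain NumPy (the tiny edge list below is a hypothetical stand-in for the real edges.npy; the COO-to-CSC construction mirrors what scipy.sparse.coo_matrix(...).tocsc() computes):

```python
import numpy as np

# Hypothetical small edge list standing in for edges.npy: row 0 = src, row 1 = dst.
edges = np.array([[0, 1, 2, 2], [1, 2, 0, 1]])
num_nodes = int(edges.max()) + 1

# Sanity check: node IDs are consecutive from 0 to num_nodes - 1.
assert np.array_equal(np.unique(edges), np.arange(num_nodes))

# Minimal COO -> CSC conversion: sort edges by destination column, then
# build the column-pointer array as a cumulative count of edges per column.
order = np.argsort(edges[1], kind="stable")
indices = edges[0][order]                      # row index of each nonzero
indptr = np.zeros(num_nodes + 1, dtype=np.int64)
np.cumsum(np.bincount(edges[1], minlength=num_nodes), out=indptr[1:])
print(indptr.tolist(), indices.tolist())
```

Note the int64 indptr: its last entry must hold num_edges, which is the quantity that later turns out to matter in this issue.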

@Rhett-Ying
Collaborator

How large is your dataset (num_nodes, num_edges)?

And could you try commenting out the line below?
https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L96C13-L96C23

@easypickings
Author

num_nodes = 131814559 and num_edges = 5507679822.
Commenting it out doesn't help.

@Rhett-Ying
Collaborator

Oh, it's a large graph with more than 5B edges. What instance are you running this on? How much RAM does it have?

@easypickings
Author

I'm running on an Aliyun server with over 700 GB of RAM.

@Rhett-Ying
Collaborator

@yxy235 could you try to reproduce this error on r6i.metal with a random graph?

@yxy235
Collaborator

yxy235 commented Apr 28, 2024

@yxy235 could you try to reproduce this error on r6i.metal with a random graph?

OK

@yxy235
Collaborator

yxy235 commented Apr 29, 2024

I have tried to reproduce this, but I didn't get any errors with a random graph of the same size.

@Rhett-Ying Rhett-Ying added the bug:unconfirmed May be a bug. Need further investigation. label Apr 29, 2024
@easypickings
Author

@yxy235 Could you try using this data? https://mega.nz/folder/OWBwEQQL#nfkbhC35N4aLavIpCS2Cig (the sha256 is of the decompressed edges.npy, which is about 42GB)

@yxy235
Collaborator

yxy235 commented Apr 30, 2024

@yxy235 Could you try using this data? https://mega.nz/folder/OWBwEQQL#nfkbhC35N4aLavIpCS2Cig (the sha256 is of the decompressed edges.npy, which is about 42GB)

OK. I have reproduced the error and am debugging it now.

@yxy235
Collaborator

yxy235 commented May 6, 2024

@yxy235 Could you try using this data? https://mega.nz/folder/OWBwEQQL#nfkbhC35N4aLavIpCS2Cig (the sha256 is of the decompressed edges.npy, which is about 42GB)

@easypickings Could you try changing the dtype of your edges.npy to int64? I think that resolves the problem. It is caused by the number of edges exceeding the int32 range, which breaks the COO-to-CSC conversion when constructing the SparseMatrix. The dtype change is a temporary workaround; FYI, it may double memory consumption.
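The suggested workaround can be sketched as follows (a minimal sketch: the tiny in-memory array stands in for loading the real edges.npy, and the save step is shown commented out since it would overwrite the file):

```python
import numpy as np

# Stand-in for the real edge list, e.g. edges = np.load("edges.npy").
edges = np.array([[0, 1], [1, 0]], dtype=np.int32)

# Workaround: cast the edge list to int64 before preprocessing.
edges64 = edges.astype(np.int64)
# np.save("edges.npy", edges64)  # overwrite (or write a new file) before preprocessing

# As noted above, the int64 copy doubles per-element storage.
print(edges.nbytes, edges64.nbytes)
```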

@yxy235
Collaborator

yxy235 commented May 16, 2024

TBD:
The functions dispatched from `switch (WhichCOOToCSR<IdType>(coo)) {` should be checked, especially `CSRMatrix UnSortedDenseCOOToCSR(const COOMatrix &coo)`.
We should determine the dtype of the CSR matrix from `coo.row->shape[0]` rather than `coo.row->dtype`: if the shape exceeds MAX_INT32, we should use int64 regardless of whether `coo.row->dtype` is int32 or int64.
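As a sanity check on that diagnosis, the numbers from this thread show how a cumulative edge counter overflows int32 even though every node ID fits comfortably (num_edges is taken from the report above; the wrapped value is plain modular arithmetic):

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max   # 2_147_483_647
num_edges = 5_507_679_822            # reported earlier in this issue

# indptr entries count edges cumulatively, so the last entry must reach
# num_edges; with int32 that count wraps even though node IDs fit in int32.
print(num_edges > INT32_MAX)                    # True
print(np.int64(num_edges).astype(np.int32))     # wraps modulo 2**32 to 1212712526
```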

@Rhett-Ying Rhett-Ying added bug:confirmed Something isn't working and removed bug:unconfirmed May be a bug. Need further investigation. labels May 20, 2024