Replace NeighborSampler with NeighborLoader in mag240m #382

Draft
wants to merge 12 commits into
base: master

Conversation

yanbing-j

This PR is currently a draft and still contains many debug print statements.

path = osp.join(self.dir, 'processed', 'paper', 'node_label.npy')
data["paper"].y = torch.from_numpy(np.load(path))
path = osp.join(self.dir, 'processed', 'paper', 'node_year.npy')
data["paper"].year = torch.from_numpy(np.load(path, mmap_mode='r'))
Collaborator

We would need to add data['author'].num_nodes = ... and data['institution'].num_nodes = ... to register them as node types.

Author

@yanbing-j Sep 22, 2022

data['author'].num_nodes = self.__meta__['author']
data['institution'].num_nodes = self.__meta__['institution']
I added these two lines to register author and institution as node types, but the RuntimeError is still there.

def to_pyg_hetero_data(self):
data = HeteroData()
path = osp.join(self.dir, 'processed', 'paper', 'node_feat.npy')
# Current is not in-memory
Collaborator

Do you mean:

```suggestion
        # Currently in-memory only
```

Author

data["paper"].x = torch.from_numpy(np.load(path, mmap_mode='r')) comes from @property def paper_label(self)..., which is used when self.in_memory is False. I left the comment here as a reminder to myself to enable the in_memory part.

name = f'{src}___{rel}___{dst}'
path = osp.join(self.dir, 'processed', name, 'edge_index.npy')
return np.load(path)
# def edge_index(self, id1: str, id2: str,
Collaborator

Uncomment back in?

Author

The edge_index function is no longer needed. The edge_index information can be found in data[('author', 'writes', 'paper')].edge_index, data[('author', 'affiliated_with', 'institution')].edge_index and data[('paper', 'cites', 'paper')].edge_index, right?
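
To make the mapping concrete, here is a minimal sketch of how the three edge types named above would correspond to the processed edge_index.npy files, following the `{src}___{rel}___{dst}` directory naming used by the helper under discussion. The root directory `mag240m_root` and the function name `edge_index_path` are hypothetical, and the loading step is only indicated in a comment because it assumes the files exist on disk:

```python
import os.path as osp

# The three edge types mentioned in this thread.
EDGE_TYPES = [
    ('author', 'writes', 'paper'),
    ('author', 'affiliated_with', 'institution'),
    ('paper', 'cites', 'paper'),
]


def edge_index_path(root, edge_type):
    """Return the processed edge_index.npy path for one edge type."""
    src, rel, dst = edge_type
    name = f'{src}___{rel}___{dst}'  # triple underscores, as in the helper
    return osp.join(root, 'processed', name, 'edge_index.npy')


# 'mag240m_root' is a placeholder, not the real dataset location.
paths = {et: edge_index_path('mag240m_root', et) for et in EDGE_TYPES}

# With the files on disk, each array could then be attached to the
# HeteroData object, for example:
#   data[et].edge_index = torch.from_numpy(np.load(paths[et]))
for edge_type, path in paths.items():
    print(edge_type, '->', path)
```

Keeping the edge types in one list means the same tuples can be used both to locate the files and to index the data object, which is the access pattern described in the comment above.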

@@ -163,7 +183,8 @@ def save_test_submission(self, input_dict: Dict, dir_path: str, mode: str):


if __name__ == '__main__':
dataset = MAG240MDataset()
dataset = MAG240MDataset('/home/user/yanbing/pyg/ogb/ogb/lsc/dataset')
data = dataset.to_pyg_hetero_data()
Collaborator

Let's test this separately?

Author

/home/user/yanbing/pyg/ogb/ogb/lsc/dataset is my development root; I will remove it.

adjs_t=[adj_t.to(*args, **kwargs) for adj_t in self.adjs_t],
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class MAG240M(LightningDataModule):
Collaborator

We could try to make use of torch_geometric.data.LightningNodeDataset for this. This would simplify the construction of neighbor loaders.

Author

Sorry. There is no LightningNodeDataset in pyg.

Author

You mean LightningNodeData? Will try this.

Author

I have updated the code to use LightningNodeData, but it still raises the RuntimeError: Node conv1__paper1 target conv1.author__writes__paper references nonexistent attribute author__writes__paper of conv1.
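
One way to read the error message: the missing attribute author__writes__paper looks like the edge type ('author', 'writes', 'paper') joined with double underscores. The sketch below assumes that naming convention, which is not confirmed in this thread, and uses hypothetical helpers (`attr_name`, `edge_type_from_attr`) to map such a name back to an edge type. It only helps identify which edge type the message refers to; it does not diagnose the cause of the error.

```python
# Edge types discussed in this thread (assumed to be the registered ones).
EDGE_TYPES = [
    ('author', 'writes', 'paper'),
    ('author', 'affiliated_with', 'institution'),
    ('paper', 'cites', 'paper'),
]


def attr_name(edge_type):
    """Join an edge type with double underscores (assumed convention)."""
    return '__'.join(edge_type)


def edge_type_from_attr(name, edge_types):
    """Return the edge type whose joined name equals `name`, else None.

    Matching against known edge types avoids splitting on '__', which
    would be fragile for relations such as 'affiliated_with'.
    """
    for edge_type in edge_types:
        if attr_name(edge_type) == name:
            return edge_type
    return None


missing = 'author__writes__paper'  # attribute name taken from the error
print(edge_type_from_attr(missing, EDGE_TYPES))
```

Note that this joined name uses two underscores, whereas the processed directory names on disk use three.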

@puririshi98
Contributor

@yanbing-j if you're not opposed, I can take this over when I find time in the next few weeks and finish this PR, as it is needed for my work.

@yanbing-j
Author

@puririshi98 Sure. Please go ahead.

3 participants