want to pretrain on my own datasets #21

Open
Juicechen95 opened this issue Nov 11, 2020 · 5 comments

Comments

@Juicechen95

Hi, acbull~ I think this algorithm is very interesting, and I would really like to test it on my own graph dataset. Is there any advice or are there any tips on how to prepare my own graph data for pretraining? Thank you very much~

@acbull
Owner

acbull commented Nov 11, 2020

Hi:

You can simply follow the same paradigm as the preprocess_*.py scripts to parse your graph into our data format, and then run the matching pretrain_*.py script over the parsed graph.

Or, if you want to merge our code into your own system, you can rewrite the data structure; everything else stays the same.
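
For readers who want a feel for what "parsing a graph into a data format" can look like, here is a minimal sketch under stated assumptions. It is not the repository's preprocessing code or its data structure: the function `build_typed_graph`, the five-field edge tuple, and the `rev_` naming for reverse relations are all hypothetical choices made for illustration. It only shows the general idea of mapping raw IDs to contiguous indices per node type and storing typed adjacency lists.

```python
from collections import defaultdict


def build_typed_graph(edges):
    """Index a typed edge list (hypothetical format, not the repo's).

    edges: iterable of (src_type, src_id, relation, dst_type, dst_id).
    Returns (node_index, adj) where
      node_index[node_type][raw_id] -> contiguous integer index
      adj[(dst_type, src_type, relation)][dst_idx] -> list of src_idx
    """
    node_index = defaultdict(dict)
    adj = defaultdict(lambda: defaultdict(list))

    def index_of(node_type, raw_id):
        # Assign the next free integer the first time a raw id is seen.
        table = node_index[node_type]
        if raw_id not in table:
            table[raw_id] = len(table)
        return table[raw_id]

    for src_type, src_id, relation, dst_type, dst_id in edges:
        s = index_of(src_type, src_id)
        d = index_of(dst_type, dst_id)
        # Store the edge in both directions so a sampler can walk either way.
        adj[(dst_type, src_type, relation)][d].append(s)
        adj[(src_type, dst_type, "rev_" + relation)][s].append(d)

    return node_index, adj


if __name__ == "__main__":
    toy_edges = [
        ("author", "a1", "writes", "paper", "p1"),
        ("author", "a2", "writes", "paper", "p1"),
    ]
    node_index, adj = build_typed_graph(toy_edges)
    print(dict(node_index["author"]))
    print(adj[("paper", "author", "writes")][0])
```

Whatever structure the real scripts use, the same two steps (re-indexing nodes per type, then grouping edges by relation) are what a custom dataset would typically need before pretraining.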

@Juicechen95
Author

Thank you very much for your advice, I will try it~

@Juicechen95
Author

My own dataset contains more than 10 million nodes. The paper says the OAG dataset contains more than 178 million nodes, but according to pretrain_OAG.py only about 1 million nodes seem to be used for pretraining. Is that number right?
I don't understand why only a small part of OAG is used for pretraining. I also wonder how large a dataset this method can handle, because my dataset is very large. Is that realistic? Will I run into memory problems, or will it become impossible to sample the small subgraphs? Do you have any ideas about how to do pretraining on a very large dataset?
I would be very grateful for your answer~ Thanks a lot!

@acbull
Owner

acbull commented Nov 12, 2020

Since we use subgraph sampling during training, the size of the pretraining graph does not matter much. In my experiments I also tried the whole OAG dataset, but it is too big, so I didn't provide it on Google Drive. You can certainly use our code to do pretraining on a super-large dataset.
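
To illustrate why subgraph sampling decouples the per-step cost from the full graph size, here is a minimal sketch of fanout-limited neighbor sampling. It is a generic technique swapped in for illustration, not the sampler used in this repository, and the function `sample_subgraph` and its parameters `depth` and `fanout` are hypothetical names. The adjacency dictionary still has to be stored somewhere, so this only bounds the size of each training batch, not the cost of holding the graph.

```python
import random


def sample_subgraph(adj, seeds, depth=2, fanout=3, rng=None):
    """Sample a bounded neighborhood around the seed nodes.

    adj: dict mapping node -> list of neighbor nodes.
    At each of `depth` hops, every frontier node keeps at most `fanout`
    randomly chosen neighbors, so the result holds at most
    len(seeds) * (1 + fanout + ... + fanout**depth) nodes,
    independent of how many nodes the full graph has.
    """
    rng = rng or random.Random(0)
    visited = set(seeds)
    frontier = list(seeds)
    edges = []
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            neighbors = adj.get(node, [])
            picked = rng.sample(neighbors, min(fanout, len(neighbors)))
            for nb in picked:
                edges.append((node, nb))
                if nb not in visited:
                    visited.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return visited, edges


if __name__ == "__main__":
    # Toy graph: 10,000 nodes, each linked to the next 10 nodes on a ring.
    n = 10_000
    adj = {i: [(i + k) % n for k in range(1, 11)] for i in range(n)}
    nodes, edges = sample_subgraph(adj, seeds=[0, 5000], depth=2, fanout=3)
    print(len(nodes), len(edges))
```

With two seeds, `depth=2`, and `fanout=3`, the sampled node set can never exceed 2 * (1 + 3 + 9) = 26 nodes, whether the graph has ten thousand nodes or hundreds of millions.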

@Juicechen95
Author

Thank you for your patient reply~ I will try it.
