
feat: negative sampling for inductive data loading (#7331) #9152

Open · wants to merge 3 commits into master

Conversation


@eysu35 eysu35 commented Apr 4, 2024

I implemented a new Transform class that takes a HeteroData object, such as one returned by LinkNeighborLoader, and returns the same data with sampled negative edges attached. This differs from existing solutions in that the negative edge sampling is done in an inductive setting, where a limited subset of nodes is used to constrain the sampling.

In my current project, for example, we build a bipartite graph of drug and gene nodes and split the data using three different heuristics.

  1. By drug (termed "source"). We partition the drug (source) nodes into train/valid/test in a 7/1/2 ratio, then label each (drug, gene) edge as train/valid/test to reflect the category of its drug node (a minimal sketch of this split follows the list). This way we can "introduce" new drugs during validation and test to see how our model performs on unseen drugs.
  2. By gene (termed "target"). Same method, but all of the partitioning is done on the gene nodes.
  3. By pair/edge (termed "pair"). This is the transductive case: we randomly split the edges 70/10/20 into train/valid/test, so a given drug or gene node may happen to be missing from one split, but none is deliberately held out.
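
To make the first heuristic concrete, here is a minimal sketch (not the PR's code) of labeling edges by the split of their drug endpoint. The sizes, tensor names, and `edge_index` layout below are illustrative assumptions:

```python
import torch

# Illustrative sizes and edges; in practice these come from the dataset.
num_drugs = 100
edge_index = torch.tensor([[0, 5, 72, 85],   # row 0: drug (source) IDs
                           [3, 1, 9, 4]])    # row 1: gene (target) IDs

# Partition the drug IDs 7/1/2 into train/valid/test.
perm = torch.randperm(num_drugs)
train_drugs, valid_drugs, test_drugs = perm[:70], perm[70:80], perm[80:]

# Label every (drug, gene) edge with the split of its drug endpoint,
# so unseen drugs only appear in the validation/test supervision edges.
split = torch.empty(edge_index.size(1), dtype=torch.long)  # 0/1/2 = train/valid/test
split[torch.isin(edge_index[0], train_drugs)] = 0
split[torch.isin(edge_index[0], valid_drugs)] = 1
split[torch.isin(edge_index[0], test_drugs)] = 2
```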

In the first two data splits, we noticed that the negative sampler built into LinkNeighborLoader was causing a data leak. Using "source" as an example: if source nodes 0-69 are in train, 70-79 in validation, and 80-99 in test, then at test time every message edge has a source node ID in 0-79 and a target node ID anywhere in the target range, while every supervision edge has a source node ID in 80-99 and, again, a target node ID anywhere in the target range. We want the sampled negative edges to match this pattern: only the inductive source nodes (80-99), paired with all possible targets. Previously, the negative edges returned by the data loader contained source node IDs across the entire 0-99 range.

We therefore implemented our own Transform class, initialized with the true positive edges (to check candidates against), the data split type, and the negative sampling ratio. The transform can be passed to the data loader at initialization (we use LinkNeighborLoader), so that each returned mini-batch is automatically augmented with the restricted negative samples. A minimal sketch of such a transform and its use with LinkNeighborLoader is given below.

We have tested this extensively in our own project to make sure the test edges contain only unseen nodes in both the positive and negative pairs, so that we evaluate the _inductive_ performance of the model and previously seen source/target pairs do not inflate the test scores. Our literature review turned up several similar self-implemented loaders for the inductive setting, so we wanted to propose this method for others to share.
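
For illustration, here is a minimal sketch of what such a transform could look like. It assumes global node IDs in `edge_label_index` and a single `('drug', 'interacts', 'gene')` edge type; the class name, constructor arguments, and the omission of LinkNeighborLoader's batch-local re-indexing are simplifications of mine, not the PR's actual `add_negative_samples.py` implementation:

```python
import torch
from torch_geometric.data import HeteroData


class AddInductiveNegativeSamples:  # illustrative name, not the PR's class
    def __init__(self, pos_edge_index, allowed_src, num_dst,
                 edge_type=('drug', 'interacts', 'gene'), ratio=1.0):
        self.edge_type = edge_type
        self.allowed_src = allowed_src            # e.g. the unseen test drug IDs
        self.num_dst = num_dst                    # total number of gene nodes
        self.ratio = ratio
        # Hash the true positive (drug, gene) pairs for O(1) rejection checks.
        self.pos_pairs = {(int(s), int(d)) for s, d in pos_edge_index.t().tolist()}

    def __call__(self, data: HeteroData) -> HeteroData:
        store = data[self.edge_type]
        num_pos = store.edge_label_index.size(1)
        num_neg = int(self.ratio * num_pos)

        neg_src, neg_dst = [], []
        while len(neg_src) < num_neg:
            # Source is drawn only from the allowed (inductive) node subset;
            # target is drawn from the full target range.
            s = int(self.allowed_src[torch.randint(len(self.allowed_src), (1,))])
            d = int(torch.randint(self.num_dst, (1,)))
            if (s, d) not in self.pos_pairs:      # never emit a true positive
                neg_src.append(s)
                neg_dst.append(d)

        neg_edge_index = torch.tensor([neg_src, neg_dst], dtype=torch.long)
        pos_label = getattr(store, 'edge_label', None)
        if pos_label is None:
            pos_label = torch.ones(num_pos)

        # Append the negatives after the positives and extend the labels.
        store.edge_label_index = torch.cat(
            [store.edge_label_index, neg_edge_index], dim=1)
        store.edge_label = torch.cat([pos_label, torch.zeros(num_neg)])
        return data


# Usage with LinkNeighborLoader (arguments are illustrative):
# loader = LinkNeighborLoader(
#     data, num_neighbors=[10, 10],
#     edge_label_index=(('drug', 'interacts', 'gene'), test_pos_edge_index),
#     transform=AddInductiveNegativeSamples(all_pos_edge_index, test_drugs,
#                                           num_genes),
#     batch_size=128, shuffle=False)
```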

@eysu35 eysu35 requested a review from wsad1 as a code owner April 4, 2024 13:41

codecov bot commented Apr 4, 2024

Codecov Report

Attention: Patch coverage is 0%, with 57 lines in your changes missing coverage. Please review.

Project coverage is 89.41%. Comparing base (38bb5f2) to head (7763afc).
Report is 30 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| torch_geometric/transforms/add_negative_samples.py | 0.00% | 57 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9152      +/-   ##
==========================================
- Coverage   90.02%   89.41%   -0.62%     
==========================================
  Files         470      471       +1     
  Lines       30165    30222      +57     
==========================================
- Hits        27157    27023     -134     
- Misses       3008     3199     +191     


@eysu35 eysu35 changed the title from "feat: negative sampling for inductive learning cases (#7331)" to "feat: negative sampling for inductive data loading (#7331)" on Apr 8, 2024