feat: negative sampling for inductive data loading (#7331) #9152

eysu35 · 2024-04-04T13:41:05Z

I implemented a new Transform class that takes in a HeteroData object, such as one returned by the LinkNeighborLoader, and returns the same data with sampled negative edges attached. This differs from existing solutions in that the negative edge sampling is done in an inductive setting, where a limited subset of nodes is used to constrain the sampling.

In my current project, for example, we are creating a bipartite graph of drug and gene nodes, and we split our data in three different ways:

1. By drug (termed "source"). We partition the drug (source) nodes into train/valid/test in a 7/1/2 ratio, then label each (drug, gene) edge as train/valid/test according to the category of its drug node. This way we can "introduce" new drugs during validation and test to see how our model performs on unseen drugs.
2. By gene (termed "target"). Same method, but all of the partitioning is done on the gene nodes.
3. By pair/edge (termed "pair"). This is the transductive learning case, where we simply split the edges at random 70/10/20 into train/valid/test. Any given drug or gene node may happen to be absent from one split, but none is deliberately excluded.
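To make the "source" split concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the helper names `split_nodes` and `label_edges`, the fixed seed, and the toy edge list are assumptions made for illustration only.

```python
import random


def split_nodes(node_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle node ids and partition them into train/valid/test sets."""
    ids = list(node_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(ratios[0] * len(ids))
    n_valid = round(ratios[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "valid": set(ids[n_train:n_train + n_valid]),
        "test": set(ids[n_train + n_valid:]),
    }


def label_edges(edges, node_split, side="source"):
    """Give each (drug, gene) edge the split of its source or target node."""
    idx = 0 if side == "source" else 1
    node_to_split = {
        node: name for name, nodes in node_split.items() for node in nodes
    }
    return [node_to_split[edge[idx]] for edge in edges]


# Hypothetical toy data: 100 drug ids and three (drug, gene) edges.
drug_split = split_nodes(range(100))
edges = [(0, 5), (42, 7), (99, 5)]
labels = label_edges(edges, drug_split, side="source")
```

The "target" split would use the same helpers with gene ids and `side="target"`; the "pair" split would shuffle the edge list itself instead of the nodes.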

In the first two data splits, we noticed that the negative sampler built into LinkNeighborLoader was causing a data leak. Take "source" as an example: if source nodes 0-69 are in train, 70-79 in validation, and 80-99 in test, then at test time every message edge has a source node id in 0-79 and a target node id anywhere in the full target range, while every supervision edge has a source node id in 80-99 and a target node id again anywhere in the full target range. We want the sampled negative edges to follow the same pattern: only the inductive source nodes (80-99), paired with any possible target. Previously, the negative edges returned by the data loader contained source node ids from the entire 0-99 range.

To fix this, we implemented our own Transform class, which is initialized with the true positive edges (to check against), the data split type, and the negative sampling ratio. The transform can be passed to the data loader at initialization (we are using LinkNeighborLoader), so that each returned minibatch is automatically transformed to include the restricted negative samples.

We have tested this extensively in our own project to make sure that the test edges, both positive and negative, only include the unseen nodes. This lets us evaluate the _inductive_ performance of our model more faithfully, so that previously seen source/target pairs do not accidentally inflate the test scores. We noticed similar self-implemented loaders for the inductive learning setting in our own literature review, so we wanted to propose this method for others to use.
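The restriction described above can be sketched with simple rejection sampling. This is an illustration under stated assumptions, not the transform added in this PR: the function name `sample_restricted_negatives`, its parameters, and the toy edges are hypothetical, and a real implementation would operate on the tensors of a HeteroData minibatch rather than on Python tuples.

```python
import random


def sample_restricted_negatives(allowed_sources, all_targets, true_edges,
                                num_neg, seed=0):
    """Sample (source, target) pairs that are not true edges.

    Sources are drawn only from `allowed_sources` (e.g. the unseen test
    drugs), targets from the full target range.
    """
    rng = random.Random(seed)
    sources = list(allowed_sources)
    targets = list(all_targets)
    forbidden = set(true_edges)
    negatives = set()
    # Cap the number of attempts so the loop ends even if too few
    # valid pairs exist; fewer than num_neg pairs may then be returned.
    max_tries = 100 * max(num_neg, 1)
    tries = 0
    while len(negatives) < num_neg and tries < max_tries:
        tries += 1
        candidate = (rng.choice(sources), rng.choice(targets))
        if candidate not in forbidden:
            negatives.add(candidate)
    return sorted(negatives)


# Hypothetical "source" split at test time: drugs 80-99 are unseen.
true_edges = [(3, 1), (75, 4), (80, 1), (85, 3), (99, 10)]
test_positives = [e for e in true_edges if 80 <= e[0] <= 99]
negatives = sample_restricted_negatives(
    allowed_sources=range(80, 100),
    all_targets=range(50),
    true_edges=true_edges,
    num_neg=2 * len(test_positives),  # negative sampling ratio of 2
)
```

Every sampled negative has a source id in 80-99, so no training or validation drug can appear in the test-time negatives.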

codecov · 2024-04-04T13:46:45Z

Codecov Report

Attention: Patch coverage is 0%, with 57 lines in your changes missing coverage. Please review.

Project coverage is 89.41%. Comparing base (38bb5f2) to head (7763afc).
Report is 30 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| torch_geometric/transforms/add_negative_samples.py | 0.00% | 57 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #9152      +/-   ##
==========================================
- Coverage   90.02%   89.41%   -0.62%     
==========================================
  Files         470      471       +1     
  Lines       30165    30222      +57     
==========================================
- Hits        27157    27023     -134     
- Misses       3008     3199     +191


open PR

709018e

eysu35 requested a review from wsad1 as a code owner April 4, 2024 13:41

github-actions bot added the transform label Apr 4, 2024

eysu35 and others added 2 commits April 4, 2024 09:58

change name

d115cb9

Merge branch 'master' into negative_sampling_transform

7763afc

eysu35 changed the title ~~feat: negative sampling for inductive learning cases (#7331)~~ feat: negative sampling for inductive data loading (#7331) Apr 8, 2024

rusty1s assigned eysu35 May 2, 2024

rusty1s added feature 1 - Priority P1 labels May 2, 2024
