feat: negative sampling for inductive data loading (#7331) #9152
I implemented a new Transform class that takes a HeteroData object, such as one returned by LinkNeighborLoader, and returns the same data with sampled negative edges attached. It differs from existing solutions in that the negative edges are sampled in an inductive setting, where sampling is constrained to a limited subset of nodes.
In my current project, we build a bipartite graph of drug and gene nodes and split the data using three different heuristics.
In the first two data splits, we noticed that the negative sampler built into LinkNeighborLoader was causing data leakage. Take the "source" split as an example: if source nodes 0-69 are in train, 70-79 in validation, and 80-99 in test, then at test time every message edge has a source node id in 0-79 and a target node id anywhere in the full target range, while every supervision edge has a source node id in 80-99 and a target node id, again, anywhere in the full target range. We want the sampled negative edges to follow the same pattern: they should include only the inductive source nodes (80-99) and may include any target. Previously, the negative edges returned by the data loader contained source node ids from the entire 0-99 range.

We therefore implemented our own Transform class, which is initialized with the true positive edges (to check candidates against), the data split type, and the negative sampling ratio. The transform can be passed to the data loader at initialization (we are using LinkNeighborLoader), so that each returned mini-batch is automatically transformed to include the restricted negative samples.

We have tested this extensively in our own project to make sure the test edges include only unseen nodes in both the positive and negative pairs. This lets us evaluate the _inductive_ performance of our model more comprehensively, so that previously seen source/target pairs do not accidentally dominate the test scores. During our literature review we noticed similar self-implemented loaders for the inductive learning setting, so we wanted to propose this method for others to share.
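The constrained sampling described above can be sketched in plain Python. This is a minimal illustration, not the Transform class from this PR: the function name `sample_inductive_negatives`, its parameters, and the toy node ids and edges are all hypothetical, and it works on lists of `(source, target)` tuples rather than a HeteroData object. Sources are drawn only from the inductive pool, targets from the full pool, and any pair that is a known positive is rejected.

```python
import random


def sample_inductive_negatives(pos_edges, known_pos, src_pool, dst_pool,
                               ratio=1.0, seed=None):
    """Hypothetical helper: draw negative (source, target) pairs whose source
    comes only from src_pool and whose target comes only from dst_pool."""
    rng = random.Random(seed)
    src_pool = sorted(set(src_pool))
    dst_pool = sorted(set(dst_pool))
    src_set, dst_set = set(src_pool), set(dst_pool)
    known = set(known_pos)

    # Cap the request at the number of distinct negatives that actually exist,
    # so the rejection loop below cannot spin forever.
    blocked = sum(1 for s, d in known if s in src_set and d in dst_set)
    max_neg = len(src_pool) * len(dst_pool) - blocked
    num_neg = min(int(len(pos_edges) * ratio), max_neg)

    negatives = set()
    while len(negatives) < num_neg:
        pair = (rng.choice(src_pool), rng.choice(dst_pool))
        if pair not in known:  # reject true positive edges
            negatives.add(pair)
    return sorted(negatives)


# Toy "source" split: sources 80-99 are the unseen (test) nodes.
test_sources = range(80, 100)
all_targets = range(0, 50)
test_pos = [(80, 3), (85, 10), (99, 49)]
all_pos = test_pos + [(0, 3), (70, 10)]  # positives from every split

neg = sample_inductive_negatives(test_pos, all_pos, test_sources,
                                 all_targets, ratio=2.0, seed=0)
```

With a ratio of 2.0 and three positive test edges, `neg` holds six distinct pairs, each with a source id in 80-99. For a "target" split, the same call would presumably swap the roles of the two pools.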