
Error with RandomEdgeSplit for multilabel edge classification #9262
Open
sadrahkm opened this issue Apr 30, 2024 · 4 comments

sadrahkm commented Apr 30, 2024

🐛 Describe the bug

Recently, I've been dealing with a multi-label edge classification problem. In other words, an edge can have more than one label. So I implemented a simple GNN model to see if I get good results or not.

I have 935 label types and have encoded them using sklearn's MultiLabelBinarizer. I have checked and I'm sure that all the encoded labels are 0 or 1.
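
As an illustration (the label names below are made up, not the actual dataset), the encoding step looks something like this:

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical example of the encoding described above: each edge's label set
# becomes a binary indicator row (935 columns in the real dataset).
mlb = MultiLabelBinarizer()
edge_label = mlb.fit_transform([{"A", "B"}, {"B"}, {"C"}])
# edge_label has shape (num_edges, num_classes) and contains only 0s and 1s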

But after splitting the edges with RandomLinkSplit, I noticed that there are more than two label values in the validation and test sets. In the train set the labels are 0 and 1, but in the validation set there are 0, 1, and 2, which makes things difficult. The following screenshot shows this: the first cell is the original data encoded with MultiLabelBinarizer, and the next three cells are the train/val/test sets, respectively, produced by the RandomLinkSplit call shown in the code block below.

[screenshot: edge_label values of the original data and of the train/val/test splits]

For example, I want to compute the AUC score during testing. I have attached the code and the error I got. I don't know what I should do, or why the edge-splitting transform returns more than two label values; I think it should only contain 0 or 1. I would appreciate your help with this.

import torch_geometric.transforms as T

# Split the edges into train/val/test sets
transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    disjoint_train_ratio=None,
    add_negative_train_samples=False,
)
train_data, val_data, test_data = transform(data)


import torch
from torchmetrics.classification import MultilabelAUROC

@torch.no_grad()
def test(data):
    model.eval()
    z = model.encode(data.x, data.edge_index)      # node embeddings
    out = model.decode(z, data.edge_label_index)   # per-edge predictions

    # Macro-averaged AUROC over the 935 labels
    ml_auroc = MultilabelAUROC(num_labels=935, average="macro", thresholds=None)
    auc = ml_auroc(out.cpu(), data.edge_label.cpu())

    return auc

for epoch in range(1, 100):
    loss = train()
    val_auc = test(val_data)
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Val AUC: {val_auc:.4f}')

Running this raises:

RuntimeError: Detected the following values in `target`: tensor([0, 1, 2]) but expected only the following values [0, 1].

Versions

Collecting environment information...
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14) 12.2.0
Clang version: Could not collect
CMake version: version 3.25.1
Libc version: glibc-2.36

Python version: 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-6.1.0-20-amd64-x86_64-with-glibc2.36
...

sadrahkm added the bug label Apr 30, 2024

keeganq commented May 6, 2024

I was able to reproduce this problem with a minimal example. The root cause is that even when add_negative_train_samples=False, negative sampling still occurs for the val and test splits.

# From the RandomLinkSplit source: negative edges are always counted for the
# val/test splits, regardless of add_negative_train_samples.
num_neg_train = 0
if self.add_negative_train_samples:
    if num_disjoint > 0:
        num_neg_train = int(num_disjoint * self.neg_sampling_ratio)
    else:
        num_neg_train = int(num_train * self.neg_sampling_ratio)
num_neg_val = int(num_val * self.neg_sampling_ratio)
num_neg_test = int(num_test * self.neg_sampling_ratio)
num_neg = num_neg_train + num_neg_val + num_neg_test

Unfortunately, this not only adds negative edges to val_data and test_data, it also means that their edge labels are incremented by 1, whereas the train_data labels are left unchanged. In your example, label 1 in val_data corresponds to label 0 in train_data, and so on; label 0 in val_data indicates a negative link.
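
To illustrate the shift with made-up numbers (this is only a sketch of the behaviour described above, not actual library output):

import torch

# Hypothetical val_data.edge_label after RandomLinkSplit with negative sampling:
# positive edges keep their multi-label rows shifted up by one, and the appended
# all-zero rows correspond to the sampled negative edges.
val_edge_label = torch.tensor([[2, 1, 1],   # positive edge, originally [1, 0, 0]
                               [1, 2, 1],   # positive edge, originally [0, 1, 0]
                               [0, 0, 0]])  # sampled negative edge

neg_mask = (val_edge_label == 0).all(dim=1)   # rows of zeros = negative edges
original = val_edge_label[~neg_mask] - 1      # undo the +1 shift -> back to {0, 1}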

This seems like a very confusing kwarg, and possibly an unintended result. I'd be happy to submit a PR to try to fix this.


keeganq commented May 6, 2024

@sadrahkm A quick workaround is to pass the kwarg neg_sampling_ratio=0. to T.RandomLinkSplit. This will prevent negative sampling for the validation and test sets, and will also preserve the original labels in your dataset.
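
A minimal sketch of this workaround, applied to the split from the original report:

import torch_geometric.transforms as T

# Workaround: disable negative sampling for all splits so that val/test keep
# the original 0/1 multi-label targets instead of the shifted labels.
transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,            # no negative edges for val/test
    add_negative_train_samples=False,  # no negative edges for train
)
train_data, val_data, test_data = transform(data)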

sadrahkm (Author) commented

Thank you @keeganq for your help.

Right, I hadn't noticed that the add_negative_train_samples option only applies to the training samples, and that negative sampling is still performed automatically for the validation/test sets.

Yes, setting neg_sampling_ratio=0 fixes it. But I think this should be clarified in the documentation to avoid this kind of confusion.

sadrahkm (Author) commented

However, if we actually want negative samples for the train/val/test sets, the problem remains. In that case we would have to set add_negative_train_samples=True together with, say, neg_sampling_ratio=2.0, and then the val/test sets would again contain more than two label values, as I described in the problem statement.
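
For reference, this is the configuration described above that would re-trigger the label shift (a sketch, not a recommendation):

import torch_geometric.transforms as T

# With negative sampling enabled for every split, val/test edge labels are
# shifted again and contain values outside {0, 1}, as described in the issue.
transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=2.0,           # two negative edges per positive edge
    add_negative_train_samples=True,  # also sample negatives for the train split
)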
