Hi @hildade, while PyKEEN theoretically supports unseen relations (by using relation representations that are predicted inductively, e.g., from features of the relation such as a label), none of the provided model configurations support this out-of-the-box. However, it is relatively easy to build such a model from the components PyKEEN provides. Here is a small example that encodes relation labels with a `TextRepresentation`:

```python
import random

from pykeen.datasets import get_dataset
from pykeen.datasets.base import Dataset, EagerDataset
from pykeen.nn.representation import TextRepresentation
from pykeen.pipeline import pipeline


def inductive_relation_split(dataset: Dataset, seen_fraction: float = 0.50, seed: int = 42) -> Dataset:
    # note: this is a very simple split that does not ensure, e.g., that all test entities occur in training
    training = dataset.training
    relation_ids = list(range(training.num_relations))
    rng = random.Random(seed)
    rng.shuffle(relation_ids)
    training_relations = relation_ids[: int(seen_fraction * dataset.num_relations)]
    return EagerDataset(
        training=training.new_with_restriction(relations=training_relations),
        validation=(
            None
            if dataset.validation is None
            else dataset.validation.new_with_restriction(
                relations=training_relations, invert_relation_selection=True
            )
        ),
        testing=dataset.testing.new_with_restriction(
            relations=training_relations, invert_relation_selection=True
        ),
    )


# we use a dataset which provides relation labels, which we'll use as features
dataset = get_dataset(dataset="nations")
inductive_dataset = inductive_relation_split(dataset=dataset)
inductive_dataset.summarize()

# build inductive relation representations using a Transformer encoder on top of relation labels
relation_representation = TextRepresentation.from_dataset(
    inductive_dataset, for_entities=False, encoder="transformer"
)

# now train a model with DistMult interaction (and standard entity representations, i.e., an embedding table)
result = pipeline(
    model="DistMult",
    dataset=inductive_dataset,
    model_kwargs={
        "relation_representations": relation_representation,
        # .shape is a tuple; the model expects an integer embedding dimension
        "embedding_dim": relation_representation.shape[0],
    },
)

# print a few metrics
for metric in ["adjusted_arithmetic_mean_rank_index", "hits_at_1", "hits_at_10"]:
    print(f"{metric:48} {result.get_metric(metric):.3f}")
```

Note that I created a very simple inductive relation split, which does not ensure, e.g., that all test entities have also been seen in training. Without any further tuning, I get the following results.
While not perfect, the positive adjusted arithmetic mean rank index highlights that we obtain performance better than random.
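Since the example uses the DistMult interaction, its scores can also be reproduced by hand from the learned head, relation, and tail vectors: DistMult scores a triple as the sum of the element-wise product of the three vectors. A minimal sketch with made-up vectors (not the trained embeddings from above):

```python
import numpy as np


def distmult_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """DistMult interaction: <h, r, t> = sum_i h_i * r_i * t_i."""
    return float(np.sum(h * r * t))


h = np.array([1.0, 0.5, -2.0])
r = np.array([0.0, 2.0, 1.0])
t = np.array([3.0, 1.0, 0.5])
# 1*0*3 + 0.5*2*1 + (-2)*1*0.5 = 0.0
print(distmult_score(h, r, t))
```

Comparing such hand-computed values against the scores the trained model produces for the same triples is a straightforward way to convince yourself that the stored representations are really what drives the predictions.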
I am using PyKEEN (1.10.2) for link prediction. My network has 6 types of edges (P-P, P-B, D-I, D-P, I-P, and B-B). I would like to train RotatE (for example) to predict D-I edges while keeping all other 5 edge types in training.
The model currently runs on transductive link prediction.
A. Can you please explain the RotatE score calculation, considering the exclusion of edge type D-I from training and the inclusion of (only) D-I in validation and testing? (Note that nodes of types D and I are represented in the training data.)
B. How can I verify that the embeddings are indeed used for the link prediction? Can I manually calculate and reproduce the test scores from the embeddings?
C. Theoretically, is inductive prediction preferable to transductive prediction in this case?
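For context on question A: RotatE represents entities as complex vectors and relations as element-wise rotations in the complex plane, scoring a triple by the negated distance between the rotated head and the tail. A toy sketch of that interaction (made-up embeddings, not PyKEEN's internal implementation):

```python
import numpy as np


def rotate_score(h: np.ndarray, r_phase: np.ndarray, t: np.ndarray) -> float:
    """RotatE interaction: rotate h by the relation's phases and
    return the negated distance to t (higher is more plausible)."""
    r = np.exp(1j * r_phase)  # unit-modulus complex rotation per dimension
    return float(-np.linalg.norm(h * r - t))


# toy complex embeddings of dimension 2
h = np.array([1 + 0j, 0 + 1j])
r_phase = np.array([np.pi / 2, 0.0])  # rotate the first component by 90 degrees
t = np.array([0 + 1j, 0 + 1j])
# a perfect rotation maps h onto t, so the distance (and score) is ~0
print(rotate_score(h, r_phase, t))
```

In a transductive run, the relation phases for D-I come from the trained embedding table, which is exactly why excluding D-I from training leaves that row effectively untrained.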