How to evaluate metrics on many subsets with multiple models efficiently #1360
-
For a project, I need to evaluate metrics on many small subsets of the Wikidata5M test set. Specifically, I need evaluation metrics (such as Hits@K, arithmetic mean rank, etc.) for every predicate of the test set in order to decide which model to use for prediction. To this end, I trained the ComplEx, DistMult, SimplE, and TransE models with an embedding dimension of 32 and split the test set of Wikidata5M into predicate groups like this:

```python
import pandas as pd

def get_test_set_per_predicate(test_set_file):
    # [subject, predicate, object]
    test_set = pd.read_csv(test_set_file, sep="\t", names=['S', 'P', 'O'], header=None)
    return test_set.groupby('P')
```
I'm loading the trained models like this:

```python
import torch

from pykeen.triples import TriplesFactory

def get_trained_models():
    return {
        'ComplEx': {
            'model': torch.load('embeddings/ComplEx/trained_model.pkl'),
            'factory': TriplesFactory.from_path_binary('embeddings/ComplEx/training_triples')
        },
        'DistMult': {
            ...
        },
        'SimplE': {
            ...
        },
        'TransE': {
            ...
        }
    }
```

The test set has 211 unique predicates, so the resulting dataframe consists of 211 groups. For evaluation, I iterate over all test subsets and all trained models:

```python
from pykeen.evaluation import RankBasedEvaluator

test_splits = get_test_set_per_predicate('knowledge_graph/wikidata5m_transductive_test.txt')
trained_models = get_trained_models()  # dict of loaded model instances and triples factories

# Cycle through all predicate groups
for predicate, triples in test_splits:
    evaluator = RankBasedEvaluator()
    # Convert the subset dataframe into a triples factory
    test_factory = TriplesFactory.from_labeled_triples(triples=triples.values)
    # Cycle through all trained models
    metrics = pd.DataFrame()  # Dataframe for accumulating results
    for model_name, result in trained_models.items():
        model = result['model']
        training_factory = result['factory']  # The loaded factory of the training data
        model_metrics = evaluator.evaluate(
            model=model,
            mapped_triples=test_factory.mapped_triples,
            additional_filter_triples=[
                dataset.training.mapped_triples,  # dataset is the Wikidata5M instance of the library
                dataset.validation.mapped_triples
            ]
        ).to_dict()
        flattened_metrics = pd.concat([
            # Aggregating optimistic, realistic and pessimistic metrics into one dataframe
            # and adding a new column for the metric type
        ])
        flattened_metrics['P'] = predicate
        flattened_metrics['model'] = model_name
        metrics = pd.concat([metrics, flattened_metrics])
```

One evaluation run takes about 1:30 minutes, which sums up to about 21 hours for all 211 predicates and 4 models. This is very inefficient, so I'm asking whether there is a better way of evaluating so many subsets, or whether it is possible to get the individual metrics from the evaluator. I tried to compute the metrics myself using … Is there a more efficient way to compute this, or am I missing something? Also, please let me know if there are errors in the code.

Environment
-
Hi @mortenterhart, there is a parameter `clear_on_finalize` of `RankBasedEvaluator` that should allow you to evaluate these metrics more efficiently. Below I provide a commented example that should be easy to adapt to your case.

```python
import numpy
from pykeen.datasets import get_dataset
from pykeen.datasets.nations import NATIONS_TEST_PATH
from pykeen.evaluation import RankBasedEvaluator, RankBasedMetricResults
from pykeen.evaluation.rank_based_evaluator import _iter_ranks
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory
# note: replace this by loading your custom dataset & model; it's only here to be self-contained
dataset = get_dataset(dataset="nations")
result = pipeline(dataset=dataset, model="complex", training_kwargs=dict(num_epochs=1))
model = result.model
# note: it is important to re-use the same entity/relation-to-id mapping!
test_factory = TriplesFactory.from_path(
    path=NATIONS_TEST_PATH,
    entity_to_id=dataset.training.entity_to_id,
    relation_to_id=dataset.training.relation_to_id,
)
# instantiate evaluator and evaluate on the full test set once
# afterwards, we'll have evaluator.ranks & evaluator.num_candidates filled
evaluator = RankBasedEvaluator(clear_on_finalize=False)
evaluator.evaluate(
    model=model,
    mapped_triples=test_factory.mapped_triples,
    additional_filter_triples=[
        dataset.training.mapped_triples,
        dataset.validation.mapped_triples,
    ],
)
# now comes the custom part:
# evaluator.ranks & evaluator.num_candidates contain the rank / number of candidates for each individual evaluation batch
# importantly, they are in the same order as the evaluation triples
# hence, we can perform the selection / grouping by relation directly on these!
# we'll compose some intermediate dataframes to make these operations easier
df_temp = test_factory.tensor_to_df(
    tensor=test_factory.mapped_triples,
    **{"-".join(("rank",) + key): numpy.concatenate(value) for key, value in evaluator.ranks.items()},
    **{"-".join(("num_candidates", key)): numpy.concatenate(value) for key, value in evaluator.num_candidates.items()},
)
for (relation_id, relation_label), group in df_temp.groupby(by=["relation_id", "relation_label"]):
    # now reconstruct the dictionaries
    relation_ranks = {}
    relation_num_candidates = {}
    for column in group.columns:
        if column.startswith("rank-"):
            relation_ranks[tuple(column.split("-"))[1:]] = [group[column].values]
        elif column.startswith("num_candidates"):
            relation_num_candidates[tuple(column.split("-"))[1]] = [group[column].values]
    # and calculate metrics
    results = RankBasedMetricResults.from_ranks(
        metrics=evaluator.metrics,
        rank_and_candidates=_iter_ranks(ranks=relation_ranks, num_candidates=relation_num_candidates),
    )
    # here, you could also store the results in any other way
    print(relation_label, results.get_metric("mrr"))
```
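To mirror the multi-model setup from the question, the grouping logic can be wrapped in a small helper and the per-predicate numbers collected into a single dataframe. This is only an untested sketch, not part of the original answer: `per_relation_metrics` is a hypothetical helper name, it re-uses `dataset` and `test_factory` from the example above, and it assumes the `get_trained_models()` dictionary from the question.

```python
import numpy
import pandas as pd

from pykeen.evaluation import RankBasedEvaluator, RankBasedMetricResults
from pykeen.evaluation.rank_based_evaluator import _iter_ranks


def per_relation_metrics(model, test_factory, filter_triples, metric_names=("mrr",)):
    """Hypothetical helper: evaluate once, then compute metrics per relation group."""
    evaluator = RankBasedEvaluator(clear_on_finalize=False)
    evaluator.evaluate(
        model=model,
        mapped_triples=test_factory.mapped_triples,
        additional_filter_triples=filter_triples,
    )
    # same per-relation grouping as in the example above
    df = test_factory.tensor_to_df(
        tensor=test_factory.mapped_triples,
        **{"-".join(("rank",) + key): numpy.concatenate(value) for key, value in evaluator.ranks.items()},
        **{"-".join(("num_candidates", key)): numpy.concatenate(value) for key, value in evaluator.num_candidates.items()},
    )
    rows = []
    for (relation_id, relation_label), group in df.groupby(by=["relation_id", "relation_label"]):
        relation_ranks, relation_num_candidates = {}, {}
        for column in group.columns:
            if column.startswith("rank-"):
                relation_ranks[tuple(column.split("-"))[1:]] = [group[column].values]
            elif column.startswith("num_candidates"):
                relation_num_candidates[tuple(column.split("-"))[1]] = [group[column].values]
        results = RankBasedMetricResults.from_ranks(
            metrics=evaluator.metrics,
            rank_and_candidates=_iter_ranks(ranks=relation_ranks, num_candidates=relation_num_candidates),
        )
        rows.append({"P": relation_label, **{name: results.get_metric(name) for name in metric_names}})
    return rows


# one evaluation pass per model instead of one per (model, predicate) pair
all_rows = []
for model_name, entry in get_trained_models().items():  # loading function from the question
    for row in per_relation_metrics(
        model=entry["model"],
        test_factory=test_factory,
        filter_triples=[dataset.training.mapped_triples, dataset.validation.mapped_triples],
    ):
        row["model"] = model_name
        all_rows.append(row)
metrics = pd.DataFrame(all_rows)
```

With this structure, the number of `evaluate()` calls drops from 211 × 4 to 4, which is where the speed-up comes from; the per-predicate metrics are then computed from the cached ranks without re-scoring any triples.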
-
Since these entities do not occur in any training triple, it does not make much sense to include them in a transductive evaluation (i.e., when using models that rely on learned representations for a given index; cf. our tutorial on inductive link prediction). Thus, we exclude such entities from the evaluation. In theory, you could enforce additional entries in the entity-to-id mapping; this would have to be done before training the respective models, and the models would only ever see these entities as part of negative samples.
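For illustration (not from the original reply), a minimal sketch of what enforcing such additional entries could look like, assuming the entity-to-id mapping is extended before the training factory is built; the file paths are placeholders for your own Wikidata5M files:

```python
import pandas as pd

from pykeen.triples import TriplesFactory

# placeholder paths; adapt to your setup
TRAIN_PATH = "knowledge_graph/wikidata5m_transductive_train.txt"
TEST_PATH = "knowledge_graph/wikidata5m_transductive_test.txt"

# build a factory from the training triples to obtain the base mapping
base = TriplesFactory.from_path(path=TRAIN_PATH)

# collect entity labels that appear only in the test set
test_df = pd.read_csv(TEST_PATH, sep="\t", names=["S", "P", "O"], header=None)
extra_entities = sorted(set(test_df["S"]).union(test_df["O"]) - set(base.entity_to_id))

# enforce additional entries in the entity-to-id mapping
entity_to_id = dict(base.entity_to_id)
for offset, label in enumerate(extra_entities, start=len(entity_to_id)):
    entity_to_id[label] = offset

# re-create the factories with the enforced mapping; models trained on `training`
# then also reserve embeddings for the test-only entities
training = TriplesFactory.from_path(path=TRAIN_PATH, entity_to_id=entity_to_id, relation_to_id=base.relation_to_id)
testing = TriplesFactory.from_path(path=TEST_PATH, entity_to_id=entity_to_id, relation_to_id=base.relation_to_id)
```

As noted above, these extra entities would only ever appear in negative samples during training, so their embeddings receive little to no training signal; whether that is acceptable depends on your use case.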
Assuming that this is the …