How to evaluate metrics on many subsets with multiple models efficiently #1360
-
For a project, I need to evaluate metrics on many small subsets of the Wikidata5M test set. Specifically, I need evaluation metrics (such as Hits@K, arithmetic mean rank, etc.) for every predicate of the test set in order to decide which model to use for prediction. To this end, I trained the ComplEx, DistMult, SimplE, and TransE models with an embedding dimension of 32 and split the test set of Wikidata5M into predicate groups like this:

```python
import pandas as pd

def get_test_set_per_predicate(test_set_file):
    # [subject, predicate, object]
    test_set = pd.read_csv(test_set_file, sep="\t", names=['S', 'P', 'O'], header=None)
    return test_set.groupby('P')
```
I'm loading the trained models like this:

```python
import torch

from pykeen.triples import TriplesFactory

def get_trained_models():
    return {
        'ComplEx': {
            'model': torch.load('embeddings/ComplEx/trained_model.pkl'),
            'factory': TriplesFactory.from_path_binary('embeddings/ComplEx/training_triples')
        },
        'DistMult': {
            ...
        },
        'SimplE': {
            ...
        },
        'TransE': {
            ...
        }
    }
```

The test set has 211 unique predicates, so the resulting dataframe consists of 211 groups. For evaluation, I iterate over all test subsets and all trained models:

```python
from pykeen.evaluation import RankBasedEvaluator

test_splits = get_test_set_per_predicate('knowledge_graph/wikidata5m_transductive_test.txt')
trained_models = get_trained_models()  # dict of loaded model instances and triples factories

# Cycle through all predicate groups
for predicate, triples in test_splits:
    evaluator = RankBasedEvaluator()
    # Convert the subset dataframe into a triples factory
    test_factory = TriplesFactory.from_labeled_triples(triples=triples.values)
    # Cycle through all trained models
    metrics = pd.DataFrame()  # Dataframe for accumulating results
    for model_name, result in trained_models.items():
        model = result['model']
        training_factory = result['factory']  # The loaded factory of the training data
        model_metrics = evaluator.evaluate(
            model=model,
            mapped_triples=test_factory.mapped_triples,
            additional_filter_triples=[
                dataset.training.mapped_triples,  # dataset is the Wikidata5M instance of the library
                dataset.validation.mapped_triples
            ]
        ).to_dict()
        flattened_metrics = pd.concat([
            # Aggregating optimistic, realistic and pessimistic metrics into one dataframe
            # and adding a new column for the metric type
        ])
        flattened_metrics['P'] = predicate
        flattened_metrics['model'] = model_name
        metrics = pd.concat([metrics, flattened_metrics])
```

One evaluation run takes about 1:30 minutes, which sums up to about 21 hours for all 211 predicates and 4 models. This is very inefficient, so I'm asking whether there is a better way of evaluating so many subsets, or whether it is possible to get the individual metrics from the evaluator. I tried to compute the metrics myself using … Is there a more efficient way to compute this, or am I missing something? Also, please let me know if there are errors in the code.

Environment
-
Hi @mortenterhart, there is a parameter `clear_on_finalize` of `RankBasedEvaluator` that should allow you to evaluate these metrics more efficiently. Below I provide a commented example that should be easy to adapt to your case.

```python
import numpy
from pykeen.datasets import get_dataset
from pykeen.datasets.nations import NATIONS_TEST_PATH
from pykeen.evaluation import RankBasedEvaluator, RankBasedMetricResults
from pykeen.evaluation.rank_based_evaluator import _iter_ranks
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory
# note: replace this by loading your custom dataset & model; it's only here to be self-contained
dataset = get_dataset(dataset="nations")
result = pipeline(dataset=dataset, model="complex", training_kwargs=dict(num_epochs=1))
model = result.model
# note: it is important to re-use the same entity/relation-to-id mapping!
test_factory = TriplesFactory.from_path(
    path=NATIONS_TEST_PATH,
    entity_to_id=dataset.training.entity_to_id,
    relation_to_id=dataset.training.relation_to_id,
)
# instantiate evaluator and evaluate on the full test set once
# afterwards, we'll have evaluator.ranks & evaluator.num_candidates filled
evaluator = RankBasedEvaluator(clear_on_finalize=False)
evaluator.evaluate(
    model=model,
    mapped_triples=test_factory.mapped_triples,
    additional_filter_triples=[
        dataset.training.mapped_triples,
        dataset.validation.mapped_triples,
    ],
)
# now comes the custom part:
# evaluator.ranks & evaluator.num_candidates contain the rank / number of candidates for each individual evaluation batch
# importantly, they are in the same order as the evaluation triples
# hence, we can perform the selection / grouping by relation directly on these!
# we'll compose some intermediate dataframes to make these operations easier
df_temp = test_factory.tensor_to_df(
    tensor=test_factory.mapped_triples,
    **{"-".join(("rank",) + key): numpy.concatenate(value) for key, value in evaluator.ranks.items()},
    **{"-".join(("num_candidates", key)): numpy.concatenate(value) for key, value in evaluator.num_candidates.items()},
)
for (relation_id, relation_label), group in df_temp.groupby(by=["relation_id", "relation_label"]):
    # now reconstruct the dictionaries
    relation_ranks = {}
    relation_num_candidates = {}
    for column in group.columns:
        if column.startswith("rank-"):
            relation_ranks[tuple(column.split("-"))[1:]] = [group[column].values]
        elif column.startswith("num_candidates"):
            relation_num_candidates[tuple(column.split("-"))[1]] = [group[column].values]
    # and calculate metrics
    results = RankBasedMetricResults.from_ranks(
        metrics=evaluator.metrics,
        rank_and_candidates=_iter_ranks(ranks=relation_ranks, num_candidates=relation_num_candidates),
    )
    # here, you could also store the results in any other way
    print(relation_label, results.get_metric("mrr"))
```
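To mirror the multi-model setup from the question, the grouping logic can be wrapped in a small helper and the per-predicate numbers collected into a single dataframe. This is only an untested sketch, not part of the original answer: `per_relation_metrics` is a hypothetical helper name, it re-uses `dataset` and `test_factory` from the example above, and it assumes the `get_trained_models()` dictionary from the question.

```python
import numpy
import pandas as pd

from pykeen.evaluation import RankBasedEvaluator, RankBasedMetricResults
from pykeen.evaluation.rank_based_evaluator import _iter_ranks


def per_relation_metrics(model, test_factory, filter_triples, metric_names=("mrr",)):
    """Hypothetical helper: evaluate once, then compute metrics per relation group."""
    evaluator = RankBasedEvaluator(clear_on_finalize=False)
    evaluator.evaluate(
        model=model,
        mapped_triples=test_factory.mapped_triples,
        additional_filter_triples=filter_triples,
    )
    # same per-relation grouping as in the example above
    df = test_factory.tensor_to_df(
        tensor=test_factory.mapped_triples,
        **{"-".join(("rank",) + key): numpy.concatenate(value) for key, value in evaluator.ranks.items()},
        **{"-".join(("num_candidates", key)): numpy.concatenate(value) for key, value in evaluator.num_candidates.items()},
    )
    rows = []
    for (relation_id, relation_label), group in df.groupby(by=["relation_id", "relation_label"]):
        relation_ranks, relation_num_candidates = {}, {}
        for column in group.columns:
            if column.startswith("rank-"):
                relation_ranks[tuple(column.split("-"))[1:]] = [group[column].values]
            elif column.startswith("num_candidates"):
                relation_num_candidates[tuple(column.split("-"))[1]] = [group[column].values]
        results = RankBasedMetricResults.from_ranks(
            metrics=evaluator.metrics,
            rank_and_candidates=_iter_ranks(ranks=relation_ranks, num_candidates=relation_num_candidates),
        )
        rows.append({"P": relation_label, **{name: results.get_metric(name) for name in metric_names}})
    return rows


# one evaluation pass per model instead of one per (model, predicate) pair
all_rows = []
for model_name, entry in get_trained_models().items():  # loading function from the question
    for row in per_relation_metrics(
        model=entry["model"],
        test_factory=test_factory,
        filter_triples=[dataset.training.mapped_triples, dataset.validation.mapped_triples],
    ):
        row["model"] = model_name
        all_rows.append(row)
metrics = pd.DataFrame(all_rows)
```

With this structure, the number of `evaluate()` calls drops from 211 × 4 to 4, which is where the speed-up comes from; the per-predicate metrics are then computed from the cached ranks without re-scoring any triples.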
-
Since these entities do not occur in any training triple, it does not make much sense to include them in a transductive evaluation (i.e., when using models that rely on learned representations for a given index; cf. our tutorial on inductive link prediction). Thus, we exclude such entities from the evaluation. In theory, you could enforce additional entries in the entity-to-id mapping; this would have to be done before training the respective models, and the models would only ever see these entities as part of negative samples.
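For illustration (not from the original reply), a minimal sketch of what enforcing such additional entries could look like, assuming the entity-to-id mapping is extended before the training factory is built; the file paths are placeholders for your own Wikidata5M files:

```python
import pandas as pd

from pykeen.triples import TriplesFactory

# placeholder paths; adapt to your setup
TRAIN_PATH = "knowledge_graph/wikidata5m_transductive_train.txt"
TEST_PATH = "knowledge_graph/wikidata5m_transductive_test.txt"

# build a factory from the training triples to obtain the base mapping
base = TriplesFactory.from_path(path=TRAIN_PATH)

# collect entity labels that appear only in the test set
test_df = pd.read_csv(TEST_PATH, sep="\t", names=["S", "P", "O"], header=None)
extra_entities = sorted(set(test_df["S"]).union(test_df["O"]) - set(base.entity_to_id))

# enforce additional entries in the entity-to-id mapping
entity_to_id = dict(base.entity_to_id)
for offset, label in enumerate(extra_entities, start=len(entity_to_id)):
    entity_to_id[label] = offset

# re-create the factories with the enforced mapping; models trained on `training`
# then also reserve embeddings for the test-only entities
training = TriplesFactory.from_path(path=TRAIN_PATH, entity_to_id=entity_to_id, relation_to_id=base.relation_to_id)
testing = TriplesFactory.from_path(path=TEST_PATH, entity_to_id=entity_to_id, relation_to_id=base.relation_to_id)
```

As noted above, these extra entities would only ever appear in negative samples during training, so their embeddings receive little to no training signal; whether that is acceptable depends on your use case.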
Assuming that this is the …