
🧪🔢 Reproducing results of LiteralE #1226

Open · AntonisKl wants to merge 84 commits into base: master
Conversation

AntonisKl (Contributor) commented Feb 13, 2023

Link to the relevant Bug(s)

This PR fixes #1211.

Dependencies

Description of the Change

This PR aims to facilitate the reproduction of the results mentioned in LiteralE's original paper.

Along with fixing #1211, this PR introduces:

  1. Base classes for remote datasets that include numeric attributive triples and whose relation triples are packed in an archive (zip or tar).
  2. Numeric-literal-extended versions of two datasets from LiteralE's original paper, FB15k-237 and YAGO3-10, named FB15k237WithLiterals and YAGO310WithLiterals respectively.
  3. Two optional preprocessing steps (expected in dataset_kwargs) for datasets that include numeric attributive triples:
     a. Numeric triples preprocessing (arg: numeric_triples_preprocessing), e.g. filtering them by their relations using the new filter_triples_by_relations() function (also used in LiteralE's pipeline).
     b. Numeric literals preprocessing (arg: numeric_literals_preprocessing), e.g. normalizing them using the new minmax_normalize() function (also used in LiteralE's pipeline).
  4. Pipeline configurations for reproducing LiteralE's original results.
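The two preprocessing helpers named above can be sketched roughly as follows. This is a minimal illustration only: the real PyKEEN signatures may differ, so the argument shapes here are assumptions, not the PR's actual code.

```python
import numpy as np

# Illustrative sketch of filter_triples_by_relations() and minmax_normalize();
# the signatures are assumed, not taken from the PR.

def filter_triples_by_relations(triples: np.ndarray, relations) -> np.ndarray:
    """Keep only (head, relation, tail) rows whose relation is in `relations`."""
    mask = np.isin(triples[:, 1], list(relations))
    return triples[mask]

def minmax_normalize(literals: np.ndarray) -> np.ndarray:
    """Scale each numeric-literal column to [0, 1]; constant columns map to 0."""
    mins = literals.min(axis=0, keepdims=True)
    maxs = literals.max(axis=0, keepdims=True)
    denom = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (literals - mins) / denom
```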

Possible Drawbacks

There should not be any side effects, as this PR only adds new, optional functionality.

Verification Process

Multiple pipeline executions with the new datasets and with/without the new preprocessing functions via:

  1. The pipeline() function with the appropriate arguments (see example below).

    pipeline(
        model='DistMultLiteral',
        dataset='FB15k237WithLiterals',
        dataset_kwargs=dict(
            create_inverse_triples=True,
            numeric_literals_preprocessing='minmax',
            numeric_triples_preprocessing='filter_by_relations',
            force=True,
        ),
        epochs=100,
        stopper='early',
        stopper_kwargs=dict(metric='inverse_harmonic_mean_rank', frequency=3),
        result_tracker='console',
        model_kwargs=dict(embedding_dim=200, input_dropout=0.2),
        loss='BCEWithLogitsLoss',
        training_kwargs=dict(batch_size=128, label_smoothing=0.1),
        optimizer_kwargs=dict(lr=0.001),
        training_loop='LCWATrainingLoop',
    )

    Note that passing force=True in dataset_kwargs is necessary for the preprocessing functions to be applied in case a previous experiment with the same dataset has already been performed.

  2. The pipeline_from_path() function, which loads the pipeline configuration from a file (see example below).

    pipeline_from_path("<path_to_pykeen>/experiments/literale/kristiadi2019_distmult+literale_glin_fb15k237.yaml")

    <path_to_pykeen> should be replaced with the correct pykeen installation path.
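One way to fill in <path_to_pykeen> programmatically, using only the standard library. This is a sketch; where the experiments directory actually lives relative to the returned path depends on how PyKEEN was installed or checked out.

```python
import importlib.util
import pathlib

def package_root(name: str) -> pathlib.Path:
    """Return the directory of an installed package (e.g. 'pykeen')."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package {name!r} is not installed")
    return pathlib.Path(spec.origin).parent

# Hypothetical usage: package_root("pykeen") yields the installed package
# directory; resolve the experiments/literale/... config relative to it.
```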

Release Notes

Added necessary datasets, dataset-related components, and pipeline configurations for reproducing LiteralE's original results.

@@ -76,13 +96,21 @@ def from_path(
path: Union[str, pathlib.Path, TextIO],
*,
path_to_numeric_triples: Union[None, str, pathlib.Path, TextIO] = None,
numeric_triples_preprocessing: Optional[Union[str, Callable[[LabeledTriples], LabeledTriples]]] = None,
numeric_literals_preprocessing: Optional[Union[str, Callable[[np.ndarray], np.ndarray]]] = None,
Member
I'm skeptical about these pre-processing functions since there is only one example of each. It seems like, without a variety, these should just be yes/no flags instead of an extensible system that's going to add one more level of complexity for maintenance. Can you do the following:

  1. Add additional pre-processing functions for triples/literals
  2. Give specific examples where they are relevant (both in comment form on this PR and as documentation in code!)

Contributor Author

The motivation behind adding these arguments is to allow users to conveniently experiment with new preprocessing functions. Also, there exist published models with different preprocessing methods (e.g. the one mentioned in #1207) that need to be easily integrated into PyKEEN.

AntonisKl (Contributor Author) commented Feb 14, 2023

I added example usage integrated in the docstring of the functions that have these arguments.

Also, I added the corresponding ...kwargs parameters to pair with the new preprocessing functions, so that the user can also specify the functions' arguments through the pipeline() call (see commit dde23df).
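A string-or-callable argument paired with a ...kwargs dict can be resolved with a small dispatch helper, roughly like this. The registry contents and names below are illustrative assumptions, not the PR's actual code.

```python
from typing import Any, Callable, Dict, Mapping, Optional, Union

# Illustrative registry; the PR maps names such as 'minmax' and
# 'filter_by_relations' to the corresponding functions.
_REGISTRY: Dict[str, Callable[..., Any]] = {
    "scale": lambda values, factor=1.0: [v * factor for v in values],
}

def resolve_preprocessor(
    preprocessing: Optional[Union[str, Callable[..., Any]]],
    preprocessing_kwargs: Optional[Mapping[str, Any]] = None,
) -> Optional[Callable[[Any], Any]]:
    """Turn a name-or-callable plus optional kwargs into a one-argument function."""
    if preprocessing is None:
        return None
    func = _REGISTRY[preprocessing] if isinstance(preprocessing, str) else preprocessing
    kwargs = dict(preprocessing_kwargs or {})
    return lambda data: func(data, **kwargs)
```

This keeps the user-facing surface a single pair of arguments while still allowing arbitrary callables to be plugged in.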

cthoyt (Member) left a comment

Hi @AntonisKl, thanks for submitting this PR. I took a quick pass and I think the idea here is interesting. However, it doesn't yet fit the PyKEEN vibe and will both require some refactoring and some more contextualization (e.g., via documentation).

Could you please make sure you're passing CI? (run tox locally) After that, we can take a second look.

AntonisKl (Contributor Author)


Hi @cthoyt, I appreciate your quick feedback. I answered your review comments, made the required changes and ran tox locally. Please check and let me know if there are more suggested changes.

cthoyt (Member) commented Feb 14, 2023

@AntonisKl please try again, a bit more carefully this time. There remain many flake8 issues. It won't be a good use of my limited time giving you style feedback nor reading code that's not up to PyKEEN standard. Is there some confusion on how to interpret the results of running tox?

I will note that the CI in the PR is not running properly, which is also frustrating. I will try to get it working properly as well.

@cthoyt cthoyt self-requested a review February 14, 2023 13:32
AntonisKl (Contributor Author)

@cthoyt when I ran tox I saw the results, but because several of the errors were misleading (e.g. HTTP Error 404: Not Found), I thought they were not related to my changes. I will look into the tox output more carefully and make the required changes.

AntonisKl (Contributor Author) commented Feb 27, 2023

@cthoyt I also added the required random seeds for reproducing LiteralE and fixed tox issues. Let me know if you have further feedback.

EDIT: I was informed by an author of LiteralE about the early stopping mechanism that they incorporated, so I will make the necessary changes and re-request review.

AntonisKl and others added 27 commits May 22, 2023 15:38
Co-authored-by: Antonis Klironomos <antonisklironomos@gmail.com>
ID-based mapping of numeric triples
Development

Successfully merging this pull request may close these issues.

Unexpected metric values during reproduction of LiteralE results