
🧪🔢 Reproducing results of LiteralE #1226

Open · AntonisKl wants to merge 84 commits into base: master
Conversation

AntonisKl (Contributor) commented Feb 13, 2023

Link to the relevant Bug(s)

This PR fixes #1211.

Dependencies

Description of the Change

This PR aims to facilitate the reproduction of the results mentioned in LiteralE's original paper.

Along with fixing #1211, this PR introduces:

  1. Base classes for remote datasets that include numeric attributive triples and whose relation triples are packed in an archive (zip or tar).
  2. Numeric-literal-extended versions of two datasets from LiteralE's original paper, FB15k-237 and YAGO3-10, named FB15k237WithLiterals and YAGO310WithLiterals respectively.
  3. Two optional preprocessing steps (expected in dataset_kwargs) for datasets that include numeric attributive triples:
     a. Numeric triples preprocessing (arg: numeric_triples_preprocessing), e.g. filtering them by their relations using the new filter_triples_by_relations() function (also used in LiteralE's pipeline).
     b. Numeric literals preprocessing (arg: numeric_literals_preprocessing), e.g. normalizing them using the new minmax_normalize() function (also used in LiteralE's pipeline).
  4. Pipeline configurations for reproducing LiteralE's original results.
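The two preprocessing helpers named above can be sketched roughly as follows. This is a minimal illustration only: the real PyKEEN signatures may differ, so the argument shapes here are assumptions, not the PR's actual code.

```python
import numpy as np

# Illustrative sketch of filter_triples_by_relations() and minmax_normalize();
# the signatures are assumed, not taken from the PR.

def filter_triples_by_relations(triples: np.ndarray, relations) -> np.ndarray:
    """Keep only (head, relation, tail) rows whose relation is in `relations`."""
    mask = np.isin(triples[:, 1], list(relations))
    return triples[mask]

def minmax_normalize(literals: np.ndarray) -> np.ndarray:
    """Scale each numeric-literal column to [0, 1]; constant columns map to 0."""
    mins = literals.min(axis=0, keepdims=True)
    maxs = literals.max(axis=0, keepdims=True)
    denom = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (literals - mins) / denom
```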

Possible Drawbacks

There should not be any side effects, as this PR only adds new, optional functionality.

Verification Process

Multiple pipeline executions with the new datasets and with/without the new preprocessing functions via:

  1. The pipeline() function with the appropriate arguments (see example below).

    pipeline(
        model='DistMultLiteral',
        dataset='FB15k237WithLiterals',
        dataset_kwargs=dict(
            create_inverse_triples=True,
            numeric_literals_preprocessing='minmax',
            numeric_triples_preprocessing='filter_by_relations',
            force=True,
        ),
        epochs=100,
        stopper='early',
        stopper_kwargs=dict(metric='inverse_harmonic_mean_rank', frequency=3),
        result_tracker='console',
        model_kwargs=dict(embedding_dim=200, input_dropout=0.2),
        loss='BCEWithLogitsLoss',
        training_kwargs=dict(batch_size=128, label_smoothing=0.1),
        optimizer_kwargs=dict(lr=0.001),
        training_loop='LCWATrainingLoop',
    )

    Note that passing force=True in dataset_kwargs is necessary for the preprocessing functions to be applied in case a previous experiment with the same dataset has already been performed.

  2. The pipeline_from_path() function, which loads the pipeline configuration from a file (see example below).

    pipeline_from_path("<path_to_pykeen>/experiments/literale/kristiadi2019_distmult+literale_glin_fb15k237.yaml")

    <path_to_pykeen> should be replaced with the correct pykeen installation path.
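One way to fill in <path_to_pykeen> programmatically, using only the standard library. This is a sketch; where the experiments directory actually lives relative to the returned path depends on how PyKEEN was installed or checked out.

```python
import importlib.util
import pathlib

def package_root(name: str) -> pathlib.Path:
    """Return the directory of an installed package (e.g. 'pykeen')."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package {name!r} is not installed")
    return pathlib.Path(spec.origin).parent

# Hypothetical usage: package_root("pykeen") yields the installed package
# directory; resolve the experiments/literale/... config relative to it.
```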

Release Notes

Added necessary datasets, dataset-related components, and pipeline configurations for reproducing LiteralE's original results.

@@ -76,13 +96,21 @@ def from_path(
path: Union[str, pathlib.Path, TextIO],
*,
path_to_numeric_triples: Union[None, str, pathlib.Path, TextIO] = None,
numeric_triples_preprocessing: Optional[Union[str, Callable[[LabeledTriples], LabeledTriples]]] = None,
numeric_literals_preprocessing: Optional[Union[str, Callable[[np.ndarray], np.ndarray]]] = None,
Member
I'm skeptical about these pre-processing functions since there is only one example of each. It seems like, without a variety, these should just be yes/no flags instead of an extensible system that's going to add one more level of complexity for maintenance. Can you do the following:

  1. Add additional pre-processing functions for triples/literals
  2. Give specific examples where they are relevant (both in comment form on this PR and as documentation in code!)

Contributor Author

The motivation behind adding these arguments is to allow users to conveniently experiment with new preprocessing functions. Also, there exist published models with different preprocessing methods (e.g. the one mentioned in #1207) that need to be easily integrated into PyKEEN.

AntonisKl (Contributor Author) commented Feb 14, 2023

I added example usage integrated in the docstring of the functions that have these arguments.

Also, I added the corresponding ...kwargs parameters to pair with the new preprocessing functions, so that the user can also specify the functions' arguments through the pipeline() call (see commit dde23df).
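A string-or-callable argument paired with a ...kwargs dict can be resolved with a small dispatch helper, roughly like this. The registry contents and names below are illustrative assumptions, not the PR's actual code.

```python
from typing import Any, Callable, Dict, Mapping, Optional, Union

# Illustrative registry; the PR maps names such as 'minmax' and
# 'filter_by_relations' to the corresponding functions.
_REGISTRY: Dict[str, Callable[..., Any]] = {
    "scale": lambda values, factor=1.0: [v * factor for v in values],
}

def resolve_preprocessor(
    preprocessing: Optional[Union[str, Callable[..., Any]]],
    preprocessing_kwargs: Optional[Mapping[str, Any]] = None,
) -> Optional[Callable[[Any], Any]]:
    """Turn a name-or-callable plus optional kwargs into a one-argument function."""
    if preprocessing is None:
        return None
    func = _REGISTRY[preprocessing] if isinstance(preprocessing, str) else preprocessing
    kwargs = dict(preprocessing_kwargs or {})
    return lambda data: func(data, **kwargs)
```

This keeps the user-facing surface a single pair of arguments while still allowing arbitrary callables to be plugged in.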

cthoyt (Member) left a comment

Hi @AntonisKl, thanks for submitting this PR. I took a quick pass and I think the idea here is interesting. However, it doesn't yet fit the PyKEEN vibe and will both require some refactoring and some more contextualization (e.g., via documentation).

Could you please make sure you're passing CI? (run tox locally) After that, we can take a second look.

AntonisKl (Contributor Author)


Hi @cthoyt, I appreciate your quick feedback. I answered your review comments, made the required changes and ran tox locally. Please check and let me know if there are more suggested changes.

cthoyt (Member) commented Feb 14, 2023

@AntonisKl please try again, a bit more carefully this time. There remain many flake8 issues. It won't be a good use of my limited time giving you style feedback nor reading code that's not up to PyKEEN standard. Is there some confusion on how to interpret the results of running tox?

I will note that the CI in the PR is not running properly, which is also frustrating. I will try to get it working properly as well.

@cthoyt cthoyt self-requested a review February 14, 2023 13:32
AntonisKl (Contributor Author)

@cthoyt when I ran tox I saw the results, but because several of the errors were misleading (e.g. HTTP Error 404: Not Found), I thought they were not related to my changes. I will look into the tox output more carefully and make the required changes.

AntonisKl (Contributor Author) commented Feb 27, 2023

@cthoyt I also added the required random seeds for reproducing LiteralE and fixed tox issues. Let me know if you have further feedback.

EDIT: I was informed by an author of LiteralE about the early stopping mechanism that they incorporated, so I will make the necessary changes and re-request review.

AntonisKl and others added 27 commits May 22, 2023 15:38
Co-authored-by: Antonis Klironomos <antonisklironomos@gmail.com>
ID-based mapping of numeric triples
Development

Successfully merging this pull request may close these issues.

Unexpected metric values during reproduction of LiteralE results