🐍 🌟 Hyper-Relational Statement Factory #1117
base: master
Conversation
Loading of hyper-relational WD50K datasets works; you can already try it with something like `from pykeen.datasets.wd50k import WD50K_100`:

```python
from pykeen.datasets.wd50k import WD50K_100


def main():
    dataset = WD50K_100(
        create_inverse_triples=True,
        max_num_qualifier_pairs=5,
        eager=True,
    )
    print(dataset.num_entities)


if __name__ == '__main__':
    main()
```
```python
if len(column_remapping) > max_len:
    raise ValueError("remapping must have length not more than the max statement length")

# TODO find a way to load files w/o knowing max_len in advance
```
You can open a file handle, read just the first row, split by the delimiter, then count. Then just close the file handle.
Ah, if only it were that easy 😢 Statements in the input files are not sorted "longest to shortest", so the longest statement might appear somewhere in the middle of the file, e.g.

```
h, r, t, qr1, qv1   # the first row does not have padding commas
...
h, r, t, qr1, qv1, qr2, qv2, qr3, qv3
```

so we'd still need to read through the whole file to identify the maximum length.
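That single pass is still cheap, since it only splits and counts without building any arrays. A minimal sketch of what such a pre-scan could look like (the function name, comma delimiter, and plain-text format are assumptions, not the PR's actual loader):

```python
import csv


def find_max_statement_length(path: str, delimiter: str = ",") -> int:
    """Scan the whole file once and return the largest column count.

    Rows are not sorted by length, so every row must be inspected.
    """
    max_len = 0
    with open(path, newline="") as file:
        for row in csv.reader(file, delimiter=delimiter):
            max_len = max(max_len, len(row))
    return max_len
```

The result could then be passed to `pandas.read_csv` as the number of columns for the real (fast) load.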
Can we please add a small test dataset and demonstrate loading it in unit tests?
src/pykeen/datasets/base.py
Outdated
```
@@ -944,3 +945,105 @@ def _get_df(self) -> pd.DataFrame:
    df = df[usecols]

    return df


class HyperRelationalUnpackedRemoteDataset(PathDataset):
```
I think we need a different base class mixin for this kind of dataset to note that the triples factories have to be StatementFactories.
Overall, nice to see that hyper-relational statements may find their way into PyKEEN 🚀
I added a few comments; I think we can merge/re-use some of the new code with existing triple utilities.
```python
TRIPLES_VALID_URL = f"{BASE_URL}/triples/valid.txt"
TRIPLES_TEST_URL = f"{BASE_URL}/triples/test.txt"
TRIPLES_TRAIN_URL = f"{BASE_URL}/triples/train.txt"
BASE_URL = "https://raw.githubusercontent.com/migalkin/StarE/master/data/clean"
```
Could make sense to put this on Zenodo.
```yaml
arxiv: 2009.10847
github: migalkin/StarE
statistics:
  entities: 47,156 (5,460 qualifier-only)
```
@cthoyt do we rely somewhere on parsing these statistics to integers? If yes, we should store the second number under a different key
```python
logger = logging.getLogger(__name__)

INVERSE_SUFFIX = "_inverse"
TRIPLES_DF_COLUMNS = ("head_id", "head_label", "relation_id", "relation_label", "tail_id", "tail_label")
```
These may already exist as constants, either in pykeen/constants.py or pykeen/typing.py.
Suggested change:

```diff
-    entity_to_id,
-    relation_to_id,
+    entity_to_id: Mapping[str, int],
+    relation_to_id: Mapping[str, int],
```
```python
    entity_to_id,
    relation_to_id,
) -> Tuple[torch.LongTensor, Collection[np.ndarray]]:
    if statements.size == 0:
```
Can we re-use code from pykeen/triples/triples_factory.py?
```python
entity_to_id=self.entity_to_id,
relation_to_id=self.relation_to_id,
labels=labels,
# entity_mask=entity_mask,
```
TODO
```python
    """Return the percentage of statements with qualifiers."""
    return (~(self.mapped_statements[:, 3::2] == self.padding_idx).all(dim=1)).sum().item() / self.num_statements

def extra_repr(self) -> str:
```
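The padding-based qualifier check can be verified on a toy tensor. This sketch assumes `padding_idx = 0` and the `(h, r, t, qr1, qv1, qr2, qv2, ...)` column layout, so qualifier relations sit at columns 3, 5, ...; the tensor values are illustrative, not real data:

```python
import torch

padding_idx = 0
# three statements padded to max length 7: (h, r, t, qr1, qv1, qr2, qv2)
mapped_statements = torch.as_tensor([
    [1, 1, 2, 0, 0, 0, 0],  # plain triple: qualifier slots are all padding
    [1, 2, 3, 4, 5, 0, 0],  # one qualifier pair
    [2, 1, 3, 4, 5, 6, 7],  # two qualifier pairs
])
# a statement has qualifiers iff not all of its qualifier-relation
# columns (3, 5, ...) equal the padding index
has_qualifiers = ~(mapped_statements[:, 3::2] == padding_idx).all(dim=1)
ratio = has_qualifiers.sum().item() / mapped_statements.shape[0]
```

Here two of the three statements carry qualifiers, so the ratio comes out to 2/3.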
Better to overwrite iter_extra_repr and yield from super.
Suggested change:

```diff
-    def __repr__(self) -> str:
-        return f"{self.__class__.__name__}({self.extra_repr()})"
```

not necessary
```python
def create_statement_entity_mapping(statements: Iterable[str]):
    entities = list(sorted(set([e for s in statements for e in s[::2]])))  # padding is already in the ndarray
```
The comment seems to assume a numpy array, but the parameter type annotation says `str`.
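For comparison, a variant whose annotation matches the even-position slicing: each statement is a sequence of labels laid out as `(h, r, t, qr1, qv1, ...)`, so entities occupy the even indices. This is only a sketch of the idea, not the PR's actual code:

```python
from typing import Dict, Iterable, Sequence


def create_statement_entity_mapping(
    statements: Iterable[Sequence[str]],
) -> Dict[str, int]:
    """Collect entity labels from even positions and assign sorted ids."""
    # even indices hold head, tail, and qualifier values;
    # odd indices hold relations and are skipped by s[::2]
    entities = sorted({e for s in statements for e in s[::2]})
    return {entity: index for index, entity in enumerate(entities)}
```

With `Iterable[str]`, `s[::2]` would silently iterate over every other character of each string, which is why the mismatch matters.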
```python
    return s_p_q_to_multi_tails_new


def _create_multi_label_instances(
```
can we re-use this code for normal triples? If yes, then we can just extend the existing function.
This PR introduces a series of efforts on integrating hyper-relational graphs (aka RDF* or LPG) into PyKEEN with factories, datasets, and models.
Here is the adaptation of the code of our ISWC'21 paper:

- statement factories (counterparts of `CoreTriplesFactory` and `TriplesFactory`)
- `max_num_qualifier_pairs` to limit the maximum statement length
- loading statements of various sizes: we now have padding entities/relations; right now they are the last entries of `entity_to_id` and `relation_to_id`. They are auxiliary and, for instance, we won't need to build an inverse type for the padding relation - this is a bit similar to NodePiece with its auxiliary tokens.

Some caveats:

- `pandas.read_csv` requires a pre-defined number of columns to load (otherwise it crashes). So it's either loading speed with pandas, or dummy for loops (but preserving the whole dataset).
- A dense `(num_statements, max_len)` tensor is memory-inefficient - a lot of entries will be padding indices. We have functions to build a sparse edge_index and qualifier_index, but they will be applicable for GNN encoders, not for training/evaluation. Ideally, we'd need some sort of RaggedTensor from TensorFlow, but there is no standard implementation of that in PyTorch.

The first "egg" of the project "PyKEEN Hyper-Relational" 😅
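The padding caveat can be illustrated with a small sketch that packs variable-length id statements into a dense `(num_statements, max_len)` tensor; the function name and the choice of padding index here are illustrative, not the PR's implementation:

```python
import torch


def pad_statements(statements, max_len, padding_idx):
    """Pad variable-length id statements to a dense (num_statements, max_len) tensor."""
    out = torch.full((len(statements), max_len), padding_idx, dtype=torch.long)
    for i, statement in enumerate(statements):
        out[i, : len(statement)] = torch.as_tensor(statement, dtype=torch.long)
    return out


# a plain triple and a statement with one qualifier pair
statements = [[0, 1, 2], [0, 1, 2, 3, 4]]
dense = pad_statements(statements, max_len=5, padding_idx=99)
```

With mostly-short statements and a large `max_len`, most entries of `dense` end up being `padding_idx`, which is exactly the memory-inefficiency described above.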