🐍 🌟 Hyper-Relational Statement Factory #1117
base: master
Conversation
Loading of hyper-relational WD50K datasets works; you can already try it with something like `from pykeen.datasets.wd50k import WD50K_100`:

```python
from pykeen.datasets.wd50k import WD50K_100


def main():
    dataset = WD50K_100(
        create_inverse_triples=True,
        max_num_qualifier_pairs=5,
        eager=True,
    )
    print(dataset.num_entities)


if __name__ == '__main__':
    main()
```
```python
if len(column_remapping) > max_len:
    raise ValueError("remapping must have length not more than the max statement length")

# TODO find a way to load files w/o knowing max_len in advance
```
You can open a file handle, read just the first row, split by the delimiter, then count. Then just close the file handle.
Ah, if only it were that easy 😢 Statements in the input files are not sorted "longest to shortest", so the longest statement might appear somewhere in the middle of the file, e.g.

```
h, r, t, qr1, qv1   # the first row does not have padding commas
...
h, r, t, qr1, qv1, qr2, qv2, qr3, qv3
```

so we'd still need to read through the whole file to identify the maximum length.
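That single pass is still cheap, since it only splits and counts without building any arrays. A minimal sketch of what such a pre-scan could look like (the function name, comma delimiter, and plain-text format are assumptions, not the PR's actual loader):

```python
import csv


def find_max_statement_length(path: str, delimiter: str = ",") -> int:
    """Scan the whole file once and return the largest column count.

    Rows are not sorted by length, so every row must be inspected.
    """
    max_len = 0
    with open(path, newline="") as file:
        for row in csv.reader(file, delimiter=delimiter):
            max_len = max(max_len, len(row))
    return max_len
```

The result could then be passed to `pandas.read_csv` as the number of columns for the real (fast) load.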
Can we please add a small test dataset and demonstrate loading it in unit tests?
src/pykeen/datasets/base.py
Outdated
```
@@ -944,3 +945,105 @@ def _get_df(self) -> pd.DataFrame:
    df = df[usecols]

    return df


class HyperRelationalUnpackedRemoteDataset(PathDataset):
```
I think we need a different base class mixin for this kind of dataset to note that the triples factories have to be StatementFactories.
Overall, nice to see that hyper-relational statements may find their way into PyKEEN 🚀
I added a few comments; I think we can merge/re-use some of the new code with existing triple utilities.
```python
TRIPLES_VALID_URL = f"{BASE_URL}/triples/valid.txt"
TRIPLES_TEST_URL = f"{BASE_URL}/triples/test.txt"
TRIPLES_TRAIN_URL = f"{BASE_URL}/triples/train.txt"
BASE_URL = "https://raw.githubusercontent.com/migalkin/StarE/master/data/clean"
```
Could make sense to put this on Zenodo.
```yaml
arxiv: 2009.10847
github: migalkin/StarE
statistics:
  entities: 47,156 (5,460 qualifier-only)
```
@cthoyt do we rely somewhere on parsing these statistics to integers? If yes, we should store the second number under a different key
```python
logger = logging.getLogger(__name__)

INVERSE_SUFFIX = "_inverse"
TRIPLES_DF_COLUMNS = ("head_id", "head_label", "relation_id", "relation_label", "tail_id", "tail_label")
```
These may already exist as constants, either in pykeen/constants.py or pykeen/typing.py.
Suggested change:

```diff
-    entity_to_id,
-    relation_to_id,
+    entity_to_id: Mapping[str, int],
+    relation_to_id: Mapping[str, int],
```
```python
    entity_to_id,
    relation_to_id,
) -> Tuple[torch.LongTensor, Collection[np.ndarray]]:
    if statements.size == 0:
```
Can we re-use code from pykeen/triples/triples_factory.py?
```python
entity_to_id=self.entity_to_id,
relation_to_id=self.relation_to_id,
labels=labels,
# entity_mask=entity_mask,
```
TODO
```python
    """Return the percentage of statements with qualifiers."""
    return (~(self.mapped_statements[:, 3::2] == self.padding_idx).all(dim=1)).sum().item() / self.num_statements

def extra_repr(self) -> str:
```
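The padding-based qualifier check can be verified on a toy tensor. This sketch assumes `padding_idx = 0` and the `(h, r, t, qr1, qv1, qr2, qv2, ...)` column layout, so qualifier relations sit at columns 3, 5, ...; the tensor values are illustrative, not real data:

```python
import torch

padding_idx = 0
# three statements padded to max length 7: (h, r, t, qr1, qv1, qr2, qv2)
mapped_statements = torch.as_tensor([
    [1, 1, 2, 0, 0, 0, 0],  # plain triple: qualifier slots are all padding
    [1, 2, 3, 4, 5, 0, 0],  # one qualifier pair
    [2, 1, 3, 4, 5, 6, 7],  # two qualifier pairs
])
# a statement has qualifiers iff not all of its qualifier-relation
# columns (3, 5, ...) equal the padding index
has_qualifiers = ~(mapped_statements[:, 3::2] == padding_idx).all(dim=1)
ratio = has_qualifiers.sum().item() / mapped_statements.shape[0]
```

Here two of the three statements carry qualifiers, so the ratio comes out to 2/3.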
Better to overwrite iter_extra_repr and yield from super.
Suggested change:

```diff
-    def __repr__(self) -> str:
-        return f"{self.__class__.__name__}({self.extra_repr()})"
```

not necessary
```python
def create_statement_entity_mapping(statements: Iterable[str]):
    entities = list(sorted(set([e for s in statements for e in s[::2]])))  # padding is already in the ndarray
```
The comment seems to assume a numpy array, but the parameter type annotation says `str`.
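For comparison, a variant whose annotation matches the even-position slicing: each statement is a sequence of labels laid out as `(h, r, t, qr1, qv1, ...)`, so entities occupy the even indices. This is only a sketch of the idea, not the PR's actual code:

```python
from typing import Dict, Iterable, Sequence


def create_statement_entity_mapping(
    statements: Iterable[Sequence[str]],
) -> Dict[str, int]:
    """Collect entity labels from even positions and assign sorted ids."""
    # even indices hold head, tail, and qualifier values;
    # odd indices hold relations and are skipped by s[::2]
    entities = sorted({e for s in statements for e in s[::2]})
    return {entity: index for index, entity in enumerate(entities)}
```

With `Iterable[str]`, `s[::2]` would silently iterate over every other character of each string, which is why the mismatch matters.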
```python
    return s_p_q_to_multi_tails_new


def _create_multi_label_instances(
```
can we re-use this code for normal triples? If yes, then we can just extend the existing function.
This PR introduces a series of efforts on integrating hyper-relational graphs (aka RDF* or LPG) into PyKEEN with factories, datasets, and models.
Here is the adaptation of the code of our ISWC'21 paper:

- statement factories (counterparts of `CoreTriplesFactory` and `TriplesFactory`)
- `max_num_qualifier_pairs` to limit the maximum statement length
- loading statements of various sizes: we now have padding entities/relations; right now they are the last entries of `entity_to_id` and `relation_to_id`. They are auxiliary and, for instance, we won't need to build an inverse type for the padding relation - this is a bit similar to NodePiece with its auxiliary tokens.

Some caveats:

- `pandas.read_csv` requires a pre-defined number of columns to load (otherwise it crashes). So it's either loading speed with pandas, or dummy for loops (but preserving the whole dataset).
- A dense `(num_statements, max_len)` tensor is memory-inefficient - a lot of entries will be padding indices. We have functions to build a sparse edge_index and qualifier_index, but they will be applicable for GNN encoders, not for training/evaluation. Ideally, we'd need some sort of RaggedTensor from TensorFlow, but there is no standard implementation of that in PyTorch.

The first "egg" of the project "PyKEEN Hyper-Relational" 😅
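The padding caveat can be illustrated with a small sketch that packs variable-length id statements into a dense `(num_statements, max_len)` tensor; the function name and the choice of padding index here are illustrative, not the PR's implementation:

```python
import torch


def pad_statements(statements, max_len, padding_idx):
    """Pad variable-length id statements to a dense (num_statements, max_len) tensor."""
    out = torch.full((len(statements), max_len), padding_idx, dtype=torch.long)
    for i, statement in enumerate(statements):
        out[i, : len(statement)] = torch.as_tensor(statement, dtype=torch.long)
    return out


# a plain triple and a statement with one qualifier pair
statements = [[0, 1, 2], [0, 1, 2, 3, 4]]
dense = pad_statements(statements, max_len=5, padding_idx=99)
```

With mostly-short statements and a large `max_len`, most entries of `dense` end up being `padding_idx`, which is exactly the memory-inefficiency described above.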