JsonLdProcessor._compare_rdf_triples() is a massive performance hog in parse_nquads #169

Open
RinkeHoekstra opened this issue Nov 4, 2022 · 0 comments


When building a dataset from N-Quads, the JsonLdProcessor checks for every triple whether it is unique. This is done through a pairwise comparison in JsonLdProcessor._compare_rdf_triples():

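This means that the number of comparisons grows quadratically with the size of the dataset (or at least, the graph): each new triple is compared against every triple already collected, so n distinct triples in one graph need up to n(n-1)/2 comparisons.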

if JsonLdProcessor._compare_rdf_triples(t, triple):

To give some metrics: for a 14k-line N-Quads file, all in a single graph, parsing time on my M1 Mac drops from 18.8s with the comparison to 0.7s without it.

Given the limited occurrence and impact of duplicate triples/quads in N-Quads files, this is really way too expensive.

At the very least, the parser could build an index (HashMap or dict) to speed up this comparison. But given that the JSON-LD builder that usually follows this step deduplicates too, the comparison could be dropped entirely.
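To illustrate the index idea, here is a minimal sketch of hash-based deduplication. It is not the library's code: `triple_key` and `dedupe_triples` are hypothetical helpers, and the triple layout (dicts with `subject`, `predicate` and `object` entries, each carrying `type` and `value`, plus optional `datatype` and `language` on the object) is an assumption about what the parser produces.

```python
def triple_key(triple):
    """Build a hashable key from a triple dict (hypothetical helper).

    Assumes each term is a dict with 'type' and 'value', and that
    objects may also carry 'datatype' and 'language'.
    """
    s, p, o = triple['subject'], triple['predicate'], triple['object']
    return (
        s['type'], s['value'],
        p['type'], p['value'],
        o['type'], o['value'],
        o.get('datatype'), o.get('language'),
    )


def dedupe_triples(triples):
    """Return triples in first-seen order, dropping duplicates.

    Each triple costs one set lookup on average, so the whole pass is
    roughly linear instead of comparing every pair.
    """
    seen = set()
    unique = []
    for triple in triples:
        key = triple_key(triple)
        if key not in seen:
            seen.add(key)
            unique.append(triple)
    return unique
```

In a parser, the `seen` set would probably be kept per graph name and checked as each line is read, replacing the loop over previously collected triples.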

RinkeHoekstra changed the title from "JsonLdProcessor._compare_rdf_triples() is a massive performance hog in from_rdf" to "JsonLdProcessor._compare_rdf_triples() is a massive performance hog in parse_nquads" on Nov 4, 2022.