Duplicate bindings on construct query #1300

Open
coret opened this issue Dec 23, 2023 · 6 comments


coret commented Dec 23, 2023

Issue type:

  • ➕ Feature request

Description:

I get unexpected results with comunica-sparql https://lab.coret.org/rdf/c1.jsonld -q 'CONSTRUCT WHERE { <https://lab.coret.org/id/comunica_testcase_1> ?p ?o ; <http://schema.org/distribution> ?d . ?d ?e ?f }'. The output contains a lot of duplicate triples (like some kind of Cartesian product):

<https://lab.coret.org/id/comunica_testcase_1> <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1a>, <https://lab.coret.org/id/comunica_testcase_1a>.
<https://lab.coret.org/id/comunica_testcase_1a> <http://schema.org/description> "Distributie 1a".
<https://lab.coret.org/id/comunica_testcase_1> <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1b>, <https://lab.coret.org/id/comunica_testcase_1a>.
<https://lab.coret.org/id/comunica_testcase_1a> <http://schema.org/description> "Distributie 1a".
<https://lab.coret.org/id/comunica_testcase_1> <http://schema.org/keywords> "Keyword 1";
    <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1a>.
<https://lab.coret.org/id/comunica_testcase_1a> <http://schema.org/description> "Distributie 1a".
<https://lab.coret.org/id/comunica_testcase_1> <http://schema.org/keywords> "Keyword 2";
    <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1a>.
<https://lab.coret.org/id/comunica_testcase_1a> <http://schema.org/description> "Distributie 1a".
<https://lab.coret.org/id/comunica_testcase_1> a <http://schema.org/Dataset>;
    <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1a>.
<https://lab.coret.org/id/comunica_testcase_1a> <http://schema.org/description> "Distributie 1a".
<https://lab.coret.org/id/comunica_testcase_1> <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1a>, <https://lab.coret.org/id/comunica_testcase_1b>.
<https://lab.coret.org/id/comunica_testcase_1b> <http://schema.org/description> "Distributie 1b".
<https://lab.coret.org/id/comunica_testcase_1> <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1b>, <https://lab.coret.org/id/comunica_testcase_1b>.
<https://lab.coret.org/id/comunica_testcase_1b> <http://schema.org/description> "Distributie 1b".
<https://lab.coret.org/id/comunica_testcase_1> <http://schema.org/keywords> "Keyword 1";
    <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1b>.
<https://lab.coret.org/id/comunica_testcase_1b> <http://schema.org/description> "Distributie 1b".
<https://lab.coret.org/id/comunica_testcase_1> <http://schema.org/keywords> "Keyword 2";
    <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1b>.
<https://lab.coret.org/id/comunica_testcase_1b> <http://schema.org/description> "Distributie 1b".
<https://lab.coret.org/id/comunica_testcase_1> a <http://schema.org/Dataset>;
    <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1b>.
<https://lab.coret.org/id/comunica_testcase_1b> <http://schema.org/description> "Distributie 1b".

The source graph in Turtle being:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .

<https://lab.coret.org/id/comunica_testcase_1> a schema:Dataset ;
	schema:distribution <https://lab.coret.org/id/comunica_testcase_1a>, <https://lab.coret.org/id/comunica_testcase_1b> ;
	schema:keywords "Keyword 1", "Keyword 2" .

<https://lab.coret.org/id/comunica_testcase_1a> schema:description "Distributie 1a" .

<https://lab.coret.org/id/comunica_testcase_1b> schema:description "Distributie 1b" .

The expected result (as given by Apache Jena and GraphDB):

<https://lab.coret.org/id/comunica_testcase_1>
        a <http://schema.org/Dataset> ;
        <http://schema.org/distribution> <https://lab.coret.org/id/comunica_testcase_1a> , <https://lab.coret.org/id/comunica_testcase_1b> ;
        <http://schema.org/keywords>  "Keyword 1" , "Keyword 2" .

<https://lab.coret.org/id/comunica_testcase_1b>
        <http://schema.org/description> "Distributie 1b" .

<https://lab.coret.org/id/comunica_testcase_1a>
        <http://schema.org/description> "Distributie 1a" .

The issue is not with the comunica-sparql CLI tool, but with the Comunica core. This is a slimmed-down version of the problem we encounter in the NDE Dataset Register - netwerk-digitaal-erfgoed/dataset-register#831 - where we use Comunica a lot. In that case the number of bindings explodes beyond our configured maximum of 50000.


Environment:

software             version
Comunica Init Actor  1.22.2
node                 v16.14.2
npm                  9.7.1
yarn                 1.22.19
Operating System     linux (Linux 6.1.0-13-amd64)

Bounty

A bounty has been placed on this issue by:

Comunica Association
€272



Thanks for reporting!

@rubensworks added this to Triage in Maintenance Dec 23, 2023
@rubensworks
Member

Strictly speaking, this is not really a bug, since CONSTRUCT queries return RDF graphs, which are sets of triples.
So if the serialized output contains duplicate triples, processors can remove them following set semantics.

Comunica produces these duplicates because CONSTRUCT queries are built on top of SELECT queries: the triple templates are instantiated once per solution, so the same triple may be produced more than once.
Since Comunica produces results in a streaming manner, these duplicates are not removed.
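
To make the duplication concrete, here is a minimal TypeScript sketch that mimics this behavior on the graph from the report. It is not Comunica's implementation: the `Triple` type, the string-based terms, and the naive nested-loop evaluation are assumptions made purely for illustration.

```typescript
// Minimal triple model; terms are plain strings for illustration only.
type Triple = { s: string; p: string; o: string };

const ds = "https://lab.coret.org/id/comunica_testcase_1";
const distribution = "http://schema.org/distribution";

// The source graph from the report (7 triples).
const graph: Triple[] = [
  { s: ds, p: "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", o: "http://schema.org/Dataset" },
  { s: ds, p: distribution, o: ds + "a" },
  { s: ds, p: distribution, o: ds + "b" },
  { s: ds, p: "http://schema.org/keywords", o: "\"Keyword 1\"" },
  { s: ds, p: "http://schema.org/keywords", o: "\"Keyword 2\"" },
  { s: ds + "a", p: "http://schema.org/description", o: "\"Distributie 1a\"" },
  { s: ds + "b", p: "http://schema.org/description", o: "\"Distributie 1b\"" },
];

// Evaluate the WHERE pattern naively: every combination of
// (?p ?o) x (?d) x (?e ?f) is one solution.
const emitted: Triple[] = [];
for (const t1 of graph) {
  if (t1.s !== ds) continue; // <ds> ?p ?o
  for (const t2 of graph) {
    if (t2.s !== ds || t2.p !== distribution) continue; // <ds> schema:distribution ?d
    for (const t3 of graph) {
      if (t3.s !== t2.o) continue; // ?d ?e ?f
      // Instantiate the three template triples for this solution.
      emitted.push(t1, t2, t3);
    }
  }
}

// Collapse the emitted triples into a set keyed by their string form.
const distinctKeys = new Set<string>();
for (const t of emitted) distinctKeys.add(`${t.s} ${t.p} ${t.o}`);

const emittedCount = emitted.length;    // 30: 10 solutions x 3 template triples
const distinctCount = distinctKeys.size; // 7: the triples of the source graph
```

With the data from the report, the pattern has 10 solutions (5 `?p ?o` pairs times 2 distributions times 1 description each), so 30 triples are emitted while only 7 are distinct. This matches the 10 groups in the output quoted in the original post.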

That being said, I agree that it may sometimes be inconvenient to have these duplicates. So I'm converting this issue to a feature request where CONSTRUCT queries may optionally run in a DISTINCT-mode where duplicate triples will explicitly be removed, at the cost of an increase in memory usage for large results.
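
As a rough idea of what such a DISTINCT mode involves, the following TypeScript sketch filters a sequence of triples through a set of already-seen keys. The `Triple` type and the `createDistinctFilter` helper are hypothetical names, not part of Comunica's API; the point is only that every distinct triple has to be remembered, which is where the extra memory goes.

```typescript
// Minimal triple model; terms are plain strings for illustration only.
type Triple = { s: string; p: string; o: string };

// Returns a predicate that is true only the first time a triple is seen.
// The set grows by one entry per distinct triple, hence the memory cost.
function createDistinctFilter(): (t: Triple) => boolean {
  const seen = new Set<string>();
  return (t: Triple): boolean => {
    // JSON encoding avoids key collisions between terms containing spaces.
    const key = JSON.stringify([t.s, t.p, t.o]);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  };
}

const input: Triple[] = [
  { s: "ex:a", p: "ex:p", o: "ex:b" },
  { s: "ex:a", p: "ex:p", o: "ex:b" }, // duplicate of the first triple
  { s: "ex:a", p: "ex:q", o: "ex:c" },
];

// Keeps the first and third triple, in their original order.
const output = input.filter(createDistinctFilter());
```

Because the predicate decides per triple, it can be applied as results arrive rather than after buffering the whole result, so output can stay streaming even though the set of seen keys keeps growing.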

@rubensworks added this to Triage in Development Jan 8, 2024
@karelklima
Contributor

I ran into this issue as well recently. My solution was to load the resulting triples into an N3.Store and then iterate over the data from there. Not efficient, but it got the job done.

@jacoscaz
Contributor

jacoscaz commented Feb 27, 2024

at the cost of an increase in memory usage for large results.

Just as a suggestion based on experience with jacoscaz/quadstore#155: when implementing the removal of duplicate triples, consider giving users a way to set the maximum size of the set, so that Comunica has a chance to throw an error and recover instead of potentially crashing the process by running out of memory.
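
One possible shape for such a limit, sketched in TypeScript under the assumption that deduplication uses an in-memory set. The `BoundedDistinctFilter` class and its `maxSize` parameter are hypothetical and do not correspond to any existing Comunica option.

```typescript
// Minimal triple model; terms are plain strings for illustration only.
type Triple = { s: string; p: string; o: string };

class BoundedDistinctFilter {
  private readonly seen = new Set<string>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  // Returns true for a new triple and false for a duplicate.
  // Throws instead of growing the set past the configured cap.
  accept(t: Triple): boolean {
    const key = JSON.stringify([t.s, t.p, t.o]);
    if (this.seen.has(key)) return false;
    if (this.seen.size >= this.maxSize) {
      throw new RangeError(`distinct set exceeded ${this.maxSize} entries`);
    }
    this.seen.add(key);
    return true;
  }
}

const filter = new BoundedDistinctFilter(2);
const results: boolean[] = [];
results.push(filter.accept({ s: "ex:a", p: "ex:p", o: "1" })); // new
results.push(filter.accept({ s: "ex:a", p: "ex:p", o: "1" })); // duplicate
results.push(filter.accept({ s: "ex:a", p: "ex:p", o: "2" })); // new, set is now full

// A third distinct triple exceeds the cap and raises a catchable error.
let limitHit = false;
try {
  filter.accept({ s: "ex:a", p: "ex:p", o: "3" });
} catch (e) {
  limitHit = true;
}
```

Failing with an ordinary error lets the caller abort the query and carry on, which an out-of-memory crash does not.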

@rubensworks
Member

A bounty has been placed on this issue via the Comunica Association (see original post).

@simonvbrae
Contributor

I have done a bit of work toward this on this branch.

Changes to packages/actor-init-query/lib/ActorInitQuery.ts were a first step to implementing deduplication using N3.Store.
Other changes add the necessary command line argument.

@rubensworks removed this from Triage in Maintenance Apr 19, 2024
@rubensworks moved this from Triage to In progress in Development Apr 19, 2024