Only first page of Hydra paged collection returned #1180

Closed
ddeboer opened this issue Mar 20, 2023 · 7 comments

ddeboer commented Mar 20, 2023

Issue type:

  • 🐛 Bug

Description:

Assume a paginated Turtle file at https://opendata.picturae.com/catalog.ttl?page=1. (This uses the legacy hydra:PagedCollection rather than the newer hydra:PartialCollectionView, but Comunica seems to support both.)
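For readers unfamiliar with the two Hydra paging vocabularies, here is a minimal sketch of following next-page links under either one. It assumes the legacy `hydra:PagedCollection` links pages with `hydra:nextPage` and `hydra:PartialCollectionView` with `hydra:next`. The `Triple` shape, the helper names, and the `example.org` pages are hypothetical and held in memory. This is not how Comunica implements it.

```typescript
// Minimal triple shape for illustration; real code would use RDF/JS terms.
interface Triple { subject: string; predicate: string; object: string }

const HYDRA = 'http://www.w3.org/ns/hydra/core#';
// hydra:next belongs to hydra:PartialCollectionView,
// hydra:nextPage to the legacy hydra:PagedCollection.
const NEXT_PREDICATES = [`${HYDRA}next`, `${HYDRA}nextPage`];

// Return the URL of the next page, if the page advertises one.
function findNextPage(triples: Triple[]): string | undefined {
  const link = triples.find(t => NEXT_PREDICATES.indexOf(t.predicate) !== -1);
  return link ? link.object : undefined;
}

// Follow next links starting at `first`, guarding against cycles.
function collectPages(first: string, load: (url: string) => Triple[]): string[] {
  const visited: string[] = [];
  let url: string | undefined = first;
  while (url !== undefined && visited.indexOf(url) === -1) {
    visited.push(url);
    url = findNextPage(load(url));
  }
  return visited;
}

// Hypothetical three-page collection held in memory (no network access).
const base = 'https://example.org/catalog.ttl';
const pages: Record<string, Triple[]> = {
  [`${base}?page=1`]: [{ subject: `${base}?page=1`, predicate: `${HYDRA}nextPage`, object: `${base}?page=2` }],
  [`${base}?page=2`]: [{ subject: `${base}?page=2`, predicate: `${HYDRA}nextPage`, object: `${base}?page=3` }],
  [`${base}?page=3`]: [],
};

const visitedPages = collectPages(`${base}?page=1`, url => pages[url] || []);
console.log(visitedPages.length); // prints 3
```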

A query using two predicates returns the expected output (a count of 209 resources that are spread over 3 pages):

comunica-sparql -l debug 'https://opendata.picturae.com/catalog.ttl?page=1' 'select (count(?s) as ?c) {?s a <http://www.w3.org/ns/dcat#Dataset> ; <http://purl.org/dc/terms/identifier> ?i }'
[2023-03-20T20:20:18.029Z]  INFO: Requesting https://opendata.picturae.com/catalog.ttl?page=1 { headers: { accept: 'application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,application/json;q=0.45,text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,text/html;q=0.2,application/xhtml+xml;q=0.18,text/shaclc;q=0.1,text/shaclc-ext;q=0.05', 'user-agent': 'Comunica/actor-http-fetch (Node.js v18.14.1; darwin)' }, method: 'GET', actor: 'urn:comunica:default:http/actors#fetch' }
[2023-03-20T20:20:19.118Z]  INFO: Identified as file source: https://opendata.picturae.com/catalog.ttl?page=1 { actor: 'urn:comunica:default:rdf-resolve-hypermedia/actors#none' }
[2023-03-20T20:20:19.134Z]  DEBUG: Determined physical join operator 'inner-bind' { entries: 2, variables: [ [ 's' ], [ 's', 'i' ] ], costs: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': 106100.40607586392, 'inner-hash': 213500, 'inner-symmetric-hash': 212600, 'inner-nested-loop': 222200, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': undefined }, coefficients: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': { iterations: 0.3899427883167721, persistedItems: 0, blockingItems: 0, requestTime: 1061.000161330756 }, 'inner-hash': { iterations: 200, persistedItems: 100, blockingItems: 100, requestTime: 2122 }, 'inner-symmetric-hash': { iterations: 200, persistedItems: 200, blockingItems: 0, requestTime: 2122 }, 'inner-nested-loop': { iterations: 10000, persistedItems: 0, blockingItems: 0, requestTime: 2122 }, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': undefined } }
[2023-03-20T20:20:19.135Z]  DEBUG: First entry for Bind Join:  { entry: Quad { termType: 'Quad', value: '', subject: Variable { termType: 'Variable', value: 's' }, predicate: NamedNode { termType: 'NamedNode', value: 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' }, object: NamedNode { termType: 'NamedNode', value: 'http://www.w3.org/ns/dcat#Dataset' }, graph: DefaultGraph { termType: 'DefaultGraph', value: '' }, type: 'pattern' }, metadata: { requestTime: 1061, pageSize: 100, cardinality: { type: 'exact', value: 100 }, first: 'https://opendata.picturae.com/catalog.ttl?page=1', next: 'https://opendata.picturae.com/catalog.ttl?page=2', previous: null, last: 'https://opendata.picturae.com/catalog.ttl?page=3', searchForms: { values: [] }, canContainUndefs: false, order: undefined, availableOrders: undefined, variables: [ Variable { termType: 'Variable', value: 's' } ] }, actor: 'urn:comunica:default:rdf-join/actors#inner-multi-bind' }
[[2023-03-20T20:20:19.142Z]  INFO: Requesting https://opendata.picturae.com/catalog.ttl?page=2 { headers: { accept: 'application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,application/json;q=0.45,text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,text/html;q=0.2,application/xhtml+xml;q=0.18,text/shaclc;q=0.1,text/shaclc-ext;q=0.05', 'user-agent': 'Comunica/actor-http-fetch (Node.js v18.14.1; darwin)' }, method: 'GET', actor: 'urn:comunica:default:http/actors#fetch' }
[2023-03-20T20:20:20.095Z]  INFO: Identified as file source: https://opendata.picturae.com/catalog.ttl?page=2 { actor: 'urn:comunica:default:rdf-resolve-hypermedia/actors#none' }
[2023-03-20T20:20:20.111Z]  INFO: Requesting https://opendata.picturae.com/catalog.ttl?page=3 { headers: { accept: 'application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,application/json;q=0.45,text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,text/html;q=0.2,application/xhtml+xml;q=0.18,text/shaclc;q=0.1,text/shaclc-ext;q=0.05', 'user-agent': 'Comunica/actor-http-fetch (Node.js v18.14.1; darwin)' }, method: 'GET', actor: 'urn:comunica:default:http/actors#fetch' }
[2023-03-20T20:20:20.264Z]  INFO: Identified as file source: https://opendata.picturae.com/catalog.ttl?page=3 { actor: 'urn:comunica:default:rdf-resolve-hypermedia/actors#none' }

{"c":"\"209\"^^http://www.w3.org/2001/XMLSchema#integer"}
]

However, adding a third predicate to the query (one that all resources have) produces an incomplete count:

comunica-sparql -l debug 'https://opendata.picturae.com/catalog.ttl?page=1' 'select (count(?s) as ?c) {?s a <http://www.w3.org/ns/dcat#Dataset> ; <http://purl.org/dc/terms/identifier> ?i; <http://purl.org/dc/terms/issued> ?issued }'
[2023-03-20T20:20:04.874Z]  INFO: Requesting https://opendata.picturae.com/catalog.ttl?page=1 { headers: { accept: 'application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,application/json;q=0.45,text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,text/html;q=0.2,application/xhtml+xml;q=0.18,text/shaclc;q=0.1,text/shaclc-ext;q=0.05', 'user-agent': 'Comunica/actor-http-fetch (Node.js v18.14.1; darwin)' }, method: 'GET', actor: 'urn:comunica:default:http/actors#fetch' }
[2023-03-20T20:20:06.144Z]  INFO: Identified as file source: https://opendata.picturae.com/catalog.ttl?page=1 { actor: 'urn:comunica:default:rdf-resolve-hypermedia/actors#none' }
[2023-03-20T20:20:06.163Z]  DEBUG: Determined physical join operator 'inner-bind' { entries: 3, variables: [ [ 's' ], [ 's', 'i' ], [ 's', 'issued' ] ], costs: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': 124400.8555483328, 'inner-hash': undefined, 'inner-symmetric-hash': undefined, 'inner-nested-loop': undefined, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': 1373200 }, coefficients: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': { iterations: 0.7798855766335442, persistedItems: 0, blockingItems: 0, requestTime: 1244.0007566275617 }, 'inner-hash': undefined, 'inner-symmetric-hash': undefined, 'inner-nested-loop': undefined, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': { iterations: 1000000, persistedItems: 0, blockingItems: 0, requestTime: 3732 } } }
[2023-03-20T20:20:06.164Z]  DEBUG: First entry for Bind Join:  { entry: Quad { termType: 'Quad', value: '', subject: Variable { termType: 'Variable', value: 's' }, predicate: NamedNode { termType: 'NamedNode', value: 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' }, object: NamedNode { termType: 'NamedNode', value: 'http://www.w3.org/ns/dcat#Dataset' }, graph: DefaultGraph { termType: 'DefaultGraph', value: '' }, type: 'pattern' }, metadata: { requestTime: 1244, pageSize: 100, cardinality: { type: 'exact', value: 100 }, first: 'https://opendata.picturae.com/catalog.ttl?page=1', next: 'https://opendata.picturae.com/catalog.ttl?page=2', previous: null, last: 'https://opendata.picturae.com/catalog.ttl?page=3', searchForms: { values: [] }, canContainUndefs: false, order: undefined, availableOrders: undefined, variables: [ Variable { termType: 'Variable', value: 's' } ] }, actor: 'urn:comunica:default:rdf-join/actors#inner-multi-bind' }
[[2023-03-20T20:20:06.169Z]  DEBUG: Determined physical join operator 'inner-nested-loop' { entries: 2, variables: [ [ 'i' ], [ 'issued' ] ], costs: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': undefined, 'inner-hash': 2501, 'inner-symmetric-hash': 2492, 'inner-nested-loop': 2489, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': undefined }, coefficients: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': undefined, 'inner-hash': { iterations: 2, persistedItems: 1, blockingItems: 1, requestTime: 24.88 }, 'inner-symmetric-hash': { iterations: 2, persistedItems: 2, blockingItems: 0, requestTime: 24.88 }, 'inner-nested-loop': { iterations: 1, persistedItems: 0, blockingItems: 0, requestTime: 24.88 }, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': undefined } }
[2023-03-20T20:20:06.173Z]  INFO: Requesting https://opendata.picturae.com/catalog.ttl?page=2 { headers: { accept: 'application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,application/json;q=0.45,text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,text/html;q=0.2,application/xhtml+xml;q=0.18,text/shaclc;q=0.1,text/shaclc-ext;q=0.05', 'user-agent': 'Comunica/actor-http-fetch (Node.js v18.14.1; darwin)' }, method: 'GET', actor: 'urn:comunica:default:http/actors#fetch' }
[2023-03-20T20:20:07.148Z]  INFO: Identified as file source: https://opendata.picturae.com/catalog.ttl?page=2 { actor: 'urn:comunica:default:rdf-resolve-hypermedia/actors#none' }
[2023-03-20T20:20:07.163Z]  INFO: Requesting https://opendata.picturae.com/catalog.ttl?page=3 { headers: { accept: 'application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,application/json;q=0.45,text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,text/html;q=0.2,application/xhtml+xml;q=0.18,text/shaclc;q=0.1,text/shaclc-ext;q=0.05', 'user-agent': 'Comunica/actor-http-fetch (Node.js v18.14.1; darwin)' }, method: 'GET', actor: 'urn:comunica:default:http/actors#fetch' }
[2023-03-20T20:20:07.583Z]  INFO: Identified as file source: https://opendata.picturae.com/catalog.ttl?page=3 { actor: 'urn:comunica:default:rdf-resolve-hypermedia/actors#none' }
[2023-03-20T20:20:07.588Z]  DEBUG: Determined physical join operator 'inner-nested-loop' { entries: 2, variables: [ [ 'i' ], [ 'issued' ] ], costs: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': undefined, 'inner-hash': 2501, 'inner-symmetric-hash': 2492, 'inner-nested-loop': 2489, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': undefined }, coefficients: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': undefined, 'inner-hash': { iterations: 2, persistedItems: 1, blockingItems: 1, requestTime: 24.88 }, 'inner-symmetric-hash': { iterations: 2, persistedItems: 2, blockingItems: 0, requestTime: 24.88 }, 'inner-nested-loop': { iterations: 1, persistedItems: 0, blockingItems: 0, requestTime: 24.88 }, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': undefined } }
[2023-03-20T20:20:07.595Z]  DEBUG: Determined physical join operator 'inner-nested-loop' { entries: 2, variables: [ [ 'i' ], [ 'issued' ] ], costs: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': undefined, 'inner-hash': 2501, 'inner-symmetric-hash': 2492, 'inner-nested-loop': 2489, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': undefined }, coefficients: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': undefined, 'inner-hash': { iterations: 2, persistedItems: 1, blockingItems: 1, requestTime: 24.88 }, 'inner-symmetric-hash': { iterations: 2, persistedItems: 2, blockingItems: 0, requestTime: 24.88 }, 'inner-nested-loop': { iterations: 1, persistedItems: 0, blockingItems: 0, requestTime: 24.88 }, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': undefined } }
[2023-03-20T20:20:07.600Z]  DEBUG: Determined physical join operator 'inner-nested-loop' { entries: 2, variables: [ [ 'i' ], [ 'issued' ] ], costs: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': undefined, 'inner-hash': 2501, 'inner-symmetric-hash': 2492, 'inner-nested-loop': 2489, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': undefined }, coefficients: { 'inner-none': undefined, 'inner-single': undefined, 'inner-multi-empty': undefined, 'inner-bind': undefined, 'inner-hash': { iterations: 2, persistedItems: 1, blockingItems: 1, requestTime: 24.88 }, 'inner-symmetric-hash': { iterations: 2, persistedItems: 2, blockingItems: 0, requestTime: 24.88 }, 'inner-nested-loop': { iterations: 1, persistedItems: 0, blockingItems: 0, requestTime: 24.88 }, 'optional-bind': undefined, 'optional-nested-loop': undefined, 'minus-hash': undefined, 'minus-hash-undef': undefined, 'inner-multi-smallest': undefined } }

<snip>

{"c":"\"100\"^^http://www.w3.org/2001/XMLSchema#integer"}

Only 100 resources are counted, which is just the first page. All resources have the predicate dct:issued. This can be verified by replacing dct:identifier with dct:issued, which returns 209 results again.


Environment:

Software          Version
Comunica Engine   2.6.9
node              v18.14.1
npm               9.3.1
yarn              1.22.19
Operating System  darwin (Darwin 22.3.0)

@github-actions

Thanks for reporting!

@ddeboer
Author

ddeboer commented Mar 21, 2023

@rubensworks Is this an easy fix on your side? If not, I will implement a workaround on ours.

@rubensworks
Member

Not sure yet, will look into it next week.

@rubensworks
Member

rubensworks commented Mar 27, 2023

Support for paged collections (non-TPF/QPF) in Comunica is not well-tested at the moment, so issues like these are not surprising.
In any case, this is something we want to properly support, so this must be fixed.
But we won't be able to fix this in the very short term, so a workaround is preferred.

Some notes to self:

Simpler query that also fails to produce results:

SELECT *
WHERE {
  <https://opendata.picturae.com/dataset/dre_a2a_webservice> <http://purl.org/dc/terms/identifier> ?i.
  <https://opendata.picturae.com/dataset/dre_a2a_webservice> <http://purl.org/dc/terms/issued> ?issued.
}

The problem is that the linked hypermedia iterator overwrites the metadata for each new page. In this case, each page defaults to the none-source-type, which provides exact cardinalities for the matches in that page (whereas TPF uses the Hydra cardinality). This causes the empty-join actor to be used, which returns an empty result stream.
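A simplified model of that failure mode, not Comunica's actual code: the `Cardinality` shape, the per-page numbers, and the helper names are all hypothetical. The sketch contrasts a planner that sees only one page's exact count with one that accumulates counts across pages.

```typescript
// Hypothetical, simplified cardinality metadata for one triple pattern.
interface Cardinality { type: 'exact' | 'estimate'; value: number }

// Assumed per-page counts: the only match for the pattern lives on page 2.
const perPage: Cardinality[] = [
  { type: 'exact', value: 0 },
  { type: 'exact', value: 1 },
  { type: 'exact', value: 0 },
];

// Accumulate page-level counts into one collection-level cardinality.
// The total stays 'exact' only if every page reported an exact count.
function accumulate(pageCardinalities: Cardinality[]): Cardinality {
  const total: Cardinality = { type: 'exact', value: 0 };
  for (const page of pageCardinalities) {
    total.value += page.value;
    if (page.type !== 'exact') {
      total.type = 'estimate';
    }
  }
  return total;
}

// Stand-in for the planner's decision: an exact zero means "provably empty".
function looksProvablyEmpty(c: Cardinality): boolean {
  return c.type === 'exact' && c.value === 0;
}

// Overwritten metadata: only page 1 is visible, so the pattern looks empty.
const overwritten = looksProvablyEmpty(perPage[0]);
// Accumulated metadata: the match on page 2 is accounted for.
const accumulated = looksProvablyEmpty(accumulate(perPage));

console.log(overwritten, accumulated); // prints: true false
```

In this model the overwritten metadata wrongly marks the pattern as empty, while the accumulated metadata does not.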

One solution would be to merge (and test) the feature/adaptive-join branch. In that branch, we prefer dataset-level cardinalities, which will contain the Hydra cardinality, so that we don't use the page-specific cardinality.

Also, if comunica/comunica-feature-link-traversal#102 is the same problem, we will want to change the none-source-type so that it emits only lowerLimit cardinalities instead of exact ones. Furthermore, the empty-join must then not be used for lowerLimit cardinalities.
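The lowerLimit idea could look roughly like the sketch below. The `'lowerLimit'` type is taken from the note above and is treated as hypothetical here; the interface and function names are illustrative assumptions, not a released Comunica API.

```typescript
// 'lowerLimit' is the cardinality type proposed in the note above;
// it is hypothetical here, as are all names in this sketch.
type CardinalityType = 'exact' | 'estimate' | 'lowerLimit';
interface Cardinality { type: CardinalityType; value: number }

// A single page of a larger collection can only promise
// "at least this many matches" for the collection as a whole.
function pageCardinality(matchesInPage: number): Cardinality {
  return { type: 'lowerLimit', value: matchesInPage };
}

// A join may be short-circuited to an empty result only when at least
// one entry is provably empty, i.e. its cardinality is an exact zero.
function canShortCircuitToEmpty(entries: Cardinality[]): boolean {
  return entries.some(c => c.type === 'exact' && c.value === 0);
}

// Join entries whose metadata came from a single page: never provably empty.
const fromPage: Cardinality[] = [pageCardinality(0), pageCardinality(1)];
// Join entries from a complete source: an exact zero is trustworthy.
const fromCompleteSource: Cardinality[] = [
  { type: 'exact', value: 0 },
  { type: 'exact', value: 5 },
];

console.log(canShortCircuitToEmpty(fromPage)); // prints false
console.log(canShortCircuitToEmpty(fromCompleteSource)); // prints true
```

Treating a zero lower limit as "unknown" rather than "empty" costs some optimization opportunities, but it avoids dropping results that appear on later pages.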

rubensworks self-assigned this Mar 27, 2023
rubensworks added a commit that referenced this issue Apr 6, 2023
This caused problems related to dataset-level cardinalities
that were found in the initial source being overridden without
proper accumulation with exact cardinalities from later sources

Closes #1156
Closes #1180

May be related to comunica/comunica-feature-link-traversal#102
rubensworks added a commit that referenced this issue Apr 12, 2023
This caused problems related to dataset-level cardinalities
that were found in the initial source being overridden without
proper accumulation with exact cardinalities from later sources

Closes #1156
Closes #1180

May be related to comunica/comunica-feature-link-traversal#102
Maintenance automation moved this from Triage to Done May 22, 2023
@ddeboer
Author

ddeboer commented May 2, 2024

@rubensworks This issue was closed, but the problem persists. The query

comunica-sparql -l debug 'https://opendata.picturae.com/catalog.ttl?page=1' 'select (count(?s) as ?c) {?s a <http://www.w3.org/ns/dcat#Dataset> ; <http://purl.org/dc/terms/identifier> ?i; <http://purl.org/dc/terms/issued> ?issued }'

returns only the first page (100 results), not all (209 results).

@rubensworks
Member

Probably a regression in v3.

We should make sure to add a proper integration test for this case.

But at least we know where to fix the problem now.

rubensworks reopened this May 2, 2024
Maintenance automation moved this from Done to To Do (prio:medium) May 2, 2024
rubensworks moved this from To Do (prio:medium) to To Do (prio:high) in Maintenance May 2, 2024
@rubensworks
Member

@ddeboer This has been fixed in release 3.1.1!

Maintenance automation moved this from To Do (prio:high) to Done May 11, 2024