
Potentially unintended consequences of limit in dbpedia_get_wikidata_uris() #29

Open
ChristophLeonhardt opened this issue Feb 21, 2024 · 1 comment

Comments

@ChristophLeonhardt
Collaborator

The limit parameter can have unintended consequences, at least when following the current documentation of the dbpedia_get_wikidata_uris() function.

Take this as an example:

wikidata_uris <- dbpedia_get_wikidata_uris(
  x = c("http://dbpedia.org/resource/London", "http://dbpedia.org/resource/Washington,_D.C."),
  endpoint = "https://dbpedia.org/sparql/",
  wait = 5,
  limit = 2,
  progress = TRUE
)

In this example, the two queries are processed as one chunk, i.e. in one query sent to the endpoint. Although both items have Wikidata IDs associated with them in DBpedia, only Wikidata IDs for the first item are returned.

At first glance, this might be expected behavior. The limit argument is used as a parameter of the query and controls the number of results returned by the server. If it is set to 2, and the first item in the query has more than one Wikidata ID in the "sameAs" property (which it does in this example), then all returned Wikidata IDs will be for this first item only.
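To make this concrete, here is a hedged sketch of how a per-chunk query with a LIMIT clause might be assembled (this is a reconstruction for illustration, not the package's actual code, and the real query may differ). With two owl:sameAs Wikidata values for London, LIMIT 2 is exhausted before any row for Washington, D.C. is returned:

```r
# Hypothetical reconstruction of a per-chunk SPARQL query; the real query
# built by dbpedia_get_wikidata_uris() may differ in structure and prefixes.
uris <- c(
  "http://dbpedia.org/resource/London",
  "http://dbpedia.org/resource/Washington,_D.C."
)

# All URIs of the chunk go into one VALUES clause ...
values_clause <- paste(sprintf("<%s>", uris), collapse = " ")

# ... but a single LIMIT caps the total number of result rows,
# not the number of rows per input item.
query <- sprintf(
  "SELECT ?item ?wikidata_uri WHERE {
     VALUES ?item { %s }
     ?item owl:sameAs ?wikidata_uri .
     FILTER(STRSTARTS(STR(?wikidata_uri), 'http://www.wikidata.org/'))
   } LIMIT %d",
  values_clause, 2L
)
cat(query)
```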

However, limit is also used to split the input vector x into chunks, and this is how the argument is documented in the package. As a result, both URIs are passed in a single SPARQL query, and the single LIMIT value applies to the combined results for both items.
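The chunking side of the behavior can be illustrated in base R (a sketch under the assumption that the input is split into chunks of size limit; the package's actual implementation may differ):

```r
# Hypothetical sketch: splitting the input vector into chunks of size
# `limit`, so that each chunk becomes one SPARQL query.
uris <- c(
  "http://dbpedia.org/resource/London",
  "http://dbpedia.org/resource/Washington,_D.C."
)
limit <- 2
chunks <- split(uris, ceiling(seq_along(uris) / limit))
length(chunks)  # with limit = 2, both URIs land in a single chunk
```

This is why, in the example above, both URIs share one query and therefore one LIMIT.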

In the case above, a larger value for limit would solve the problem as it would allow all values for both items to be returned.

But I think that using limit for both purposes (as the LIMIT clause of the query and as the chunk size for the input vector) might be confusing and should be reconsidered.

@ablaette
Contributor

Just want to report that after introducing a distinction between 'chunksize' and 'limit', diverging values resulted in join problems: essentially, the two values should be identical!
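The join problem described above can be sketched as follows (a minimal illustration with made-up result rows, assuming a chunk of two items where the server-side LIMIT truncates the results after the first item; Q84 is London's Wikidata ID):

```r
# Hypothetical illustration: if LIMIT cuts off results mid-chunk, a later
# join of results back onto the input leaves gaps for truncated items.
input <- data.frame(item = c("London", "Washington,_D.C."))

# Suppose the truncated response contains rows for the first item only.
returned <- data.frame(item = "London", wikidata = "Q84")

merged <- merge(input, returned, by = "item", all.x = TRUE)
merged  # Washington,_D.C. ends up with NA for its Wikidata ID
```

Keeping the chunk size and the query LIMIT identical (or making LIMIT generous enough for all sameAs values in a chunk) avoids this truncation.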
