
How does the Wikidata graph split affect scholia? #2423

Open
dpriskorn opened this issue Feb 8, 2024 · 15 comments
Labels

  • completeness: how complete the data in Scholia/Wikidata is compared to what's out there
  • performance: the way Scholia treats the machines using it
  • question: something looking for an answer
  • SPARQL: the way Scholia queries Wikidata

Comments

@dpriskorn

Context

See https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/IIA5LVHBYK45FSMLPIVZI6WXA5QSRPF4/

Questions

How many queries need to be rewritten?
Can all of them be rewritten without adverse effects like timeout?
How much effort is it to rewrite?
Can the rewriting be automated somehow?

dpriskorn added the question label on Feb 8, 2024
dpriskorn changed the title from "how does the graph split affect scholia?" to "How does the Wikidata graph split affect scholia?" on Feb 8, 2024
@Daniel-Mietchen
Member

Thanks for keeping an eye on this, @dpriskorn! The link in your post does not work for me, so here is another link to what is probably the same message.

We are looking into the matter and do not have good answers to your questions yet, but here are some guesstimates:

  1. How many queries need to be rewritten?
  2. Can all of them be rewritten without adverse effects like timeout?
    • Not if the timeout settings remain the same, since federation adds complexity. Working with a static dataset might have some performance benefits, though.
  3. How much effort is it to rewrite?
    • We need to review all queries to determine whether they are affected, i.e. whether they (a) run on either of the new main or scholarly endpoints and (b) give the same results as the full endpoint. This could probably be largely automated in a matter of hours by someone who understands the matter (see the sketch after this list).
    • Any queries that fail to run, or whose results differ in substance, would need to be rewritten. Assuming an average of 5-10 minutes per query, that means something on the order of a person-week of work time. I suspect that some queries might not work usefully at all, so we would need to change their functionality.
    • Perhaps we need a dedicated hackathon just for such adaptations of Scholia queries.
  4. Can the rewriting be automated somehow?
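
To make point 3 concrete, here is a minimal sketch of what such an automated review could look like: run every stored query against the full endpoint and both experimental endpoints, then flag failures and result mismatches. The endpoint URLs are the experimental ones quoted later in this thread; the query directory is illustrative, and Scholia's query templates would first need their placeholders filled in before they can execute.

import glob
import json
import requests

FULL = "https://query.wikidata.org/sparql"
MAIN = "https://query-main-experimental.wikidata.org/sparql"
SCHOLARLY = "https://query-scholarly-experimental.wikidata.org/sparql"

def run(endpoint, query):
    """Run a SPARQL query; return the result bindings, or None on error/timeout."""
    try:
        response = requests.get(
            endpoint,
            params={"query": query, "format": "json"},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["results"]["bindings"]
    except Exception:
        return None

def canonical(bindings):
    """Order-insensitive form of a result set, for comparison."""
    return sorted(json.dumps(b, sort_keys=True) for b in bindings)

# Illustrative path; the real templates contain {{ ... }} placeholders
# that would need substituting with sample QIDs before running.
for path in glob.glob("scholia/app/templates/*.sparql"):
    with open(path) as f:
        query = f.read()
    reference = run(FULL, query)
    if reference is None:
        print(f"{path}: fails even on the full endpoint")
        continue
    for name, endpoint in (("main", MAIN), ("scholarly", SCHOLARLY)):
        candidate = run(endpoint, query)
        if candidate is None:
            print(f"{path}: fails on {name}")
        elif canonical(candidate) != canonical(reference):
            print(f"{path}: results differ on {name}")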

Daniel-Mietchen added the SPARQL, performance, and completeness labels on Feb 8, 2024
@fnielsen
Collaborator

fnielsen commented Feb 8, 2024

I have tried one here by just changing the endpoint: https://synia.toolforge.org/#author/Q18618629. There were issues with the "Recent publications from experimental scholarly endpoint" table:

  • The "instance of" information is not there.
  • The split does not include, e.g., chapter, working paper, ...
  • Labeling does not work for non-scientific papers.

@egonw
Collaborator

egonw commented Feb 10, 2024

just changing the endpoint

@fnielsen, I think the split will mean federated SPARQL queries over the two servers. Did you already try that? Where can I find the SPARQL of the "Recent publications from experimental scholarly endpoint" table? I could not spot the link to the matching query service. Does it not have a QS link for the individual endpoints yet? That would make development a lot more difficult.

@egonw
Collaborator

egonw commented Feb 10, 2024

Can all of them be rewritten without adverse effects like timeout?

@dpriskorn, no, I don't think so. This initial split suffers from the problem we highlighted in a telecon last year: queries break and cannot easily be fixed within SPARQL. The key problem is that statements (like P2860) have their subject and object split over the two QS-s. This will require figuring out which statements have content in both (or multiple) QS-s, then fusing that data, before moving on to the next statement.

An example query that returns empty results is this one: https://w.wiki/98JL

@egonw
Collaborator

egonw commented Feb 10, 2024

I just tried rewriting it, but it's nasty because essential info is split over the two resources (to be run at https://query-scholarly-experimental.wikidata.org/):

select ?year (count(distinct ?work) as ?number_of_publications) ?type_ ?role where {
  # get the intention types from the "main" WDQS
  SERVICE <https://query-main-experimental.wikidata.org/sparql> {
    ?intention wdt:P31 wd:Q96471816 .
  }

  # get the citing works from the "main" WDQS
  {
    SERVICE <https://query-main-experimental.wikidata.org/sparql> {
      select distinct ?work (min(?years) as ?year) ?type_ where {
        ?work wdt:P577 ?dates ;
              p:P2860 / pq:P3712 ?intention .
        bind(str(year(?dates)) as ?years) .
        OPTIONAL {
          ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
        }
      } group by ?work ?type_
    }
  }
  UNION
  # get the citing works from the "scholarly" WDQS
  {
    select distinct ?work (min(?years) as ?year) ?type_ where {
      ?work wdt:P577 ?dates ;
            p:P2860 / pq:P3712 ?intention .
      bind(str(year(?dates)) as ?years) .
      OPTIONAL {
        ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
      }
    }
    group by ?work ?type_
  }

  hint:Prior hint:runFirst true .

  # now look up some additional info (only available from the "main" WDQS)
  SERVICE <https://query-main-experimental.wikidata.org/sparql> {
    ?work wdt:P1433 ?venue_ . ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
    MINUS { ?venue_ wdt:P31 wd:Q1143604 }
  }
  bind(
    coalesce(
      if(bound(?type_), ?venue,
      'other source')
    ) as ?role
  )
}
group by ?year ?type_ ?role
order by ?year

It times out.

@egonw
Collaborator

egonw commented Feb 10, 2024

When I run the query from the main endpoint I get closer, and it runs in reasonable time:

select ?year (count(distinct ?work) as ?number_of_publications) ?type_ ?venue_ ?role where {
  # get the intention types from the "main" WDQS
  ?intention wdt:P31 wd:Q96471816 .

  # get the articles from the "scholarly" WDQS
  {
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      select distinct ?work (min(?years) as ?year) ?type_ where {
        ?work wdt:P577 ?dates ;
              p:P2860 / pq:P3712 ?intention .
        bind(str(year(?dates)) as ?years) .
        OPTIONAL {
          ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
        }
      } group by ?work ?type_
    }
  }
  UNION
  # get the articles from the "main" WDQS
  {
    select distinct ?work (min(?years) as ?year) ?type_ where {
      ?work wdt:P577 ?dates ;
            p:P2860 / pq:P3712 ?intention .
      bind(str(year(?dates)) as ?years) .
      OPTIONAL {
        ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
      }
    }
    group by ?work ?type_
  }

  hint:Prior hint:runFirst true .

  # now look up some additional info: venue
  # (this query runs on "main", so this lookup only sees the "main" WDQS)
  OPTIONAL {
    ?work wdt:P1433 ?venue_ . ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
    MINUS { ?venue_ wdt:P31 wd:Q1143604 }
  }
  bind(
    coalesce(
      if(bound(?type_), ?venue,
      'other source')
    ) as ?role
  )
}
group by ?year ?type_ ?venue_ ?role
order by ?year

But you can see from the results that the venue information is split over the two QS-s (the above query missed some venue info). As soon as I try looking up the venue info from both instances, it times out again.
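
An alternative to fusing inside one federated query would be to fuse client-side: run the per-endpoint work queries separately, then ask each endpoint for venue labels via a VALUES block and merge the answers in application code. A minimal sketch under those assumptions; the chunk size and helper names are invented, and the endpoints are the experimental ones above:

import requests

MAIN = "https://query-main-experimental.wikidata.org/sparql"
SCHOLARLY = "https://query-scholarly-experimental.wikidata.org/sparql"

def run(endpoint, query):
    response = requests.get(
        endpoint, params={"query": query, "format": "json"}, timeout=60
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

def venue_labels(endpoint, work_qids):
    """Fetch English venue labels for the given works from one endpoint."""
    labels = {}
    for i in range(0, len(work_qids), 200):  # chunked to keep each query small
        values = " ".join(f"wd:{qid}" for qid in work_qids[i:i + 200])
        query = f"""
        SELECT ?work ?venue WHERE {{
          VALUES ?work {{ {values} }}
          ?work wdt:P1433 ?venue_ .
          ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
          MINUS {{ ?venue_ wdt:P31 wd:Q1143604 }}
        }}"""
        for row in run(endpoint, query):
            qid = row["work"]["value"].rsplit("/", 1)[-1]
            labels[qid] = row["venue"]["value"]
    return labels

def fused_venues(work_qids):
    """Merge venue info from both instances, since the statements are split."""
    merged = venue_labels(MAIN, work_qids)
    for qid, label in venue_labels(SCHOLARLY, work_qids).items():
        merged.setdefault(qid, label)
    return merged

Whether this beats the federated SERVICE approach would have to be measured; it trades one complex query for several simple ones.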

@egonw
Collaborator

egonw commented Feb 11, 2024

(crossposted from the Telegram Wikicite channel)

I just finished a query that shows how content is scattered over the two splits: https://w.wiki/98km. One of the powers of SPARQL is being able to search the linking (the "web"), unlike, for example, label searching. But if we search for a link (a Statement, in Wikidata terms), this becomes hard when those links are split too: you effectively have to search in both QSs. This is what I tried yesterday in #2423 (comment) (above), but since a SPARQL query commonly includes a pattern of two or more links, this is not trivial at all. Indeed, I ran into timeouts. I do not think this is special to Scholia; it applies to any tool that uses SPARQL where Statements are split over the two instances. Of course, that query just looks at one direct claim, and this GitHub issue shows that the "two or more" case comes with qualifiers on top.

Basically, splitting works if the content can be split. But the power of Wikidata is the complexity of human language, but then with machine readability, and qualifiers are all over the place. So, when I say "I feel that Wikidata has failed", more accurately I should say "the query service has failed", and I think the QS is an essential part of the ecosystem (also for Wikibase, for that matter). This is just opinion.

Let me stress: the problems are real and we need a real solution. That real solution is hard, and this splitting is not the first solution being sought. The Scholia project has been actively looking into alternatives, including a dedicated WDQS, a QS with a lag (but see notes about loading times being days rather than hours), and the subsetting work (see https://content.iospress.com/articles/semantic-web/sw233491). It is complicated. Five years ago I was naive and optimistic that computer science would develop a scalable triple store with a SPARQL endpoint that meets Wikidata's needs; sadly, the CS field did not live up to my hopes. So, my tears (":(") are real. The scalability problems that Wikidata is seeing are important, and to me very serious and nothing to joke about.

@fnielsen
Collaborator

Where can I find the SPARQL of the Recent publications from experimental scholarly endpoint?

https://synia.toolforge.org/#author/Q18618629 - third table

@egonw
Collaborator

egonw commented Feb 12, 2024

https://synia.toolforge.org/#author/Q18618629 - third table

Yes, got that :) But unlike the other tables, this one does not have a link to the matching query service. I wanted to see the SPARQL itself, not the results.

I think I should be able to find it in the Wiki itself, but I wrote the Synia setup long enough ago that I cannot easily find it.

@fnielsen
Collaborator

I wanted to see the SPARQL itself, not the results.

https://www.wikidata.org/wiki/Wikidata:Synia:author#Recent_publications_from_experimental_scholarly_endpoint

@egonw
Collaborator

egonw commented Feb 12, 2024

https://www.wikidata.org/wiki/Wikidata:Synia:author#Recent_publications_from_experimental_scholarly_endpoint

Thanks. Now that I have seen the query, I think that one runs into exactly the problem I experienced and tried to describe.

@dpriskorn
Author

Instead of rewriting queries or bothering with the split, I suggest we focus on running QLever ourselves and improving it to do what we want regardless of Wikidata's growth. See the discussion I started in #2425
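
For orientation, QLever already serves a public endpoint loaded with Wikidata that speaks the standard SPARQL protocol, so ordinary client code can point at it unchanged; a self-hosted instance would expose the same interface. A minimal sketch (note that QLever, unlike WDQS, does not predeclare the Wikidata prefixes):

import requests

# Public demo instance; a self-hosted QLever would expose the same protocol.
QLEVER = "https://qlever.cs.uni-freiburg.de/api/wikidata"

# QLever does not predeclare the WDQS prefixes, so spell them out.
query = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT (COUNT(*) AS ?works) WHERE {
  ?work wdt:P31 wd:Q13442814 .  # instances of "scholarly article"
}
"""

response = requests.get(
    QLEVER,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)
response.raise_for_status()
print(response.json()["results"]["bindings"][0]["works"]["value"])

The public instance is loaded from dumps, so it lags live Wikidata; that trade-off matches the static-dataset point made earlier in this thread.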

@physikerwelt
Contributor

I discussed that briefly with @Daniel-Mietchen today. To me it seems that the one-time split does not conceptually solve any scaling issues and should not be done in the way it is currently planned. If done, it should be done transparently to the user, i.e. the query might be executed on different back-ends, but it should not be required to change the query.

@egonw
Collaborator

egonw commented Mar 2, 2024

i.e. the query might be executed on different back-ends, but it should not be required to change the query.

What I found is that this is not trivial at all: you cannot simply run a query on both endpoints and then merge the results.
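
A toy illustration of why the naive merge fails: when a query's triple patterns span the split, each endpoint alone returns nothing, so a union of the per-endpoint results is still empty; the answer only appears once you join across the two graphs. A sketch with in-memory stand-ins for the two stores (triples and names are invented):

# Each store holds only part of the chain: the citation lives in "main",
# the venue statement in "scholarly".
main_graph = {("Q_review", "cites", "Q_paper")}
scholarly_graph = {("Q_paper", "published_in", "Q_journal")}

def match(graph, pattern):
    """Bindings for one (s, p, o) pattern; None entries act as variables."""
    s, p, o = pattern
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def cited_with_venue(graph):
    """Two-pattern join: works that are cited AND have a known venue."""
    return [(cite[2], venue[2])
            for cite in match(graph, (None, "cites", None))
            for venue in match(graph, (cite[2], "published_in", None))]

# Running the whole query on each endpoint and unioning the results: empty.
print(cited_with_venue(main_graph) + cited_with_venue(scholarly_graph))  # []

# Joining across both graphs finds it; this is the fusion step that
# federation (or client-side merging) has to perform for every such link.
print(cited_with_venue(main_graph | scholarly_graph))  # [('Q_paper', 'Q_journal')]

With qualifier paths like p:P2860/pq:P3712 the chain is longer still, so every extra hop multiplies the cross-endpoint work.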

@WolfgangFahl
Collaborator

See #2412 for a mitigation path.
