
How does the Wikidata graph split affect scholia? #2423

Open
dpriskorn opened this issue Feb 8, 2024 · 15 comments
Labels

  • completeness: how complete the data in Scholia/Wikidata is compared to what's out there
  • performance: the way Scholia treats the machines using it
  • question: something looking for an answer
  • SPARQL: the way Scholia queries Wikidata

Comments

@dpriskorn

Context

See https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/IIA5LVHBYK45FSMLPIVZI6WXA5QSRPF4/

Questions

How many queries need to be rewritten?
Can all of them be rewritten without adverse effects like timeout?
How much effort is it to rewrite?
Can the rewriting be automated somehow?

dpriskorn added the question label on Feb 8, 2024
dpriskorn changed the title from "how does the graph split affect scholia?" to "How does the Wikidata graph split affect scholia?" on Feb 8, 2024
@Daniel-Mietchen
Member

Thanks for keeping an eye on this, @dpriskorn! The link in your post does not work for me, so here is another link to what is probably the same message.

We are looking into the matter and do not have good answers to your questions yet, but here are some guesstimates:

  1. How many queries need to be rewritten?
  2. Can all of them be rewritten without adverse effects like timeout?
    • Not if the timeout settings remain the same, since federation adds complexity. Working with a static dataset might have some performance benefits, though.
  3. How much effort is it to rewrite?
    • We need to review all queries to determine whether they are affected, i.e. whether they (a) run on either of the new main or scholarly endpoints and (b) give the same results as the full endpoint. This could probably be largely automated in a matter of hours by someone who understands the matter (see the sketch after this list).
    • Any queries that fail to run, or whose results differ in substance, would need to be rewritten. Assuming an average of 5-10 minutes per query, that means something on the order of a person-week of work time. I suspect that some queries might not work usefully at all, so we would need to change their functionality.
    • Perhaps we need a dedicated hackathon just for such adaptations of Scholia queries.
  4. Can the rewriting be automated somehow?
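
To make point 3 concrete, here is a minimal sketch of what such an automated review could look like: run every stored query against the full endpoint and both experimental endpoints, then flag failures and result mismatches. The endpoint URLs are the experimental ones quoted later in this thread; the query directory is illustrative, and Scholia's query templates would first need their placeholders filled in before they can execute.

import glob
import json
import requests

FULL = "https://query.wikidata.org/sparql"
MAIN = "https://query-main-experimental.wikidata.org/sparql"
SCHOLARLY = "https://query-scholarly-experimental.wikidata.org/sparql"

def run(endpoint, query):
    """Run a SPARQL query; return the result bindings, or None on error/timeout."""
    try:
        response = requests.get(
            endpoint,
            params={"query": query, "format": "json"},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["results"]["bindings"]
    except Exception:
        return None

def canonical(bindings):
    """Order-insensitive form of a result set, for comparison."""
    return sorted(json.dumps(b, sort_keys=True) for b in bindings)

# Illustrative path; the real templates contain {{ ... }} placeholders
# that would need substituting with sample QIDs before running.
for path in glob.glob("scholia/app/templates/*.sparql"):
    with open(path) as f:
        query = f.read()
    reference = run(FULL, query)
    if reference is None:
        print(f"{path}: fails even on the full endpoint")
        continue
    for name, endpoint in (("main", MAIN), ("scholarly", SCHOLARLY)):
        candidate = run(endpoint, query)
        if candidate is None:
            print(f"{path}: fails on {name}")
        elif canonical(candidate) != canonical(reference):
            print(f"{path}: results differ on {name}")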

Daniel-Mietchen added the SPARQL, performance, and completeness labels on Feb 8, 2024
@fnielsen
Collaborator

fnielsen commented Feb 8, 2024

I have tried one here by just changing the endpoint: https://synia.toolforge.org/#author/Q18618629. There were issues with the "Recent publications from experimental scholarly endpoint" table:

  • The "instance of" information is not there.
  • The split does not include, e.g., chapter, working paper, ...
  • Labeling does not work for non-scientific papers.

@egonw
Collaborator

egonw commented Feb 10, 2024

just changing the endpoint

@fnielsen, I think the split will mean federated SPARQL queries over the two servers. Did you already try that? Where can I find the SPARQL of the "Recent publications from experimental scholarly endpoint" table? I could not spot the link to the matching query service. Does it not have a QS link for the individual endpoints yet? That would make development a lot more difficult.

@egonw
Collaborator

egonw commented Feb 10, 2024

Can all of them be rewritten without adverse effects like timeout?

@dpriskorn, no, I don't think so. This initial split suffers from the problem we highlighted in a telecon last year: queries break and cannot easily be fixed within SPARQL. The key problem is that statements (like P2860) have their subject and object split over the two QS-s. This will require figuring out which statements have content in both (or multiple) QS-s, then fusing that data, before moving on to the next statement.

An example query that returns empty results is this one: https://w.wiki/98JL

@egonw
Collaborator

egonw commented Feb 10, 2024

I just tried rewriting it, but it's nasty because essential info is split over the two resources (to be run at https://query-scholarly-experimental.wikidata.org/):

select ?year (count(distinct ?work) as ?number_of_publications) ?type_ ?role where {
  # get the intention types from the "main" WDQS
  SERVICE <https://query-main-experimental.wikidata.org/sparql> {
    ?intention wdt:P31 wd:Q96471816 .
  }

  # get the citing works from the "main" WDQS
  {
    SERVICE <https://query-main-experimental.wikidata.org/sparql> {
      select distinct ?work (min(?years) as ?year) ?type_ where {
        ?work wdt:P577 ?dates ;
              p:P2860 / pq:P3712 ?intention .
        bind(str(year(?dates)) as ?years) .
        OPTIONAL {
          ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
        }
      } group by ?work ?type_
    }
  }
  UNION
  # get the citing works from the "scholarly" WDQS
  {
    select distinct ?work (min(?years) as ?year) ?type_ where {
      ?work wdt:P577 ?dates ;
            p:P2860 / pq:P3712 ?intention .
      bind(str(year(?dates)) as ?years) .
      OPTIONAL {
        ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
      }
    }
    group by ?work ?type_
  }

  hint:Prior hint:runFirst true .

  # now look up some additional info (only available from the "main" WDQS)
  SERVICE <https://query-main-experimental.wikidata.org/sparql> {
    ?work wdt:P1433 ?venue_ . ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
    MINUS { ?venue_ wdt:P31 wd:Q1143604 }
  }
  bind(
    coalesce(
      if(bound(?type_), ?venue,
      'other source')
    ) as ?role
  )
}
group by ?year ?type_ ?role
order by ?year

It times out.

@egonw
Collaborator

egonw commented Feb 10, 2024

When I run the query from the main endpoint I get closer, and it runs in reasonable time:

select ?year (count(distinct ?work) as ?number_of_publications) ?type_ ?venue_ ?role where {
  # get the intention types from the "main" WDQS
  ?intention wdt:P31 wd:Q96471816 .

  # get the articles from the "scholarly" WDQS
  {
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      select distinct ?work (min(?years) as ?year) ?type_ where {
        ?work wdt:P577 ?dates ;
              p:P2860 / pq:P3712 ?intention .
        bind(str(year(?dates)) as ?years) .
        OPTIONAL {
          ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
        }
      } group by ?work ?type_
    }
  }
  UNION
  # get the articles from the "main" WDQS
  {
    select distinct ?work (min(?years) as ?year) ?type_ where {
      ?work wdt:P577 ?dates ;
            p:P2860 / pq:P3712 ?intention .
      bind(str(year(?dates)) as ?years) .
      OPTIONAL {
        ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
      }
    }
    group by ?work ?type_
  }

  hint:Prior hint:runFirst true .

  # now look up some additional info: venue
  # (this query runs on "main", so this lookup only sees the "main" WDQS)
  OPTIONAL {
    ?work wdt:P1433 ?venue_ . ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
    MINUS { ?venue_ wdt:P31 wd:Q1143604 }
  }
  bind(
    coalesce(
      if(bound(?type_), ?venue,
      'other source')
    ) as ?role
  )
}
group by ?year ?type_ ?venue_ ?role
order by ?year

But you can see from the results that the venue information is split over the two QS-s (the above query missed some venue info). As soon as I try looking up the venue info from both instances, it times out again.
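
An alternative to fusing inside one federated query would be to fuse client-side: run the per-endpoint work queries separately, then ask each endpoint for venue labels via a VALUES block and merge the answers in application code. A minimal sketch under those assumptions; the chunk size and helper names are invented, and the endpoints are the experimental ones above:

import requests

MAIN = "https://query-main-experimental.wikidata.org/sparql"
SCHOLARLY = "https://query-scholarly-experimental.wikidata.org/sparql"

def run(endpoint, query):
    response = requests.get(
        endpoint, params={"query": query, "format": "json"}, timeout=60
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

def venue_labels(endpoint, work_qids):
    """Fetch English venue labels for the given works from one endpoint."""
    labels = {}
    for i in range(0, len(work_qids), 200):  # chunked to keep each query small
        values = " ".join(f"wd:{qid}" for qid in work_qids[i:i + 200])
        query = f"""
        SELECT ?work ?venue WHERE {{
          VALUES ?work {{ {values} }}
          ?work wdt:P1433 ?venue_ .
          ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
          MINUS {{ ?venue_ wdt:P31 wd:Q1143604 }}
        }}"""
        for row in run(endpoint, query):
            qid = row["work"]["value"].rsplit("/", 1)[-1]
            labels[qid] = row["venue"]["value"]
    return labels

def fused_venues(work_qids):
    """Merge venue info from both instances, since the statements are split."""
    merged = venue_labels(MAIN, work_qids)
    for qid, label in venue_labels(SCHOLARLY, work_qids).items():
        merged.setdefault(qid, label)
    return merged

Whether this beats the federated SERVICE approach would have to be measured; it trades one complex query for several simple ones.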

@egonw
Collaborator

egonw commented Feb 11, 2024

(crossposted from the Telegram Wikicite channel)

I just finished a query that shows how content is scattered over the two splits: https://w.wiki/98km. One of the powers of SPARQL is being able to search the linking (the "web"), unlike, for example, label searching. But if we search for a link (a Statement, in Wikidata terms), this becomes hard when those links are split too: you effectively have to search in both QSs. This is what I tried yesterday in #2423 (comment) (above), but since a SPARQL query commonly includes a pattern of two or more links, this is not trivial at all. Indeed, I ran into timeouts. I do not think this is special to Scholia; it applies to any tool that uses SPARQL where Statements are split over the two instances. Of course, that query just looks at one direct claim, and this GitHub issue shows that the "two or more" case comes with qualifiers on top.

Basically, splitting works if the content can be split. But the power of Wikidata is the complexity of human language, but then with machine readability, and qualifiers are all over the place. So, when I say "I feel that Wikidata has failed", more accurately I should say "the query service has failed", and I think the QS is an essential part of the ecosystem (also for Wikibase, for that matter). This is just opinion.

Let me stress: the problems are real and we need a real solution. That real solution is hard, and this splitting is not the first solution being sought. The Scholia project has been actively looking into alternatives, including a dedicated WDQS, a QS with a lag (but see notes about loading times being days rather than hours), and the subsetting work (see https://content.iospress.com/articles/semantic-web/sw233491). It is complicated. Five years ago I was naive and optimistic that computer science would develop a scalable triple store with a SPARQL endpoint that meets Wikidata's needs; sadly, the CS field did not live up to my hopes. So, my tears (":(") are real. The scalability problems that Wikidata is seeing are important, and to me very serious and nothing to joke about.

@fnielsen
Collaborator

Where can I find the SPARQL of the Recent publications from experimental scholarly endpoint?

https://synia.toolforge.org/#author/Q18618629 - third table

@egonw
Collaborator

egonw commented Feb 12, 2024

https://synia.toolforge.org/#author/Q18618629 - third table

Yes, got that :) But unlike the other tables, this one does not have a link to the matching query service. I wanted to see the SPARQL itself, not the results.

I think I should be able to find it in the Wiki itself, but I wrote the Synia setup long enough ago that I cannot easily find it.

@fnielsen
Collaborator

I wanted to see the SPARQL itself, not the results.

https://www.wikidata.org/wiki/Wikidata:Synia:author#Recent_publications_from_experimental_scholarly_endpoint

@egonw
Collaborator

egonw commented Feb 12, 2024

https://www.wikidata.org/wiki/Wikidata:Synia:author#Recent_publications_from_experimental_scholarly_endpoint

Thanks. Now that I have seen the query, I think that one runs into exactly the problem I experienced and tried to describe.

@dpriskorn
Author

Instead of rewriting queries or bothering with the split, I suggest we focus on running QLever ourselves and improving it to do what we want regardless of Wikidata's growth. See the discussion I started in #2425
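
For orientation, QLever already serves a public endpoint loaded with Wikidata that speaks the standard SPARQL protocol, so ordinary client code can point at it unchanged; a self-hosted instance would expose the same interface. A minimal sketch (note that QLever, unlike WDQS, does not predeclare the Wikidata prefixes):

import requests

# Public demo instance; a self-hosted QLever would expose the same protocol.
QLEVER = "https://qlever.cs.uni-freiburg.de/api/wikidata"

# QLever does not predeclare the WDQS prefixes, so spell them out.
query = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT (COUNT(*) AS ?works) WHERE {
  ?work wdt:P31 wd:Q13442814 .  # instances of "scholarly article"
}
"""

response = requests.get(
    QLEVER,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)
response.raise_for_status()
print(response.json()["results"]["bindings"][0]["works"]["value"])

The public instance is loaded from dumps, so it lags live Wikidata; that trade-off matches the static-dataset point made earlier in this thread.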

@physikerwelt
Contributor

I discussed that briefly with @Daniel-Mietchen today. To me it seems that the one-time split does not conceptually solve any scaling issues and should not be done in the way it is currently planned. If done, it should be done transparently to the user, i.e. the query might be executed on different back-ends, but it should not be required to change the query.

@egonw
Collaborator

egonw commented Mar 2, 2024

i.e. the query might be executed on different back-ends, but it should not be required to change the query.

What I found is that this is not trivial at all: you cannot simply run a query on both endpoints and then merge the results.
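
A toy illustration of why the naive merge fails: when a query's triple patterns span the split, each endpoint alone returns nothing, so a union of the per-endpoint results is still empty; the answer only appears once you join across the two graphs. A sketch with in-memory stand-ins for the two stores (triples and names are invented):

# Each store holds only part of the chain: the citation lives in "main",
# the venue statement in "scholarly".
main_graph = {("Q_review", "cites", "Q_paper")}
scholarly_graph = {("Q_paper", "published_in", "Q_journal")}

def match(graph, pattern):
    """Bindings for one (s, p, o) pattern; None entries act as variables."""
    s, p, o = pattern
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def cited_with_venue(graph):
    """Two-pattern join: works that are cited AND have a known venue."""
    return [(cite[2], venue[2])
            for cite in match(graph, (None, "cites", None))
            for venue in match(graph, (cite[2], "published_in", None))]

# Running the whole query on each endpoint and unioning the results: empty.
print(cited_with_venue(main_graph) + cited_with_venue(scholarly_graph))  # []

# Joining across both graphs finds it; this is the fusion step that
# federation (or client-side merging) has to perform for every such link.
print(cited_with_venue(main_graph | scholarly_graph))  # [('Q_paper', 'Q_journal')]

With qualifier paths like p:P2860/pq:P3712 the chain is longer still, so every extra hop multiplies the cross-endpoint work.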

@WolfgangFahl
Collaborator

See #2412 for a mitigation path.
