convert templated queries to named queries and separate concerns by introducing named query middleware #2412

Open
WolfgangFahl opened this issue Jan 26, 2024 · 12 comments
Labels
enhancement some suggestions to improve Scholia

Comments

@WolfgangFahl
Collaborator

WolfgangFahl commented Jan 26, 2024

Is your feature request related to a problem? Please describe.
Blazegraph is getting close to the 4TB limit. The Wikimedia Foundation is testing a graph split in Q1/2024.
This will eventually, and likely, force the use of:

  • federated queries
  • different SPARQL endpoint(s)
  • different triplestores
  • different flavors of SPARQL
  • maybe even different query languages

In addition, the official WDQS already has a limiting timeout of 1 min.

Describe the solution you'd like

  • Change most or all relevant queries to named queries with parameters
  • Call a middleware to run the queries
  • Let the middleware do the necessary translation
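
The three steps above could look roughly like the following minimal sketch. The `NamedQuery` class, the `REGISTRY` dictionary, the `author_works` toy query and the `{{ q }}` placeholder syntax are illustrative assumptions, not the actual API of Scholia or pyLodStorage. A real middleware would additionally choose the endpoint and translate the query where needed.

```python
import re
from dataclasses import dataclass

# Assumption: parameters are Wikidata item identifiers such as Q80.
QID = re.compile(r"Q[1-9][0-9]*")


@dataclass(frozen=True)
class NamedQuery:
    """A query known by name, with placeholders for its parameters."""

    name: str
    template: str
    params: tuple

    def render(self, **values: str) -> str:
        """Fill in the placeholders after validating every parameter."""
        sparql = self.template
        for param in self.params:
            value = values.get(param)
            if value is None or not QID.fullmatch(value):
                raise ValueError(f"invalid value for parameter {param!r}: {value!r}")
            sparql = sparql.replace("{{ " + param + " }}", value)
        return sparql


# Hypothetical registry; the query text is a toy example (P50 = author).
REGISTRY = {
    "author_works": NamedQuery(
        name="author_works",
        template="SELECT ?work WHERE { ?work wdt:P50 wd:{{ q }} }",
        params=("q",),
    )
}


def run_named_query(name: str, **values: str) -> str:
    """Middleware entry point; in this sketch it only returns the rendered SPARQL."""
    return REGISTRY[name].render(**values)


print(run_named_query("author_works", q="Q80"))
```

The caller only ever sees a query name and a parameter, so the template behind `author_works` can change without touching the calling code.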

Describe alternatives you've considered
Get your own copy of Wikidata and use it; see the CEUR-WS Vol-3262 paper "Getting and hosting your own copy of Wikidata".

Additional context

Search Platform Office Hours 2023-12-06

Named Query handling:

Queries may be referenced these days with e.g. short URLs, which are supported by both the Wikidata Query Service and QLever. Personally I think it would be good to go one step further and have "named queries". See e.g. https://cr.bitplan.com/index.php/List_of_Queries as an example for queries. Scholia also uses a similar idea internally. See https://github.com/WDscholia/scholia/tree/master/scholia/app/templates. Quite a few of these queries have only a few parameters. E.g. https://github.com/WDscholia/scholia/blob/master/scholia/app/templates/author_topics.sparql only takes a single q identifier as input.

In my own pyLodStorage project https://pypi.org/project/pyLodStorage/ I already offer named queries, but without parameters. WolfgangFahl/pyLoDStorage#113 is the issue to parameterize the queries. The queries are described in YAML files in this solution. I imagine a RESTful service that takes a query name and a set of parameters and returns the result in a SPARQL-server-compatible way. This would mean that the details of the query (e.g. whether it is federated or on which endpoint it runs) are hidden. I believe that this approach would work well with the intended Wikidata split attempt in Q1/2024.
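
As a rough sketch of how such a service could hide the endpoint, the snippet below resolves a query name and parameters to a SPARQL protocol GET URL. `QUERY_SPECS` stands in for entries loaded from YAML files; its layout, the `author_works` toy query and the `build_request_url` helper are assumptions made for illustration, not pyLodStorage's format.

```python
from urllib.parse import urlencode

# Hypothetical query specs, standing in for entries loaded from YAML files.
QUERY_SPECS = {
    "author_works": {
        "endpoint": "https://query.wikidata.org/sparql",
        # Doubled braces are literal braces for str.format; {q} is the parameter.
        "sparql": "SELECT ?work WHERE {{ ?work wdt:P50 wd:{q} }}",
    },
}


def build_request_url(name: str, params: dict) -> str:
    """Resolve a query name plus parameters to a SPARQL protocol GET URL."""
    spec = QUERY_SPECS[name]  # raises KeyError for unknown query names
    for key, value in params.items():
        # Crude guard against injecting SPARQL through a parameter value.
        if not value.isalnum():
            raise ValueError(f"unsafe value for parameter {key!r}: {value!r}")
    sparql = spec["sparql"].format(**params)
    return spec["endpoint"] + "?" + urlencode({"query": sparql})


print(build_request_url("author_works", {"q": "Q80"}))
```

Moving a query to another endpoint, or replacing it with a federated variant, would then only change the spec entry and not the client.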

Links:

Previous analysis of blazegraph alternatives:

Qlever federation

Scaling Wikidata Query Service - Split the Graph experiment

@WolfgangFahl WolfgangFahl added the enhancement some suggestions to improve Scholia label Jan 26, 2024
@WolfgangFahl
Collaborator Author

WolfgangFahl commented Jan 26, 2024

@WolfgangFahl
Collaborator Author

see also #2063 and ad-freiburg/qlever#859

@fnielsen
Collaborator

I am trying to understand this. Do you propose the use of FROM against a special endpoint that distributes queries? Where does federation come in?

@WolfgangFahl
Collaborator Author

WolfgangFahl commented Jan 29, 2024

@fnielsen the intention is information hiding: not revealing what the actual query looks like. Take
author_events as an example: that query has the name "author_events" and a single QID parameter.

The specific query for your personal QID Q20980928 on QLever would be e.g. https://qlever.cs.uni-freiburg.de/wikidata/084HGc, and it does not run out of the box. The query on Wikidata does give 71 results, but the URL shortening fails, so I can't give a short link here, and I purposely don't intend to show the details of the query: you'd just be interested in the result.

Our pyLodStorage library already allows commands such as:

sparqlquery -qp wikidata.yaml -qn author_events_fan -f github

which will pick up the query specification from a YAML file containing the author_events_fan named query spec (see result below).
The proposal here is to offer the same behavior as a SPARQL-endpoint-compatible web service that hides all technical details. That way, if a query needs to be rewritten to a federated query, we may do so "behind the scenes" in the black box we are providing. We might even check whether the result is the same as without the federation.
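
The check of whether a federated rewrite returns the same result could be sketched as follows, assuming both endpoints answer in the standard SPARQL 1.1 JSON results format. The helper names `result_rows` and `same_results` and the toy data are made up for this example; rows are compared as a multiset, so row order is ignored but duplicate rows still count.

```python
from collections import Counter


def result_rows(sparql_json: dict) -> Counter:
    """Turn a SPARQL JSON result into a multiset of rows, ignoring row order."""
    rows = Counter()
    for binding in sparql_json["results"]["bindings"]:
        row = frozenset((var, cell["value"]) for var, cell in binding.items())
        rows[row] += 1
    return rows


def same_results(plain: dict, federated: dict) -> bool:
    """True if both results contain the same rows the same number of times."""
    return result_rows(plain) == result_rows(federated)


# Toy data in SPARQL 1.1 JSON results layout (values are placeholders).
plain = {"results": {"bindings": [
    {"event": {"type": "uri", "value": "http://example.org/e1"}},
    {"event": {"type": "uri", "value": "http://example.org/e2"}},
]}}
federated = {"results": {"bindings": list(reversed(plain["results"]["bindings"]))}}

print(same_results(plain, federated))
```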

author_events_fan

try it!

result

| date | event | eventLabel | eventUrl | roles | locations |
| --- | --- | --- | --- | --- | --- |
| 2023-09-13 | http://www.wikidata.org/entity/Q117314306 | First Wikibase Lexical Data Workshop | /event/Q117314306 | speaker | Centre for Translation Studies |
| 2023-05-28 | http://www.wikidata.org/entity/Q115781177 | ESWC 2023 | /event/Q115781177 | participant | Aldemar Knossos Royal |
| 2023-05-28 | http://www.wikidata.org/entity/Q115972632 | Semantic Technologies for Scientific, Technical and Legal Data | /event/Q115972632 | speaker, author | Aldemar Knossos Royal |
| 2023-05-28 | http://www.wikidata.org/entity/Q121334813 | ESWC 2023 Workshops and Tutorials | /event/Q121334813 | author | Chersonesos |
| 2023-05-22 | http://www.wikidata.org/entity/Q115497966 | The 24th Nordic Conference on Computational Linguistics | /event/Q115497966 | author | Tórshavn |
| 2023-05-11 | http://www.wikidata.org/entity/Q114794722 | Wiki Workshop 2023 | /event/Q114794722 | author | |
| 2022-11-30 | http://www.wikidata.org/entity/Q113956029 | Sprogteknologisk Konference 2022 | /event/Q113956029 | participant | Søndre Campus |
| 2022-11-07 | http://www.wikidata.org/entity/Q113954954 | Danish Data Science 2022 | /event/Q113954954 | participant | Hotel LEGOLAND |
| 2021-11-16 | http://www.wikidata.org/entity/Q108377974 | Sprogteknologisk Konference 2021 | /event/Q108377974 | participant | Søndre Campus |
| 2021-10-25 | http://www.wikidata.org/entity/Q106591764 | Deep Learning for Knowledge Graphs 2021 | /event/Q106591764 | program committee member | |
| 2021-10-24 | http://www.wikidata.org/entity/Q106429029 | The 2nd Wikidata Workshop | /event/Q106429029 | program committee member | |
| 2021-05-31 | http://www.wikidata.org/entity/Q102274071 | The 23rd Nordic Conference on Computational Linguistics | /event/Q102274071 | author | Reykjavík University |
| 2021-04-14 | http://www.wikidata.org/entity/Q104835330 | Wiki Workshop 2021 | /event/Q104835330 | participant | |
| 2020-11-02 | http://www.wikidata.org/entity/Q86530254 | The 1st Wikidata Workshop | /event/Q86530254 | program committee member | |
| 2020-10-26 | http://www.wikidata.org/entity/Q100741900 | WikiCite 2020 Virtual conference | /event/Q100741900 | speaker, participant | online |
| 2020-10-19 | http://www.wikidata.org/entity/Q98083516 | Combining Symbolic and Sub-symbolic methods and their Applications | /event/Q98083516 | program committee member | Galway |
| 2020-09-01 | http://www.wikidata.org/entity/Q102070516 | Digitally support Environment Assessment for Sustainable Development Goals | /event/Q102070516 | participant | |
| 2020-06-22 | http://www.wikidata.org/entity/Q79137947 | 7th Workshop on Linked Data in Linguistics | /event/Q79137947 | author | |
| 2020-06-01 | http://www.wikidata.org/entity/Q84430072 | 3rd Workshop on Quality of Open Data | /event/Q84430072 | program committee member | University of Colorado, at Colorado Springs |
| 2020-05-31 | http://www.wikidata.org/entity/Q83793571 | Deep Learning for Knowledge Graphs 2020 | /event/Q83793571 | program committee member | Chersonesos |
| 2020-05-26 | http://www.wikidata.org/entity/Q94759294 | WikiLunch | /event/Q94759294 | participant | German National Library of Science and Technology, World Wide Web, Wikiversity |
| 2020-05-26 | http://www.wikidata.org/entity/Q94495218 | #vBIB20 | /event/Q94495218 | speaker | German National Library of Science and Technology, World Wide Web |
| 2019-10-25 | http://www.wikidata.org/entity/Q42449814 | WikidataCon 2019 | /event/Q42449814 | speaker | Urania |
| 2019-10-09 | http://www.wikidata.org/entity/Q63686495 | Conference on Natural Language Processing 2019 | /event/Q63686495 | author | Kollegienhaus |
| 2019-09-09 | http://www.wikidata.org/entity/Q59917009 | SEMANTiCS 2019 | /event/Q59917009 | participant, author | Karlsruhe |
| 2019-08-01 | http://www.wikidata.org/entity/Q48010913 | Wikimania 2019 | /event/Q48010913 | speaker | Stockholm University |
| 2019-07-23 | http://www.wikidata.org/entity/Q61983755 | The 10th Global WordNet Conference | /event/Q61983755 | participant, author | Wrocław University of Science and Technology |
| 2019-06-26 | http://www.wikidata.org/entity/Q61141551 | 2nd Workshop on Quality of Open Data | /event/Q61141551 | program committee member | Seville |
| 2019-06-17 | http://www.wikidata.org/entity/Q59979937 | 5th International Conference on Computational Social Science | /event/Q59979937 | program committee member | University of Amsterdam |
| 2019-06-02 | http://www.wikidata.org/entity/Q60808888 | Workshop at ESWC 2019 on Deep Learning for Knowledge Graphs | /event/Q60808888 | program committee member | Grand Hotel Bernardin |
| 2019-06-02 | http://www.wikidata.org/entity/Q59620529 | ESWC 2019 | /event/Q59620529 | participant, author | Grand Hotel Bernardin |
| 2019-05-17 | http://www.wikidata.org/entity/Q44062313 | Wikimedia Hackathon 2019 | /event/Q44062313 | participant | National Library of Technology building |
| 2019-04-16 | http://www.wikidata.org/entity/Q63171054 | Women in Data Science Conference 2019 Copenhagen | /event/Q63171054 | participant | IT University of Copenhagen |
| 2019-03-29 | http://www.wikidata.org/entity/Q59848782 | Wikimedia Summit 2019 | /event/Q59848782 | participant | Mercure Hotel Berlin Tempelhof Airport |
| 2018-11-27 | http://www.wikidata.org/entity/Q55117737 | WikiCite 2018 | /event/Q55117737 | speaker, participant | David Brower Center |
| 2018-11-06 | http://www.wikidata.org/entity/Q55910942 | Second Linked Open Citation Database Workshop | /event/Q55910942 | speaker | Mannheim Palace |
| 2018-10-03 | http://www.wikidata.org/entity/Q56876300 | Research Output & Impact Analyzed and Visualized: Concluding Conference | /event/Q56876300 | speaker | DGI-byen |
| 2018-09-25 | http://www.wikidata.org/entity/Q48563023 | 10th International Conference on Social Informatics | /event/Q48563023 | program committee member | St. Petersburg |
| 2018-09-03 | http://www.wikidata.org/entity/Q51955163 | Workshop on Open Citations | /event/Q51955163 | speaker | University of Bologna |
| 2018-07-20 | http://www.wikidata.org/entity/Q48548111 | 1st Workshop on Quality of Open Data | /event/Q48548111 | program committee member | Berlin |
| 2018-07-12 | http://www.wikidata.org/entity/Q47482917 | 4th Annual International Conference on Computational Social Science | /event/Q47482917 | program committee member | Kellogg School of Management |
| 2018-06-04 | http://www.wikidata.org/entity/Q48621961 | 1st International Workshop on Deep Learning for Knowledge Graphs and Semantic Technologies | /event/Q48621961 | participant, author | Aldemar Knossos Royal |
| 2018-06-03 | http://www.wikidata.org/entity/Q54496448 | 3rd International Workshop on Geospatial Linked Data | /event/Q54496448 | participant, author | Aldemar Knossos Royal |
| 2018-06-03 | http://www.wikidata.org/entity/Q50290385 | ESWC 2018 | /event/Q50290385 | participant | Aldemar Knossos Royal |
| 2018-05-27 | http://www.wikidata.org/entity/Q47501229 | 11th International Conference on Chemical Structures | /event/Q47501229 | author | Noordwijkerhout |
| 2018-05-01 | http://www.wikidata.org/entity/Q30087264 | Wikimedia Hackathon 2018 | /event/Q30087264 | participant | Bellaterra Campus |
| 2018-04-24 | http://www.wikidata.org/entity/Q47035167 | Wiki Workshop 2018 | /event/Q47035167 | participant, author | Palais des congrès de Lyon |
| 2018-04-23 | http://www.wikidata.org/entity/Q48910401 | The Web Conference 2018 | /event/Q48910401 | participant, author | Palais des congrès de Lyon |
| 2018-04-20 | http://www.wikidata.org/entity/Q50132215 | Wikimedia Conference 2018 | /event/Q50132215 | participant | Mercure Hotel Berlin Tempelhof Airport |
| 2018-01-09 | http://www.wikidata.org/entity/Q64864052 | Teaching platform for developing and automatically tracking early stage literacy skill | /event/Q64864052 | participant | |
| 2017-11-17 | http://www.wikidata.org/entity/Q43254255 | 8th Language & Technology Conference | /event/Q43254255 | speaker, participant, author | Poznań |
| 2017-10-28 | http://www.wikidata.org/entity/Q37807682 | WikidataCon 2017 | /event/Q37807682 | speaker, participant | Tagesspiegel building |
| 2017-09-13 | http://www.wikidata.org/entity/Q48612170 | 9th International Conference on Social Informatics | /event/Q48612170 | program committee member | Wolfson College |
| 2017-09-07 | http://www.wikidata.org/entity/Q28052808 | 2017 Conference on Empirical Methods in Natural Language Processing | /event/Q28052808 | participant | Øksnehallen, DGI-byen, Copenhagen |
| 2017-05-28 | http://www.wikidata.org/entity/Q30090453 | ESWC 2017 | /event/Q30090453 | participant, author | Portorož |
| 2017-05-28 | http://www.wikidata.org/entity/Q113625218 | 1st International Workshop on Scientometrics | /event/Q113625218 | author | Portorož |
| 2017-05-28 | http://www.wikidata.org/entity/Q113744888 | 1st International Workshop on Enabling Decentralised Scholarly Communication | /event/Q113744888 | author | Portorož |
| 2017-05-19 | http://www.wikidata.org/entity/Q28053831 | Wikimedia Hackathon 2017 | /event/Q28053831 | participant | JUFA Wien City |
| 2017-03-31 | http://www.wikidata.org/entity/Q29169189 | Wikimedia Conference 2017 | /event/Q29169189 | participant | |
| 2017-01-01 | http://www.wikidata.org/entity/Q54856362 | WikiCite 2017 | /event/Q54856362 | participant | Vienna |
| 2016-06-16 | http://www.wikidata.org/entity/Q24632656 | The People's Meeting 2016 | /event/Q24632656 | participant | Allinge |
| 2016-05-17 | http://www.wikidata.org/entity/Q75540679 | Wiki Workshop 2016, ICWSM 2016 | /event/Q75540679 | author | Cologne |
| 2014-01-01 | http://www.wikidata.org/entity/Q14506843 | Wikimania 2014 | /event/Q14506843 | participant | Barbican Centre |
| 2012-05-28 | http://www.wikidata.org/entity/Q113505637 | 2nd Workshop on Semantic Publishing | /event/Q113505637 | author | Chersonesos |
| 2012-05-27 | http://www.wikidata.org/entity/Q42431329 | ESWC 2012 | /event/Q42431329 | author | Aldemar Knossos Royal |
| 2011-05-30 | http://www.wikidata.org/entity/Q113659299 | ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages | /event/Q113659299 | author | Heraklion |
| 2010-01-01 | http://www.wikidata.org/entity/Q14507062 | Wikimania 2010 | /event/Q14507062 | participant | Gdańsk |
| 2008-01-01 | http://www.wikidata.org/entity/Q11756041 | Wikimania 2008 | /event/Q11756041 | participant | Alexandria |
| 2004-12-13 | http://www.wikidata.org/entity/Q73025763 | Neural Information Processing Systems 2004 | /event/Q73025763 | author | Whistler, Vancouver |
| 2000-06-05 | http://www.wikidata.org/entity/Q75936725 | ICASSP 2000 | /event/Q75936725 | author | Istanbul |
| | http://www.wikidata.org/entity/Q114647284 | Wikidata WikiProject COVID-19 | /event/Q114647284 | participant | |

@WolfgangFahl
Collaborator Author

We have been hard at work on our Graph Split experiment [1], and we
now have a working graph split that is loaded onto 3 test servers. We
are running tests on a selection of queries from our logs to help
understand the impact of the split. We need your help to validate the
impact of various use cases and workflows around Wikidata Query
Service.

What is the WDQS Graph Split experiment?

We want to address the growing size of the Wikidata graph by splitting
it into 2 subgraphs of roughly half the size of the full graph, which
should support the growth of Wikidata for the next 5 years. This
experiment is about splitting the full Wikidata graph into a scholarly
articles subgraph and a “main” graph that contains everything else.

See our previous update for more details [2].

Who should care?

Anyone who uses WDQS through the UI or programmatically should check
the impact on their use cases, scripts, bots, code, etc.

What are those test endpoints?

We expose 3 test endpoints, for the full, main and scholarly articles
graphs. Those graphs are all created from the same dump and are not
live updated. This allows us to compare queries between the different
endpoints, with stable / non changing data (the data are from the
middle of October 2023).

The endpoints are:

Each of the endpoints is backed by a single dedicated server of
performance similar to the production WDQS servers. We don’t expect
performance to be representative of production due to the different
load and to the lack of updates on the test servers.

What kind of feedback is useful?

We expect queries that don’t require scholarly articles to work
transparently on the “main” subgraph. We expect queries that require
scholarly articles to need to be rewritten with SPARQL federation
between the “main” and scholarly subgraphs (federation is supported
for some external SPARQL servers already [3], this just happens to be
for internal server-to-server communication). We are doing tests and
analysis based on a sample of query logs.
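
To illustrate the kind of rewrite described above, here is a minimal sketch that wraps the scholarly part of a query in a SPARQL 1.1 `SERVICE` clause. The `federate` helper, the endpoint URL `https://example.org/scholarly/sparql` and the choice of which triple pattern belongs to which subgraph are placeholders assumed for illustration; they are not the actual test endpoints or split rules.

```python
def federate(main_pattern: str, scholarly_pattern: str, scholarly_endpoint: str) -> str:
    """Build a query that runs on the main graph and federates to the scholarly one."""
    return (
        "SELECT * WHERE {\n"
        f"  {main_pattern}\n"
        f"  SERVICE <{scholarly_endpoint}> {{\n"
        f"    {scholarly_pattern}\n"
        "  }\n"
        "}"
    )


query = federate(
    # Assumed to live in the main graph: a statement about the author item.
    main_pattern="wd:Q80 wdt:P106 ?occupation .",
    # Assumed to live in the scholarly graph: works that name the author.
    scholarly_pattern="?work wdt:P50 wd:Q80 .",
    # Placeholder URL, not one of the real test endpoints.
    scholarly_endpoint="https://example.org/scholarly/sparql",
)
print(query)
```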

We want to hear about:

  • General use cases or classes of queries which break under federation
  • Bots or applications that need significant rewrite of queries to work
    with federation
  • And also about use cases that work just fine!

Examples of queries and pointers to code will be helpful in your feedback.

Where should feedback be sent?

You can reach out to us using the project’s talk page [1], the
Phabricator ticket for community feedback [4] or by pinging directly
Sannita (WMF) [5].

Will feedback be taken into account?

Yes! We will review feedback and it will influence our path forward.
That being said, there are limits to what is possible. The size of the
Wikidata graph is a threat to the stability of WDQS and thus a threat
to the whole Wikidata project. Scholarly articles is the only split we
know of that would reduce the graph size sufficiently. We can work
together on providing support for a migration, on reviewing the rules
used for the graph split, but we can’t just ignore the problem and
continue with a WDQS that provides transparent access to the full
Wikidata graph.

Have fun!

  Guillaume

[1] https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split
[2] https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/October_2023_scaling_update
[3] https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Federation
[4] https://phabricator.wikimedia.org/T356773
[5] https://www.wikidata.org/wiki/User:Sannita_(WMF)

Guillaume Lederrey (he/him)
Engineering Manager
Wikimedia Foundation

@WolfgangFahl
Collaborator Author

There is now a Wikimedia Hackathon 2024 project task for this https://phabricator.wikimedia.org/T363894

@WolfgangFahl
Collaborator Author

Check out http://snapquery.bitplan.com/query/scholia/author_list-of-publications
with Q80 (Tim Berners-Lee) to get:
[image: grafik]

http://snapquery.bitplan.com has the demo and project is at https://github.com/WolfgangFahl/snapquery with further links to the Hackathon results - thanks to Tim and Dennis for making this happen!

@fnielsen
Collaborator

fnielsen commented May 6, 2024

> Check out http://snapquery.bitplan.com/query/scholia/author_list-of-publications with Q80 - Tim Berners-Lee to get grafik
>
> http://snapquery.bitplan.com has the demo and project is at https://github.com/WolfgangFahl/snapquery with further links to the Hackathon results - thanks to Tim and Dennis for making this happen!

I get TimeoutError: No connection after 3.0 seconds

@WolfgangFahl
Collaborator Author

@fnielsen there is another server at https://snapquery.wikidata.dbis.rwth-aachen.de/query/scholia/author_list-of-publications which might work. A socket connection is created which might not work behind firewalls or on internet connections with high latency.

@WolfgangFahl
Collaborator Author

Version 0.0.8 of snapquery is ready. It offers, for example:
http://snapquery.bitplan.com/api/meta_query/params_stats.github

params_stats

query

SELECT count(*),
    params 
FROM "QueryDetails" 
GROUP BY params 
ORDER BY 1 desc

result

| count(*) | params |
| --- | --- |
| 374 | |
| 293 | q |
| 14 | q1,q2 |
| 9 | q,q |
| 3 | q,q,q |
| 3 | p |
| 1 | q,q2 |
| 1 | q,q,q,q,q |
| 1 | q,doi,q,doi,q,doi,q,doi,q,doi |
| 1 | lexeme |
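
The params_stats query above can be tried locally with a small sketch like the one below. It assumes a table "QueryDetails" with a `params` column holding the comma-separated parameter names; the `query_id` column, the `params_stats` helper and the toy rows are invented for this example, and SQLite is used purely for illustration.

```python
import sqlite3


def params_stats(rows):
    """Count how many queries share each parameter signature."""
    conn = sqlite3.connect(":memory:")
    # Assumed minimal schema; the real table most likely has more columns.
    conn.execute('CREATE TABLE "QueryDetails" (query_id TEXT, params TEXT)')
    conn.executemany('INSERT INTO "QueryDetails" VALUES (?, ?)', rows)
    stats = conn.execute(
        'SELECT count(*), params FROM "QueryDetails" '
        "GROUP BY params ORDER BY 1 DESC"
    ).fetchall()
    conn.close()
    return stats


# Toy rows: three queries taking "q", two taking "q1,q2", one without parameters.
toy_rows = [
    ("a", "q"), ("b", "q"), ("c", "q"),
    ("d", "q1,q2"), ("e", "q1,q2"),
    ("f", ""),
]
print(params_stats(toy_rows))
```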

3 participants