
What is openVirus? (OpenPublishingTalk)

petermr edited this page May 26, 2020 · 7 revisions

What is openVirus?

An 8-minute presentation for https://openpublishingfest.org/calendar.html#event-178 on 2020-05-27.

Origins and motivation

openVirus was sparked by the realisation, around 2020-02, that there was no simple way for citizens to find scientific information on the COVID-19 epidemic. A group of activists released about 5000 papers from SciHub [1]. Possibly in response, a number of closed-access publishers released a few thousand articles into the "CORD-19" database. This was restricted to viruses or COVID-19. (It has since developed into a larger user community.) The content was largely JSON.

We were working on OpenClimateKnowledge (OCK), a project for citizens to extract knowledge from the distributed scientific literature. When COVID-19 hit, we decided to use the same technology to tackle viral epidemics.

We felt that a very narrow section of the scientific literature, selected by commercial publishers, was a minimal response. With simple searches we found that 60-90% of the literature was still closed for topics such as aerosols, masks, ventilators, social distancing, legal issues and many others. Citizens are confined to:

  • topics selected by publishers
  • sources of content restricted by current systems

openVirus was created as a citizen volunteer community to build tools and sources so that citizens can ask their own questions of their own sources.

Principles

  • to welcome Open content (free to use, re-use and re-distribute)
  • to create a single point of entry for searching the Open Literature
  • to provide a toolset that citizens could download, modify and use
  • to create a Wikidata-based query, using simple dictionaries that citizens can create and modify
  • to create an atmosphere where a community can grow
  • to emphasize global reach, such as multilinguality and Global South publications
  • to use the most appropriate Open solutions. Collaborate not compete.

Strategy

The work is largely carried out by users on their own machines.

Many existing resources are server-centric and offer little opportunity for systematic download.

  • build scrapers or API query tools for Openly readable sources
  • run users' questions as queries or scrapes
  • download raw content (PDF, HTML, images), from 10 to 10,000 articles
  • clean and semantify
  • annotate with dictionaries
  • expose, analyze, display
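
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the openVirus implementation: the function names (`build_search_url`, `clean_html`, `annotate`) are hypothetical, only the Europe PMC REST search endpoint is a real service, and the dictionary identifiers are placeholders rather than real Wikidata IDs. Nothing is downloaded; the query stage only builds the URL that a scraper would fetch.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlencode

# Europe PMC REST search endpoint; other sources would need their own builder.
EUROPEPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(question: str, page_size: int = 10) -> str:
    """Query stage: turn a user's question into a search URL (no fetching)."""
    params = {"query": question, "format": "json", "pageSize": page_size}
    return EUROPEPMC_SEARCH + "?" + urlencode(params)


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def clean_html(raw_html: str) -> str:
    """Clean stage: reduce raw HTML to whitespace-normalised plain text."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def annotate(text: str, dictionary: dict) -> Counter:
    """Annotate stage: count case-insensitive whole-word hits per term."""
    counts = Counter()
    for term, identifier in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[(term, identifier)] = hits
    return counts


if __name__ == "__main__":
    # Placeholder identifiers; a real dictionary would map terms to Wikidata IDs.
    dictionary = {"masks": "ID-1", "aerosol": "ID-2", "ventilator": "ID-3"}
    print(build_search_url("masks AND aerosols"))
    text = clean_html("<p>Face masks reduce <b>aerosol</b> spread.</p>")
    print(annotate(text, dictionary))
```

Each stage is a plain function so that any one of them can be swapped for a different tool without touching the others.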

Sources

  • EuropePMC
  • biorxiv and medrxiv
  • DOAJ
  • EThOS
  • Redalyc (MX)

Toolkit

Any tool can be included as long as it can communicate through files on local storage in our CProject format. The following list is not exhaustive.

  • framework: ami + CProject data
  • scrapers: getpapers, Ferret, curl, scrapy
  • cleaners: PDFBox, Tidy/Jsoup, Grobid, etc.
  • transformers: xml2html, ami ocr, KNIME
  • dictionaries: ami dictionary
  • indexing and annotation: Solr, ami
  • analysis and display: R, KNIME

The central philosophy is a defined, semantic, universal data structure, CProject. The tools around it can be varied or swapped.
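
To illustrate how tools can be swapped when they only share files on local storage, here is a small Python sketch. It assumes a project directory with one subdirectory per article. The file names `fulltext.txt` and `results.json` and both function names are hypothetical choices for this example, not the actual CProject specification.

```python
import json
from pathlib import Path


def write_article(project: Path, article_id: str, text: str) -> Path:
    """'Downloader' stage: store one article in its own subdirectory."""
    article_dir = project / article_id
    article_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical file name, chosen for this sketch only.
    (article_dir / "fulltext.txt").write_text(text, encoding="utf-8")
    return article_dir


def count_words(project: Path) -> int:
    """'Analysis' stage: read what an earlier stage wrote, add results.json.

    Returns the number of article directories processed.
    """
    processed = 0
    for article_dir in sorted(p for p in project.iterdir() if p.is_dir()):
        source = article_dir / "fulltext.txt"
        if not source.exists():
            continue  # this article has not reached the expected stage yet
        words = len(source.read_text(encoding="utf-8").split())
        (article_dir / "results.json").write_text(
            json.dumps({"word_count": words}), encoding="utf-8"
        )
        processed += 1
    return processed
```

The two functions never call each other; the only contract between them is the directory layout. That is the property that lets, say, a Java cleaner and an R analysis work on the same project.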

Contributors

  • Remko Popma
  • Lezan Hawizy, Tim Voronov
  • Andy Jackson
  • Clyde Davies
  • Thomas Shafee
  • Priya JK, Kareena Singh
  • Simon Worthington (check omissions)

End product

  • toolkit
  • dictionaries
  • tutorials
  • citizen openVirus downloadable or boxed

====

[1] Bender, Maddie (3 February 2020). "'It's a Moral Imperative:' Archivists Made a Directory of 5,000 Coronavirus Studies to Bypass Paywalls". Vice. https://www.vice.com/en_us/article/z3b3v5/archivists-are-bypassing-paywalls-to-share-studies-about-coronaviruses

[CORD-19] https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
