nasa-jpl-cord-19/covid19-knowledge-graph

COVID-19 Research Knowledge Graph

Builds a knowledge graph from the COVID-19 Open Research Dataset (CORD-19). As of 2020-03-18 it has been run against the Commercial use subset (includes PMC content): 9000 papers, 186 MB.

This project is written in Scala; you need sbt to build and run it.

Prerequisites

  • Install sbt
  • Download the Commercial use subset and extract it to some local directory
  • Clone dair-iitd/OpenIE-standalone and follow the build instructions
    • git clone https://github.com/dair-iitd/OpenIE-standalone.git && cd OpenIE-standalone
    • sbt -J-Xmx10000M clean compile assembly
    • java -Xmx10g -XX:+UseConcMarkSweepGC -jar target/scala-2.10/openie-assembly-5.0-SNAPSHOT.jar --httpPort 8000
    • To get an extraction from the server, send a POST request to the /getExtraction endpoint with the sentence in the body of the HTTP request. An example curl request:
      curl -X POST http://localhost:8000/getExtraction -d "The Jet Propulsion Laboratory is a federally funded research and development center and NASA field center in the city of La Canada Flintridge with a Pasadena mailing address, within the state of California, United States."
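The same request can be issued from a script. The following is a minimal Python sketch, assuming the OpenIE server is listening on localhost:8000 as started above. The function names build_extraction_request and get_extraction are illustrative and not part of this project or of OpenIE. The response body is returned as raw text because its format is not described here.

```python
import urllib.request

OPENIE_URL = "http://localhost:8000"  # matches the --httpPort 8000 flag used above


def build_extraction_request(sentence: str,
                             base_url: str = OPENIE_URL) -> urllib.request.Request:
    """Build a POST to /getExtraction with the sentence as the request body."""
    return urllib.request.Request(
        url=f"{base_url}/getExtraction",
        data=sentence.encode("utf-8"),
        method="POST",
    )


def get_extraction(sentence: str,
                   base_url: str = OPENIE_URL,
                   timeout: float = 60.0) -> str:
    """Send the sentence to the server and return the raw response body as text."""
    request = build_extraction_request(sentence, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, get_extraction("The Jet Propulsion Laboratory is a NASA field center.") returns whatever the endpoint sends back for that sentence. Building the request is kept separate from sending it so the URL and body can be inspected without a live server.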

Installation

Back in this repository's directory, compile the project:

$ sbt compile

Running

From sbt

Launch sbt:

$ sbt

Run the program with two arguments: the directory containing the individual CORD-19 files, then the directory containing the individual annie extraction files:

> run /path/to/directory/containing/individual/CORD-19_files /path/to/directory/containing/individual/annie_extraction_files

As a standalone JAR

First assemble the JAR:

$ sbt assembly

Then run the JAR with java:

$ java -jar ./target/scala-2.13/covid19_knowledge_graph-assembly-0.1.0-SNAPSHOT.jar

Output

When the program finishes (this may take some time depending on how much memory your machine has), you will find a newly written file called covid19_knowledge_graph.ttl. This file can be loaded into Apache Jena's Fuseki server (or any other SPARQL server that accepts RDF graphs in Turtle (TTL) format).
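One way to load the file is over HTTP with the SPARQL Graph Store Protocol. The sketch below is an assumption-laden example, not this project's tooling. It assumes Fuseki is running on its default port 3030, that a dataset already exists under the hypothetical name covid19, and that the dataset exposes its read-write graph store at the data endpoint, which is Fuseki's default naming. The function names are illustrative.

```python
import urllib.request
from pathlib import Path

FUSEKI_URL = "http://localhost:3030"  # Fuseki's default port
DATASET = "covid19"                   # hypothetical dataset name; create it in Fuseki first


def build_upload_request(turtle: bytes,
                         dataset: str = DATASET,
                         base_url: str = FUSEKI_URL) -> urllib.request.Request:
    """Build a Graph Store Protocol POST that adds Turtle triples to the default graph."""
    return urllib.request.Request(
        url=f"{base_url}/{dataset}/data?default",
        data=turtle,
        headers={"Content-Type": "text/turtle; charset=utf-8"},
        method="POST",
    )


def upload_turtle(path: str = "covid19_knowledge_graph.ttl",
                  dataset: str = DATASET,
                  base_url: str = FUSEKI_URL) -> int:
    """Read the TTL file into memory, send it to Fuseki, and return the HTTP status."""
    request = build_upload_request(Path(path).read_bytes(), dataset, base_url)
    with urllib.request.urlopen(request) as response:
        return response.status
```

This reads the whole file into memory before sending, which keeps the sketch short. For a very large graph, Fuseki's own upload tools may be a better fit.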

Querying Data

Once the data is loaded into Fuseki, you can use Jena's full text search, which combines SPARQL with full text search backed by Lucene or Elasticsearch (itself built on Lucene). This lets applications perform indexed full text searches within SPARQL queries.
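As an illustration, the sketch below builds a SPARQL query that uses Jena's text:query property function and sends it to Fuseki's query endpoint. Several things here are assumptions: the dataset name covid19 is hypothetical, the dataset must have been configured with a Jena text index, and rdfs:label stands in for whichever predicate that index actually covers, since the predicates used in the generated graph are not listed here. The function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

FUSEKI_URL = "http://localhost:3030"  # Fuseki's default port
DATASET = "covid19"                   # hypothetical dataset name


def build_text_query(term: str, limit: int = 10) -> str:
    """Build a SPARQL query that runs a full text search for `term`."""
    # Escape backslashes and double quotes so the term is a valid SPARQL string literal.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX text: <http://jena.apache.org/text#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?s ?label WHERE {\n"
        # rdfs:label is an assumed indexed predicate; replace it with the real one.
        f'  ?s text:query (rdfs:label "{escaped}" {limit}) ;\n'
        "     rdfs:label ?label .\n"
        "}"
    )


def build_query_request(query: str,
                        dataset: str = DATASET,
                        base_url: str = FUSEKI_URL) -> urllib.request.Request:
    """Build a SPARQL Protocol POST with the query in a URL-encoded form body."""
    return urllib.request.Request(
        url=f"{base_url}/{dataset}/query",
        data=urllib.parse.urlencode({"query": query}).encode("ascii"),
        headers={"Accept": "application/sparql-results+json"},
        method="POST",
    )


def search(term: str, limit: int = 10) -> list:
    """Run the search and return (subject, label) pairs from the JSON results."""
    request = build_query_request(build_text_query(term, limit))
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    return [(b["s"]["value"], b["label"]["value"])
            for b in results["results"]["bindings"]]
```

The search term is passed to the text index as a query string, so Lucene query syntax (for example quoted phrases) applies to it. The escaping above only protects the SPARQL string literal.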

Contact

Dr. Lewis John McGibbney Ph.D., B.Sc.(Hons)

Enterprise Search Technologist

Web and Mobile Application Development Group (172B)

Application, Consulting, Development and Engineering Section (1722)

Info & Engineering Technology Planning and Development Division (1720)

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 600-172A

Tel: (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax: (+1) (818)-393-1190

Email: lewis.j.mcgibbney@jpl.nasa.gov

ORCID: orcid.org/0000-0003-2185-928X