kmnis/covid19-literature-analysis


COVID-19 Literature Analysis using Machine Learning and Deep Learning

Table of contents

  1. Introduction
  2. Dataset Description
  3. Methodology
  4. Graph Database
  5. Results
  6. Folder Structure

Introduction

The coronavirus pandemic posed enormous health, economic, environmental, and social challenges to the entire human population. The research community worked tirelessly toward a vaccine, but could we help speed up these efforts even more?

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups prepared the COVID-19 Open Research Dataset (CORD-19). It is a resource of over 1 million scholarly articles, including over 400,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset was provided to the global research community so that recent advances in natural language processing and other AI techniques could be applied to generate new insights in support of the ongoing fight against this infectious disease.

This project aims to help researchers navigate this fast-growing body of coronavirus literature and efficiently find relevant, up-to-date information. It does so by using several topic modeling algorithms to cluster similar papers together. We use Hadoop for data storage management and PySpark for building the ML and DL pipelines.

Dataset Description

The dataset consists of JSON and CSV files. Each paper is saved in a nested JSON file, while some additional metadata is available in a CSV file. A detailed description is available here. The image below summarizes the data preprocessing pipeline.

[Image: Data Preprocessing]
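As a rough illustration of turning one nested JSON paper into a flat row ready for a CSV file, here is a minimal sketch. It is not the code from `cord19-parser.py`. The field names (`paper_id`, `metadata.title`, `metadata.authors`, `abstract`, `body_text`) are assumptions based on the public CORD-19 full-text layout, and `flatten_paper` and the sample record are hypothetical.

```python
import json


def flatten_paper(raw: dict) -> dict:
    """Flatten one nested paper record into a single flat row.

    Assumes a CORD-19-style layout; missing fields fall back to empty strings.
    """
    metadata = raw.get("metadata", {})
    # Join each author's first and last name, skipping empty parts.
    authors = [
        " ".join(part for part in (a.get("first", ""), a.get("last", "")) if part)
        for a in metadata.get("authors", [])
    ]
    return {
        "paper_id": raw.get("paper_id", ""),
        "title": metadata.get("title", ""),
        "authors": "; ".join(authors),
        # Abstract and body are lists of paragraph objects with a "text" key.
        "abstract": " ".join(p.get("text", "") for p in raw.get("abstract", [])),
        "body_text": " ".join(p.get("text", "") for p in raw.get("body_text", [])),
    }


# Hypothetical sample record, made up for illustration only.
sample = json.loads("""
{
  "paper_id": "abc123",
  "metadata": {
    "title": "An example title",
    "authors": [{"first": "Jane", "last": "Doe"}, {"first": "", "last": "Roe"}]
  },
  "abstract": [{"text": "Short abstract."}],
  "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}]
}
""")

row = flatten_paper(sample)
print(row)
```

Rows produced this way could be written out with `csv.DictWriter`, or the same logic could be expressed as PySpark column expressions for the distributed variant.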

Methodology

[Image: Methodology]

Graph Database

  • Graph databases provide a way to generate and visualize relationships between entities
  • Both PySpark GraphFrames and neo4j support graph-based data storage; we explored both tools
  • Each author, paper, and journal acts as a node
  • Nodes are connected by one of two relationships: “has_published” or “has_paper”
  • The data was prepared in Python to make it ready for import into neo4j
  • neo4j (version 5.2.0) was installed using Docker
  • A Bash script (start_neo4j.sh) starts the Docker container and the neo4j server, and imports the data
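The data preparation step described above can be sketched as follows. This is a minimal illustration and not the project's actual script. The input records, the `build_graph` helper, the node ID scheme, and the CSV column names are all hypothetical. The edge directions (author to paper for “has_published”, journal to paper for “has_paper”) are also an assumption, since the text does not state them.

```python
import csv
import io

# Hypothetical input records, made up for illustration only.
papers = [
    {"paper_id": "p1", "journal": "Journal A", "authors": ["Author X", "Author Y"]},
    {"paper_id": "p2", "journal": "Journal A", "authors": ["Author X"]},
]


def build_graph(papers):
    """Return (nodes, edges) with one node per distinct author, paper, and journal."""
    nodes = {}     # (label, name) -> node id
    edges = set()  # (start_id, relationship, end_id)

    def node_id(label, name):
        # Reuse the id if this entity was already seen, otherwise create one.
        key = (label, name)
        if key not in nodes:
            nodes[key] = f"{label.lower()}_{len(nodes)}"
        return nodes[key]

    for paper in papers:
        pid = node_id("Paper", paper["paper_id"])
        jid = node_id("Journal", paper["journal"])
        edges.add((jid, "has_paper", pid))          # assumed direction
        for author in paper["authors"]:
            aid = node_id("Author", author)
            edges.add((aid, "has_published", pid))  # assumed direction
    return nodes, sorted(edges)


nodes, edges = build_graph(papers)

# Write both tables as CSV text; the column names here are placeholders.
nodes_csv = io.StringIO()
writer = csv.writer(nodes_csv)
writer.writerow(["id", "label", "name"])
for (label, name), nid in nodes.items():
    writer.writerow([nid, label, name])

edges_csv = io.StringIO()
writer = csv.writer(edges_csv)
writer.writerow(["start_id", "type", "end_id"])
writer.writerows(edges)

print(nodes_csv.getvalue())
print(edges_csv.getvalue())
```

Deduplicating nodes before export keeps a shared author or journal as a single node, so papers that share one end up connected in the graph. The exact header format required depends on the neo4j import method used.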

[Image: sample-graph]

[Image: final-graph]

Results

Below are a few sample results from topic modeling.

[Images: Topic 1, Topic 2, Topic 3]

  • Topic 1 seems to be concerned with immune response and antibodies
  • Topic 2 seems to cover the effects of the pandemic on society, mental health (stress, anxiety), and the work environment (behavior, support)
  • Topic 3 papers could be related to infection detection, antibody sequencing, and the virus itself

Folder Structure

covid19-literature-analysis
  |
  |--- data_prep: Code for preprocessing the raw data
         |--- cord19-parser.py: A Python parser that converts the raw data into a structured CSV file
         |--- Data-Preprocessing.ipynb: The same data parser, implemented with PySpark
  |--- data_viz: Some visualizations to understand the data better
  |--- graph_db: Post-project exploratory work to store and represent data using neo4j and PySpark GraphFrames
  |--- images: Images used in the README
  |--- modeling: Modeling work
  |--- ppt: Contains a presentation describing the whole project
