COVID-19 Literature Analysis using Machine Learning and Deep Learning

Introduction

The coronavirus pandemic posed enormous health, economic, environmental, and social challenges to the entire human population. The research community worked tirelessly toward a vaccine, but could we help speed up these efforts even more?

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups prepared a COVID-19 Open Research Dataset (CORD-19). It is a resource of over 1 million scholarly articles, including over 400,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset was provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.

This project aims to help researchers navigate this fast-growing body of coronavirus literature to efficiently find relevant and up-to-date information. This is done by using various topic modeling algorithms to cluster similar papers together. We leverage Hadoop for data storage management and PySpark for building ML and DL pipelines.

Dataset Description

The dataset consists of JSON and CSV files. Each paper is saved in a nested JSON file, while some additional metadata is available in a CSV file. A detailed description is available here. The image below summarizes the data preprocessing pipeline.
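
As a rough illustration of what flattening one nested paper file into a single structured row might look like, here is a minimal sketch. The field names (paper_id, metadata, authors, abstract, body_text) are assumptions about the JSON layout, the sample record is hand-written for the example, and flatten_paper is a hypothetical helper, not the code in cord19-parser.py.

```python
import json


def flatten_paper(paper: dict) -> dict:
    """Flatten one nested paper record into a single flat row (assumed schema)."""
    metadata = paper.get("metadata", {})
    # Build "First Last" strings, skipping name parts that are missing.
    authors = [
        " ".join(part for part in (a.get("first", ""), a.get("last", "")) if part)
        for a in metadata.get("authors", [])
    ]
    # Abstract and body are assumed to be lists of paragraph objects; join their text.
    abstract = " ".join(p.get("text", "") for p in paper.get("abstract", []))
    body = " ".join(p.get("text", "") for p in paper.get("body_text", []))
    return {
        "paper_id": paper.get("paper_id"),
        "title": metadata.get("title", ""),
        "authors": "; ".join(authors),
        "abstract": abstract,
        "body_text": body,
    }


# Hypothetical record, written by hand only to exercise the function.
sample_json = """
{
  "paper_id": "abc123",
  "metadata": {
    "title": "A hypothetical paper",
    "authors": [
      {"first": "Jane", "last": "Doe"},
      {"first": "John", "last": "Roe"}
    ]
  },
  "abstract": [{"text": "Short abstract.", "section": "Abstract"}],
  "body_text": [
    {"text": "First paragraph.", "section": "Introduction"},
    {"text": "Second paragraph.", "section": "Methods"}
  ]
}
"""

row = flatten_paper(json.loads(sample_json))
print(row["title"], "|", row["authors"])
```

Each row produced this way could then be written out with csv.DictWriter to obtain one CSV line per paper.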

Methodology

Graph Database

Graph databases provide a way to generate and visualize relationships between entities.
Both PySpark GraphFrames and Neo4j support graph-based data storage. We explored both tools.
Each author, paper, and journal acts as a node.
Nodes are connected by one of two relationships: “has_published” or “has_paper”.
The data was prepared with Python so that it could be imported into Neo4j.
Docker was used to install Neo4j (version 5.2.0).
A Bash script (start_neo4j.sh) starts the Docker container and the Neo4j server, then imports the data.
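
The node and relationship model described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the repository's preparation code: the edge directions (author to paper for “has_published”, journal to paper for “has_paper”), the id prefixes, the CSV column names, and the sample records are all hypothetical, and the column names are not the header format of any particular Neo4j import tool.

```python
import csv
import io


def build_graph(papers):
    """Return (nodes, edges) linking authors, papers, and journals."""
    nodes = {}     # node id -> (label, display name); the dict removes duplicates
    edges = set()  # (start id, relationship type, end id)
    for paper in papers:
        paper_id = f"paper:{paper['paper_id']}"
        nodes[paper_id] = ("Paper", paper["title"])
        for author in paper["authors"]:
            author_id = f"author:{author}"
            nodes[author_id] = ("Author", author)
            # Assumed direction: the author has published the paper.
            edges.add((author_id, "has_published", paper_id))
        journal = paper.get("journal")
        if journal:
            journal_id = f"journal:{journal}"
            nodes[journal_id] = ("Journal", journal)
            # Assumed direction: the journal has the paper.
            edges.add((journal_id, "has_paper", paper_id))
    return nodes, sorted(edges)


def edges_to_csv(edges):
    """Serialize edges to CSV text with hypothetical column names."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start_id", "type", "end_id"])
    writer.writerows(edges)
    return buffer.getvalue()


# Hypothetical records, written by hand only for this example.
papers = [
    {"paper_id": "p1", "title": "Paper one",
     "authors": ["Jane Doe", "John Roe"], "journal": "Journal A"},
    {"paper_id": "p2", "title": "Paper two",
     "authors": ["Jane Doe"], "journal": "Journal A"},
]

nodes, edges = build_graph(papers)
edge_csv = edges_to_csv(edges)
print(edge_csv)
```

Keying nodes by a prefixed id means an author or journal that appears on several papers becomes a single node with several relationships, which is what makes the graph useful for exploring co-authorship and journal coverage.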

Results

Below are a few sample results of topic modeling.

Topic 1 seems to be concerned with immune response and antibodies.
Topic 2 seems to cover the effects of the pandemic on society, mental health (stress, anxiety), and the work environment (behavior, support).
Topic 3 papers could be related to infection detection, antibody sequencing, and the virus itself.

Folder Structure

covid19-literature-analysis
  |
  |--- data_prep: Code for preprocessing the raw data
         |--- cord19-parser.py: A Python parser that converts the raw data into a structured CSV file
         |--- Data-Preprocessing.ipynb: The same data parser, implemented with PySpark
  |--- data_viz: Some visualizations to understand the data better
  |--- graph_db: Post-project exploratory work to store and represent data using Neo4j and PySpark GraphFrames
  |--- images: README file images
  |--- modeling: Modeling work
  |--- ppt: Contains a presentation describing the whole project
