
COVID-19 Literature Analysis and Summarization Platform



CDAC Hackathon

COVID-19 LITERATURE ANALYSIS AND SUMMARIZATION PLATFORM

Steps to run the project:

  • Requirements:

    1. python: version 3.8.x
    2. yarn: version 1.22.x
    3. node: version 12.16.x
    4. pip and virtualenv.
  • Steps:

    1. Clone the repository using
      git clone https://github.com/tanmaypardeshi/CDAC-Hackathon.git
    2. Download the glove folder from the Google Drive link provided above and save it in the project directory.
    3. Download all the other CSV and JSON files from the Google Drive link and store them in the data folder of the project directory.
    4. Run virtualenv venv in the project directory to create a virtual environment.
    5. Run source venv/bin/activate to activate the virtual environment.
    6. Run pip install -r requirements.txt in the project directory to install all Python dependencies. This is only needed the first time.
    7. Navigate to the frontend folder and run yarn install the first time to install all JavaScript dependencies for React.
    8. Run python run.py in the project directory to start the Flask server.
    9. With the Flask server still running, navigate to the frontend folder and run yarn start to start the development server and use the platform.
    10. Run deactivate to deactivate the virtual environment when done.
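For quick reference, the steps above can be condensed into a single shell session (assuming a Unix-like shell; the glove folder and the data files from the Google Drive link still have to be placed manually):

```shell
# Clone the repository and enter it
git clone https://github.com/tanmaypardeshi/CDAC-Hackathon.git
cd CDAC-Hackathon

# (Manually place the downloaded glove folder and data files here first.)

# Create and activate a virtual environment, then install Python dependencies
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

# Terminal 1: start the Flask backend
python run.py

# Terminal 2: install JavaScript dependencies and start the React dev server
cd frontend
yarn install
yarn start

# When finished, leave the virtual environment
deactivate
```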

Documentation about the files in the repository


1. glove: Embeddings used to perform text summarization and information retrieval for Real Time Research News.

2. summariser.py: Makes use of the TextRank algorithm to summarize the input biomedical text.
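The idea behind TextRank can be sketched in pure Python: split the text into sentences, build a word-overlap similarity graph, run a PageRank-style power iteration, and keep the top-ranked sentences. This is a minimal illustrative version, not the exact implementation in summariser.py (which also uses the glove embeddings):

```python
import re
from math import log

def sentence_similarity(s1, s2):
    # Word-overlap similarity normalized by sentence lengths (TextRank-style)
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    denom = log(len(w1) + 1) + log(len(w2) + 1)
    return len(w1 & w2) / denom if denom else 0.0

def textrank_summary(text, n=1, d=0.85, iterations=50):
    # Split into sentences, build a similarity graph, and run a PageRank-style
    # power iteration; return the n highest-scoring sentences in document order.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    k = len(sentences)
    if k <= n:
        return sentences
    sim = [[sentence_similarity(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    scores = [1.0] * k
    for _ in range(iterations):
        new_scores = []
        for i in range(k):
            rank = 0.0
            for j in range(k):
                out_weight = sum(sim[j])
                if sim[j][i] > 0 and out_weight > 0:
                    rank += d * sim[j][i] / out_weight * scores[j]
            new_scores.append((1 - d) + rank)
        scores = new_scores
    top = sorted(range(k), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```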

3. ir_author.py: Makes use of Levenshtein distance to generate a similarity score between an author-based query and the documents.

4. ir_title.py: Makes use of Levenshtein distance and keyword indexing to generate a similarity score between a title-based query and the documents.

5. ir_optimised.py: Makes use of Levenshtein distance and keyword indexing, along with a keywords pickle file, to generate a similarity score between an author-based query and the documents.
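The Levenshtein-based scoring shared by the ir_*.py scripts can be sketched as a classic dynamic-programming edit distance plus a length-normalized similarity in [0, 1]. This is a minimal illustration; the exact normalization and keyword indexing used in the repository may differ:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, using two rows of the DP table
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def field_similarity(query, field):
    # Normalized similarity: 1.0 for identical strings, 0.0 for fully different
    q, f = query.lower().strip(), field.lower().strip()
    if not q and not f:
        return 1.0
    return 1.0 - levenshtein(q, f) / max(len(q), len(f))
```

A query would then be scored against the relevant field (author or title) of each document and the results sorted by descending similarity.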

6. news.py: Makes use of scraping techniques to retrieve unstructured COVID-19 research news from the internet and uses information retrieval to display results relevant to a query.
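The query-relevance step can be illustrated with a simple keyword-overlap (Jaccard) ranking over already-scraped headlines. This is an illustrative sketch only, not the scoring actually used in news.py:

```python
def rank_headlines(query, headlines, top_k=3):
    # Score each headline by Jaccard overlap between its words and the query words
    q = set(query.lower().split())

    def score(headline):
        w = set(headline.lower().split())
        union = q | w
        return len(q & w) / len(union) if union else 0.0

    return sorted(headlines, key=score, reverse=True)[:top_k]
```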

7. Q&A_CDQA_Finetuning.py: The script written to fine-tune BERT on a subset of the CORD-19 dataset.

8. Anomaly_detection.py: The script written to perform anomaly detection on a subset of the CORD-19 dataset.

9. qna.joblib: Trained model which predicts answers based on the question query.

10. ir_old.csv: Dataset created from CORD-19 data for Information Retrieval.

Research papers referred to while working on the project:


Snippets of the platform:

  • Welcome modal

1.png 2.png

  • News

3.png 4.png

  • Login and Signup

5.png 6.png

  • Summarization and My Summaries

10.png 11.png 16.png 14.png

  • Information Retrieval and My Bookmarks

8.png 9.png 7.png 15.png

  • Q & A and My Questions

19.png 20.png 13.png

  • Anomaly Detection Map

17.png 18.png