MultiLabel-Paper-Task-Classifier

A text classification model built from scratch, covering data collection, model training, and deployment. The model can tag papers with any of 258 tasks, listed here.

Data Collection

Data extraction began by retrieving information from Papers with Code in two steps. The first step collected paper URLs with paper_url.ipynb, producing the paper_urls dataset of paper titles and their corresponding links.

Next, each URL in the dataset was visited with url_details.ipynb to extract the paper's abstract and its associated tasks, forming the primary dataset.

In total, 26,778 paper details were scraped.
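The actual scraping code lives in the two notebooks; as a rough illustration of the two-step approach, here is a minimal sketch using requests and BeautifulSoup. The listing URL, CSS selectors, and column names are assumptions, not the exact ones used in paper_url.ipynb / url_details.ipynb.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE = "https://paperswithcode.com"

# Step 1: collect paper titles and URLs from the listing pages.
def collect_paper_urls(pages=5):
    rows = []
    for page in range(1, pages + 1):
        html = requests.get(f"{BASE}/latest", params={"page": page}).text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.select("h1 a"):  # assumed selector for paper titles
            rows.append({"title": link.get_text(strip=True),
                         "url": BASE + link["href"]})
    return pd.DataFrame(rows)

# Step 2: visit each paper page and pull the abstract and task tags.
def collect_details(paper_urls):
    details = []
    for url in paper_urls["url"]:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        abstract = soup.select_one("div.paper-abstract p")  # assumed selector
        tasks = [t.get_text(strip=True)
                 for t in soup.select("a.badge")]           # assumed selector
        details.append({"url": url,
                        "abstract": abstract.get_text(strip=True) if abstract else "",
                        "tasks": tasks})
    return pd.DataFrame(details)
```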

Data Pre-processing

The scraped data initially contained 2,397 distinct task labels. After a closer look, I filtered out 2,139 of them, leaving 258 tasks. I then dropped the abstracts left without any task, which reduced the dataset to 26,628 samples.
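As a rough sketch of this filtering step (the file names, column names, and task-list delimiter below are assumptions, not the repo's actual code):

```python
import pandas as pd

# Hypothetical inputs: scraped papers with an "abstract" column and a
# pipe-delimited "tasks" column, plus the curated list of 258 kept tasks.
df = pd.read_csv("papers.csv")
df["tasks"] = df["tasks"].str.split("|")
valid_tasks = set(pd.read_csv("valid_tasks.csv")["task"])

# Keep only the 258 valid tasks on each paper, then drop papers left with none.
df["tasks"] = df["tasks"].apply(lambda ts: [t for t in ts if t in valid_tasks])
df = df[df["tasks"].str.len() > 0].reset_index(drop=True)
```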

Model Training

I fine-tuned a distilroberta-base model from Hugging Face Transformers using fastai and Blurr. You can check out the model-training notebook here.
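The repo trains with fastai and Blurr; since Blurr's API has shifted across versions, here is an equivalent multi-label fine-tuning sketch using the plain Hugging Face Trainer instead. The `df` columns and hyperparameters are assumptions, and this is not the author's notebook code.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

NUM_LABELS = 258
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid + BCE loss, one output per task
)

def tokenize(batch):
    return tokenizer(batch["abstract"], truncation=True)

# `df` is assumed to hold an "abstract" column and a float32 multi-hot "labels" column.
ds = Dataset.from_pandas(df).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```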

Model Compression and ONNX Inference

The trained model weighs in at over 314 MB on disk. I compressed it with ONNX quantization, reducing its size to under 83 MB.
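A minimal sketch of dynamic quantization with onnxruntime (the file paths are placeholders; the preceding export to ONNX can be done with torch.onnx.export or Hugging Face Optimum):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamically quantize the exported ONNX model's weights to int8,
# shrinking the file roughly 4x with little accuracy loss.
quantize_dynamic(
    model_input="model.onnx",             # placeholder: exported fp32 model
    model_output="model-quantized.onnx",  # placeholder: quantized output
    weight_type=QuantType.QInt8,
)
```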

Model Deployment

The compressed model is deployed as a Gradio app on Hugging Face Spaces. The implementation can be found in the deployment folder or here.
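A minimal sketch of what such a Gradio app can look like when serving the quantized ONNX model (the model path, label file, and interface layout are assumptions, not the exact deployment code):

```python
import json
import numpy as np
import onnxruntime as ort
import gradio as gr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
session = ort.InferenceSession("model-quantized.onnx")  # placeholder path
labels = json.load(open("labels.json"))                 # placeholder: the 258 task names

def classify(abstract: str) -> dict:
    enc = tokenizer(abstract, truncation=True, return_tensors="np")
    logits = session.run(None, dict(enc))[0][0]
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: independent probability per task
    return {label: float(p) for label, p in zip(labels, probs)}

gr.Interface(
    fn=classify,
    inputs=gr.Textbox(lines=8, label="Paper abstract"),
    outputs=gr.Label(num_top_classes=5),  # show the five most likely tasks
).launch()
```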


Web Deployment

I also deployed a Flask app that lets users submit an abstract as input and get the predicted tasks as output. The code lives in the flask branch, and the website is live here.
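A minimal sketch of such a Flask endpoint, reusing a classify() helper like the one in the Gradio sketch above (the route name, form field, and 0.5 decision threshold are assumptions):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    abstract = request.form.get("abstract", "")
    scores = classify(abstract)  # ONNX inference helper from the Gradio sketch
    tasks = [t for t, p in scores.items() if p > 0.5]  # assumed threshold
    return jsonify({"tasks": tasks})

if __name__ == "__main__":
    app.run()
```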