
Multilabel Task Classifier from Paper Abstract


From Paper With Tasks

A multi-label text classification model, built end to end: data collection, model training, and deployment.
The model can classify papers into 260 different task types.
The keys of json_files/task_types_encoded.json list these tasks.
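The task list can be pulled straight from that file (a minimal sketch; it assumes the JSON maps each task name to its encoded index):

```python
import json

# Keys of the mapping are the 260 task names; values are their encoded indices.
with open("json_files/task_types_encoded.json") as f:
    task_types = json.load(f)

print(len(task_types))        # expected: 260
print(list(task_types)[:5])   # a peek at the first few task names
```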

Data Collection

Data was collected from paperswithcode, from the categories below:

  1. Computer Vision

    • Convolutional Neural Networks
    • Generative Models
    • Image Model Blocks
    • Object Detection Models
    • Image Feature Extractors

  2. Natural Language Processing

    • Language Models
    • Transformers
    • Word Embeddings
    • Attention Patterns
    • Sentence Embeddings

  3. Reinforcement Learning

    • Policy Gradient Methods
    • Off-Policy TD Control
    • Reinforcement Learning Frameworks
    • Q-Learning Networks
    • Value Function Estimation

  4. Audio

    • Generative Audio Models
    • Audio Model Blocks
    • Text-to-Speech Models
    • Speech Separation Models
    • Speech Recognition

  5. Sequential

    • Recurrent Neural Networks
    • Sequence to Sequence Models
    • Time Series Analysis
    • Temporal Convolutions
    • Bidirectional Recurrent Neural Networks

  6. Graphs

    • Graph Models
    • Graph Embeddings
    • Graph Representation Learning
    • Graph Data Augmentation

The scripts I've used to scrape the data can be found in the scrapers directory.

In total, I scraped 34k+ paper abstracts along with other metadata.
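The gist of the approach looks roughly like this (a hedged sketch using requests and BeautifulSoup; the URL structure and CSS selectors are illustrative, the real logic lives in scrapers/):

```python
import requests
from bs4 import BeautifulSoup

def scrape_paper(paper_url: str) -> dict:
    """Fetch a paperswithcode paper page and extract title, abstract, and tasks."""
    html = requests.get(paper_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Selectors below are illustrative; the live page markup may differ.
    title = soup.find("h1").get_text(strip=True)
    abstract = soup.find("div", class_="paper-abstract").get_text(strip=True)
    tasks = [a.get_text(strip=True) for a in soup.select(".paper-tasks a")]
    return {"title": title, "abstract": abstract, "tasks": tasks}
```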

Data Processing

Initially there were 2186 different tasks in the dataset. After some analysis, I found that 1926 of them were rare (each appeared fewer than 30 times in the dataset). Removing those tasks brought the task count down to 260. I then dropped duplicate rows and abstracts that were left with no tasks at all. The resulting dataset contains a total of 16304 samples.

papersWithCode_data.csv is the dataset generated by the scraping; it can be found inside the csv_files directory.
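The filtering step boils down to something like this (a sketch with pandas; it assumes the tasks column stores comma-separated task names, and the actual column names may differ):

```python
import pandas as pd

df = pd.read_csv("csv_files/papersWithCode_data.csv").dropna(subset=["tasks"])

# Count every task occurrence across the multi-label column and keep
# only tasks that show up at least 30 times.
counts = df["tasks"].str.split(",").explode().str.strip().value_counts()
frequent = set(counts[counts >= 30].index)

# Strip rare tasks from each row's label list.
df["tasks"] = df["tasks"].str.split(",").apply(
    lambda ts: [t.strip() for t in ts if t.strip() in frequent]
)

# Drop duplicate abstracts and rows left without any task.
df = df.drop_duplicates(subset="abstract")
df = df[df["tasks"].map(len) > 0].reset_index(drop=True)
```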

Modeling

Finetuned a distilroberta-base model from HuggingFace Transformers using Fastai and Blurr. The model training notebook can be viewed here.
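In condensed form, the setup looks something like this (a sketch of Blurr's high-level Blearner API; the import path varies across Blurr versions, and the attribute names and hyperparameters here are assumptions, with the exact recipe in the notebook):

```python
import pandas as pd
from blurr.text.modeling.all import BlearnerForSequenceClassification

df = pd.read_csv("csv_files/papersWithCode_data.csv")  # processed dataset

# Build a fastai Learner around the HuggingFace checkpoint in one call.
learn = BlearnerForSequenceClassification.from_data(
    df,
    "distilroberta-base",
    text_attr="abstract",     # column holding the abstracts (assumed name)
    label_attr="tasks",       # multi-label target column (assumed name)
    dl_kwargs={"bs": 16},
)

learn.fit_one_cycle(3, lr_max=1e-3)  # illustrative schedule, not the trained one
```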

Also, check out the other notebooks in the notebooks directory.

Model Compression & ONNX Inference

The trained model takes up 400+ MB on disk. I compressed it using ONNX quantization and brought it under 85 MB.
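One common way to do this is onnxruntime's dynamic quantization (a sketch; the file paths are assumptions):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Post-training dynamic quantization: fp32 weights become int8,
# shrinking the exported model from 400+ MB to under 85 MB.
quantize_dynamic(
    model_input="models/model.onnx",         # exported fp32 model (assumed path)
    model_output="models/model-quant.onnx",  # quantized output (assumed path)
    weight_type=QuantType.QInt8,
)
```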

Deployment

The compressed model is deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in the deployment folder or here.
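At its core, the app wires the tokenizer, the ONNX session, and the label map into one predict function (a minimal sketch; file paths and ONNX input names are assumptions, and the full version is in the deployment folder):

```python
import json
import numpy as np
import gradio as gr
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
session = ort.InferenceSession("models/model-quant.onnx")   # assumed path
with open("json_files/task_types_encoded.json") as f:
    labels = list(json.load(f).keys())

def predict(abstract: str) -> dict:
    enc = tokenizer(abstract, truncation=True, return_tensors="np")
    # Input names assumed to match the exported graph.
    logits = session.run(None, {"input_ids": enc["input_ids"],
                                "attention_mask": enc["attention_mask"]})[0]
    probs = 1 / (1 + np.exp(-logits[0]))   # sigmoid: independent multi-label scores
    return {l: float(p) for l, p in zip(labels, probs) if p > 0.5}

gr.Interface(fn=predict, inputs="textbox", outputs="label").launch()
```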

Web Deployment

Deployed a Flask app that takes an abstract as input and shows the paper's tasks as output. Check the flask branch. The website is live here.
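The route itself is small (a bare-bones sketch; the template and form field names are assumptions, and predict() is the same inference helper as in the Gradio app above):

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    tasks = None
    if request.method == "POST":
        abstract = request.form.get("abstract", "")
        tasks = predict(abstract)  # multi-label inference helper from above
    return render_template("index.html", tasks=tasks)

if __name__ == "__main__":
    app.run()
```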

*Background Image Credit: The background image is not mine; it was taken from here.