
Scientific Paper Keywords Categorization

Project Development Journal

Problem Statement

Fetching paper abstracts and keywords, I will create a multi-label keyword classifier that can tag an abstract with the relevant keywords from a selected set.

Objective

Keywords are a necessary part of a scientific paper. They help search engines surface papers to users based on related topics, so choosing these words properly is really important. The goal here is to build a well-developed and optimized keyword categorizer that can classify a scientific paper among particular keywords based on its abstract.

Data Collection

To collect data, I decided to scrape the open-access papers available at IEEE. I created the scraper files using Selenium after inspecting the website. First, I collected the URLs of the papers using "url_scraper". Then, visiting those URLs, I fetched the abstract along with the IEEE and author keywords using "details_scraper". Despite some unpredictable issues, I managed to scrape the data and stored it in different .csv files. You can check out the scraper files within the "scrapers" folder.
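The repository's scrapers aren't reproduced here; the sketch below only illustrates the two-stage Selenium pattern described above. The CSS selectors, file names, and waits are assumptions, not the repo's actual code.

```python
# Minimal sketch of the two-stage scraping pattern; selectors and file
# names are hypothetical, and IEEE Xplore's real markup may differ.
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

def scrape_details(url: str):
    driver.get(url)
    wait = WebDriverWait(driver, 15)
    abstract = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.abstract-text"))
    ).text
    keywords = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "a.keyword-link")]
    return abstract, keywords

with open("papers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["abstract", "keywords"])
    for url in open("paper_urls.txt"):        # URLs gathered by the first-stage scraper
        try:
            abstract, keywords = scrape_details(url.strip())
            writer.writerow([abstract, "; ".join(keywords)])
        except Exception:
            continue                          # page layouts vary; skip and revisit later

driver.quit()
```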

Data Cleaning & Pre-processing

Within almost all the columns, there were some NaN or redundant values. In the "abstracts" column, some values were repetitive or irrelevant; those were considered inappropriate, so their rows were deleted. Then I merged the IEEE and author keywords together and kept the most commonly used keywords based on a threshold value of 0.004 (a sketch of this filter follows the table below). After that, I dropped the rows having NaN values or only rare keywords and created the final dataset. You can check the data cleaning part in the "data_cleaning" notebook. The following table shows an overview of the initial and final csv files. The final dataset is available here.
| File Name | Data Type | Rows | Columns |
| --- | --- | --- | --- |
| merged_data | Tabular Text | 40457 | 3 |
| papers_final_data | Tabular Text | 36398 | 2 |
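The journal doesn't define the 0.004 threshold precisely; my reading is that a keyword is kept if it appears in at least 0.4% of the papers. A minimal sketch of that filter, with hypothetical column names and delimiter:

```python
# Sketch of the keyword-frequency filter; the column names, the delimiter,
# and the threshold interpretation are assumptions.
from collections import Counter
import pandas as pd

df = pd.read_csv("merged_data.csv")
df["keywords"] = df["keywords"].str.split("; ")

counts = Counter(kw for kws in df["keywords"].dropna() for kw in kws)
min_count = 0.004 * len(df)                   # keep keywords in >= 0.4% of rows
common = {kw for kw, c in counts.items() if c >= min_count}

df["keywords"] = df["keywords"].apply(
    lambda kws: [kw for kw in kws if kw in common] if isinstance(kws, list) else []
)
df = df[df["keywords"].str.len() > 0]         # drop rows left with no common keyword
df.to_csv("papers_final_data.csv", index=False)
```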

Dataloader Creation

I encoded the unique keywords and then performed row-wise indexing of each row's keywords. Since the pre-processing may differ across models, I imported the pre-defined configuration for each model. I split the dataset into a 90% training set and a 10% validation set, and finally created the dataloaders with a batch size of 16. You can check the data loader creation part in the "dataloader_creation" notebook.
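A minimal sketch of the encoding and split under stated assumptions: scikit-learn's MultiLabelBinarizer for the label matrix and a plain PyTorch DataLoader. The repo's notebooks may wire this differently, and `df` is the cleaned dataframe from the previous step.

```python
# Sketch of label encoding, the 90/10 split, and dataloader creation;
# `df` and the tokenizer checkpoint are assumptions.
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from transformers import AutoTokenizer

mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(df["keywords"])    # one multi-hot row per abstract

train_texts, val_texts, y_train, y_val = train_test_split(
    df["abstract"].tolist(), labels, test_size=0.10, random_state=42
)

# Each model family ships its own pre-defined pre-processing configuration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(train_texts, truncation=True, padding=True, return_tensors="pt")

train_ds = torch.utils.data.TensorDataset(
    enc["input_ids"], enc["attention_mask"], torch.tensor(y_train, dtype=torch.float)
)
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=16, shuffle=True)
```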

Model Experimentations

To classify an abstract into multiple labels, I chose BERT and two of its variants:
  • BERT
  • DistilBERT
  • RoBERTa
Training process (a sketch follows the list):
  1. I froze the model with its pre-trained weights and searched for a suitable learning-rate range.
  2. Then I trained the model for 10 epochs using the fit_one_cycle() method.
  3. After that, I unfroze the trained model, selected a learning-rate range again, and trained for another 10 epochs.
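Since fit_one_cycle() is fastai's API, the two-stage schedule presumably looks like the sketch below; the `learn` object and the learning-rate values are hypothetical.

```python
# Hedged sketch of the freeze/unfreeze schedule, assuming a fastai Learner
# named `learn` wraps the transformer; the learning rates are placeholders.
learn.freeze()                                 # stage 1: train only the classification head
learn.lr_find()                                # inspect the loss curve for a suitable range
learn.fit_one_cycle(10, lr_max=1e-3)

learn.unfreeze()                               # stage 2: fine-tune all layers
learn.lr_find()
learn.fit_one_cycle(10, lr_max=slice(1e-5, 1e-3))
```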
For BERT and DistilBERT, the whole training process gave satisfactory results. For RoBERTa, however, unfreezing and retraining caused overfitting, so it performs better in its frozen phase.

Model Evaluation

| Model | Micro Precision (%) | Micro Recall (%) | Micro F1 (%) | Weighted Precision (%) | Weighted Recall (%) | Weighted F1 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| BERT | 62.211 | 45.104 | 52.294 | 60.635 | 45.104 | 50.618 |
| DistilBERT | 65.810 | 40.588 | 50.209 | 63.739 | 40.588 | 48.119 |
| RoBERTa | 69.113 | 20.353 | 31.446 | 59.215 | 20.353 | 24.646 |
Looking at the evaluation table, all the models show high precision and low recall in every case, which is why the F1-scores drop so sharply. BERT and DistilBERT do not predict all the expected classes; their predictions are selective and precise, which yields high precision, but the many missed classes drive recall down. RoBERTa is even more extreme: it is the most precise of the three but predicts the fewest classes, so its recall and F1 collapse. Balancing both sides with the F1-score, BERT comes out best among these, so I chose it to move forward with the remaining tasks.
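For reference, the micro and weighted averages in the table correspond to scikit-learn's averaging modes; a sketch, where `y_true` and `y_pred` are hypothetical names for the binarized validation labels and predictions:

```python
# Sketch of computing the table's metrics; `y_true` and `y_pred` are
# assumed multi-hot matrices of shape (n_samples, n_keywords).
from sklearn.metrics import precision_recall_fscore_support

for avg in ("micro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg}: precision={p:.3%} recall={r:.3%} f1={f1:.3%}")
```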

Model Compression

I compressed the model using ONNX. The model size was reduced by 87.45%, but the reduction costs some prediction performance. To evaluate this, I used the micro-average F1-score as the performance metric; the compressed model shows a 2.8% relative drop (a hedged sketch of a typical compression pipeline follows the table).
| Model | Size (MB) | Micro F1 (%) |
| --- | --- | --- |
| BERT | 838.8 | 52.2939 |
| Compressed BERT | 105.3 | 50.8322 |
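The journal doesn't spell out the compression steps. A size reduction of this magnitude is consistent with ONNX export followed by dynamic int8 quantization, so the sketch below shows that recipe as an assumption, not the repo's actual pipeline:

```python
# Hedged sketch: ONNX export plus dynamic int8 quantization; the file
# names, dummy input, and the recipe itself are assumptions.
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType

dummy_ids = torch.ones(1, 512, dtype=torch.long)
torch.onnx.export(
    model,                                     # the fine-tuned BERT classifier
    (dummy_ids, dummy_ids),                    # (input_ids, attention_mask)
    "bert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
)

# Dynamic quantization stores weights as int8, shrinking the file on disk.
quantize_dynamic("bert.onnx", "bert_quantized.onnx", weight_type=QuantType.QInt8)
```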

Deployment

I deployed the model using Hugging Face. Check out the deployment here.
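Hugging Face deployments of this kind are commonly Gradio Spaces; the minimal app below is an assumption about the wiring, not the Space's actual code, and `keyword_names` is a placeholder for the decoded label list.

```python
# Hypothetical Gradio app serving the quantized ONNX model; the file name,
# checkpoint, threshold, and `keyword_names` list are all assumptions.
import gradio as gr
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
session = ort.InferenceSession("bert_quantized.onnx")
keyword_names = ["deep learning", "image segmentation"]  # placeholder for the real vocabulary

def predict(abstract: str) -> dict:
    enc = tokenizer(abstract, truncation=True, return_tensors="np")
    logits = session.run(None, {"input_ids": enc["input_ids"],
                                "attention_mask": enc["attention_mask"]})[0]
    probs = 1 / (1 + np.exp(-logits[0]))       # sigmoid per label (multi-label)
    return {kw: float(p) for kw, p in zip(keyword_names, probs) if p > 0.5}

gr.Interface(fn=predict, inputs=gr.Textbox(lines=8), outputs=gr.Label()).launch()
```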

Integration to Website

I integrated the model into a website using Render. Check out the live website here.
(Screenshots: Home Page and Prediction Result)

Short Video Demonstration

I prepared a short video demonstration and shared it as a LinkedIn post. Check it out here.

References

  • Fallah, Haytame, et al. "Adapting transformers for multi-label text classification." CIRCLE (Joint Conference of the Information Retrieval Communities in Europe) 2022. 2022.

Challenges Faced

  • After a scraper script had run for a long time, Chrome sometimes showed an "Aw, Snap!" message. In that case, I just reloaded the webpage manually, and the scraper started working properly again.

  • The required web elements weren't distributed the same way across all webpages. The details scraper worked fine for some pages but threw exceptions for others. So, I had to rewrite some of the code to handle the differing layouts and generalize it.
  • As I had to collect a lot of data, I created several scrapers of the same type and ran them simultaneously from different indexes. It sped up my data collection a bit, although it depended heavily on internet speed.
  • Some abstracts contained values like "Retracted.", "Final version", "IEEE Plagiarism Policy." and other unusable values. So, I went through the whole dataset and found these values manually during the data cleaning process.
  • In the end, it took a huge amount of time to collect a desirable amount of data. So, I had to wait patiently.
