Eleonore9/get-articles-meaning

Aim:

Help researchers with the bibliography process. Don't miss interesting/relevant papers!

Idea:

Extract meaning from the content of scientific articles and use it to classify new articles.

diagram

Tools:

IPython/Jupyter notebook, Python 2, Matplotlib, Gensim, Scikit-learn.

Data:

I used eLife Sciences articles found on GitHub, now stored in my elife-articles/ directory.

Project:

  1. I parsed the XML articles using the Beautiful Soup library.

I had a few Unicode-induced nightmares :-/ but I've been told it'll get better once I (finally) move to Python 3.

I chose to focus on articles only marked with the topic "Cell biology" or "Neuroscience" for my two categories A and B (see diagram above).
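Here is a minimal sketch of the parsing and filtering step. The sample document, the `subject` and `body` tag names, and the helper names are assumptions made for illustration; the real eLife XML is much larger and may be laid out differently. I also read "only marked with the topic" as "has exactly one topic", which is an interpretation.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed article (not a real eLife file).
SAMPLE_XML = """
<article>
  <front>
    <subj-group subj-group-type="heading">
      <subject>Neuroscience</subject>
    </subj-group>
    <title-group><article-title>A toy title</article-title></title-group>
  </front>
  <body><p>Neurons fire.</p><p>Synapses adapt.</p></body>
</article>
"""

WANTED_TOPICS = {"Cell biology", "Neuroscience"}


def parse_article(xml_text):
    """Return (topics, body_text) for one article."""
    # "html.parser" ships with Python; Beautiful Soup's "xml" parser needs lxml.
    soup = BeautifulSoup(xml_text, "html.parser")
    topics = [tag.get_text(strip=True) for tag in soup.find_all("subject")]
    body = soup.find("body")
    text = body.get_text(" ", strip=True) if body else ""
    return topics, text


def single_wanted_topic(topics):
    """Return the topic if the article has exactly one topic and it is wanted."""
    if len(topics) == 1 and topics[0] in WANTED_TOPICS:
        return topics[0]
    return None


topics, text = parse_article(SAMPLE_XML)
category = single_wanted_topic(topics)
```

Articles for which `single_wanted_topic` returns `None` would simply be skipped when building the two categories.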

  2. I extracted terms/topics representative of each category.

I used LSI (Latent Semantic Indexing) first and was then recommended to try LDA (Latent Dirichlet Allocation). For both models I used the Gensim library.

  3. I trained an NB (naive Bayes) classifier and a KNN (k-nearest neighbours) classifier on the data for the "Cell biology" and "Neuroscience" articles.

I tried to classify a new article based on the presence or absence of certain terms returned as most frequent by the LSI model.
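The presence/absence idea can be sketched with scikit-learn as follows. The term list, the training documents and the choice of `BernoulliNB` and `n_neighbors=3` are assumptions for this example; the notebooks may use a different naive Bayes variant and different settings.

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical terms, standing in for those returned by the topic model.
TERMS = ["neuron", "synapse", "membrane", "mitosis"]


def presence_vector(tokens, terms=TERMS):
    """Encode a document as 1/0 flags: is each term present or absent?"""
    present = set(tokens)
    return [1 if term in present else 0 for term in terms]


# Toy training data (invented for illustration).
train_docs = [
    ["neuron", "synapse"],
    ["neuron", "cortex"],
    ["synapse", "axon"],
    ["membrane", "mitosis"],
    ["mitosis", "organelle"],
    ["membrane", "cytoskeleton"],
]
train_labels = ["Neuroscience"] * 3 + ["Cell biology"] * 3

X = [presence_vector(doc) for doc in train_docs]
nb = BernoulliNB().fit(X, train_labels)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, train_labels)

# Classify a new, unseen article.
new_article = presence_vector(["synapse", "neuron", "plasticity"])
nb_label = nb.predict([new_article])[0]
knn_label = knn.predict([new_article])[0]
```

Terms outside `TERMS` (here "plasticity") are ignored, so the classifiers only ever see the binary flags.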

There wasn't much difference between the NB and KNN classifiers.

While the accuracy was quite high (> 80%), the precision and recall showed that the predictions were biased towards the category with the most articles in the training set.
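To illustrate why accuracy alone can hide this kind of bias, here is a small self-contained sketch. The labels and counts are invented and are not the project's results: a classifier that almost always predicts the larger class still scores a high accuracy, while its recall on the smaller class is poor.

```python
from __future__ import division  # keeps "/" as true division on Python 2


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented test set: 85 articles in the larger class, 15 in the smaller one.
y_true = ["majority"] * 85 + ["minority"] * 15
# The classifier labels all but 3 of the minority articles as "majority".
y_pred = ["majority"] * 85 + ["majority"] * 12 + ["minority"] * 3

overall = accuracy(y_true, y_pred)
minority_precision, minority_recall = precision_recall(y_true, y_pred, "minority")
```

With these made-up numbers, 88 of 100 predictions are correct, yet only 3 of the 15 minority articles are found.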

=> I need more data!

Slides:

This project was presented at PyData London 2015 and PyCon UK 2015 conferences. Here are the latest slides.


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

About

Getting meaning out of scientific articles using nltk/gensim and scikit-learn
