🎣 A crawler for the "Open Science Framework" website.

johanneshagspiel/osf-crawler

OSF Crawler

This repository contains a crawler for the Open Science Framework website.

Features

This crawler:

  • downloads information about registered research projects or preprints from the Open Science Framework website, either by crawling the site or by calling the official API, and stores it in a MongoDB database.
  • cleans the text with the natural language processing library spaCy (for example, removing stop words and lemmatizing the remaining words), then applies the LDA algorithm from the topic modelling framework gensim to determine which topics the downloaded research covers.
  • outputs the most frequent tags, subjects, and words used in titles and descriptions as an Excel file, and reports the topics found by gensim together with the coherence score of the LDA model.
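
The frequency-counting step in the last bullet can be illustrated with a small, standard-library-only sketch. It is not the repository's code: the stop-word set is a hand-written stand-in for spaCy's cleanup, there is no lemmatization, and the record layout and the names `STOP_WORDS`, `tokenize`, and `most_frequent` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for spaCy's stop-word handling (assumption, not the crawler's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are", "with"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def most_frequent(records: list[dict], field: str, top_n: int = 3) -> list[tuple[str, int]]:
    """Return the top_n most common values of `field` across all records.

    String fields (e.g. a title) are tokenized first; list fields
    (e.g. tags) are counted item by item.
    """
    counts = Counter()
    for record in records:
        value = record.get(field, "")
        if isinstance(value, str):
            counts.update(tokenize(value))
        else:
            counts.update(value)
    return counts.most_common(top_n)


# Made-up records, shaped like documents that might be read back from the database.
records = [
    {"title": "Replication of a memory study", "tags": ["memory", "replication"]},
    {"title": "Memory and attention in children", "tags": ["memory", "attention"]},
]

print(most_frequent(records, "tags", 1))   # [('memory', 2)]
print(most_frequent(records, "title", 1))  # [('memory', 2)]
```

In the actual crawler, spaCy would supply the stop-word removal and lemmatization, and the resulting counts would be written to a spreadsheet with OpenPyXL rather than printed.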

Tools

Purpose                       Name
Programming language          Python 3.10
Version control system        Git
HTML parser                   BeautifulSoup
Browser automation library    Pyppeteer
NLP library                   spaCy
Output generator              OpenPyXL
Asynchronous framework        asyncio
Topic modelling framework     gensim
NoSQL database                MongoDB

Licence

OSF Crawler is published under the MIT licence, which can be found in the LICENSE file.

References

The "Open Science Framework" logo was taken from the University of Oklahoma Libraries website.
