OSF Crawler

This repository contains a crawler for the Open Science Framework website.

Features

This crawler:

automatically downloads information about registered research projects or preprints from the Open Science Framework website, either by crawling the website or by using the official API, and stores it in a MongoDB database.
uses the natural language processing library spaCy for common data cleanup steps, such as removing stop words and lemmatizing the remaining words, and then applies the LDA algorithm of the topic modelling framework gensim to determine which topics the downloaded research covers.
outputs the most frequent tags, subjects, and words used in the titles and descriptions as an Excel file, along with the topics found by gensim and the corresponding coherence score of the LDA algorithm.
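
The frequency counting described in the last feature can be illustrated with a minimal sketch. This is not the crawler's own code: it uses only the Python standard library, the record fields (`title`, `tags`) and the helper names `tokenize` and `most_frequent` are hypothetical, and the small hard-coded stop word set is a simplified stand-in for the spaCy cleanup step (no lemmatization is performed here).

```python
from collections import Counter
import re

# Hypothetical, tiny stop word list; the crawler itself relies on spaCy for this step.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def most_frequent(records, top_n=3):
    """Count tags and title words across records and return the top_n of each."""
    tag_counts = Counter()
    word_counts = Counter()
    for record in records:
        # Normalise tag case so "Memory" and "memory" are counted together.
        tag_counts.update(tag.lower() for tag in record.get("tags", []))
        word_counts.update(tokenize(record.get("title", "")))
    return tag_counts.most_common(top_n), word_counts.most_common(top_n)


# Hypothetical records; the field names are assumptions, not the crawler's schema.
records = [
    {"title": "Replication of a memory study", "tags": ["memory", "replication"]},
    {"title": "Memory and attention in children", "tags": ["Memory", "attention"]},
]
top_tags, top_words = most_frequent(records)
```

In this sketch, `top_tags` and `top_words` are lists of `(item, count)` pairs, which is a convenient shape for writing rows to a spreadsheet with a library such as OpenPyXL.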

Tools

Purpose	Name
Programming language	Python 3.10
Version control system	Git
HTML parser	BeautifulSoup
Browser automation library	Pyppeteer
NLP library	spaCy
Output generator	OpenPyXL
Asynchronous framework	asyncio
Topic modelling framework	gensim
NoSQL database	MongoDB

Licence

OSF Crawler is published under the MIT licence, which can be found in the LICENSE file.

References

The "Open Science Framework" logo was taken from the University of Oklahoma Libraries website.
