Publication figure web scraping

This tool scrapes NCBI's PMC publications and downloads the figures from open access, publicly available publications.

Requirements

Node.js >= 16.13.1
RAM >= 4GB
Internet connection with a download speed greater than 7mb/s

Installation & Setup

If you would like to run or modify the publication figure web scraping tool locally, clone the repository with git by running the following command:

git clone https://github.com/AlexJSully/Publication-Figures-Web-Scraping.git

Then run npm install followed by npm start. The tool runs within your Node.js environment. On Windows, the script needs to be run from a terminal with administrator privileges.

The images are downloaded locally into the repository directory under src/data/figures/{species}/{PMC ID}.
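As a rough illustration of that layout, the sketch below builds the folder path for one publication. The helper name figureDirectory, the example PMC ID, and the assumption that the species folder uses the scientific name unchanged are all hypothetical and not taken from the tool's source.

```javascript
const path = require('node:path');

// Hypothetical helper (not part of the tool): builds the folder where a
// publication's figures would be stored, following the documented layout
// src/data/figures/{species}/{PMC ID}.
function figureDirectory(rootDir, species, pmcId) {
  return path.join(rootDir, 'src', 'data', 'figures', species, pmcId);
}

// 'PMC1234567' is a made-up ID used only for illustration.
console.log(figureDirectory('.', 'Arabidopsis thaliana', 'PMC1234567'));
```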

If you would like to run the scraper against commercial use publications, download oa_comm_use_file.list.txt from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/ and then run npm run process. Once that is done, pass true to the init function call in index.js (await init(true);).

The publication figure scraper resumes from where it last left off. To reset the scraper, empty species-pmid-list.json, data-retrieved.json and data-empty-pubs.json so that each contains only an empty JSON object ({}).
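If you prefer to script the reset, a minimal sketch could look like the following. The function name resetScraperState is hypothetical, and the directory that holds the three files is not stated here, so it is passed in as an argument; check where the files live in your checkout before running it.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// The three progress files named in the reset instructions.
const STATE_FILES = ['species-pmid-list.json', 'data-retrieved.json', 'data-empty-pubs.json'];

// Hypothetical helper (not part of the tool): overwrites each progress file
// in dataDir with an empty JSON object so the scraper starts from scratch.
function resetScraperState(dataDir) {
  for (const name of STATE_FILES) {
    fs.writeFileSync(path.join(dataDir, name), '{}\n');
  }
}

// Example call; the directory is an assumption, adjust it to your checkout:
// resetScraperState('src/data');
```

Writing the files rather than deleting them keeps them present for the next run, which matches the instruction to leave an empty object in each one.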

To add support for more species, add the species to species.json and then run npm start. This JSON currently also includes each species' common aliases, which are not used yet but may be useful in the future. To scrape only specific species, change speciesList in index.js to an array of the scientific name(s) to scrape. For example: speciesList = ['Arabidopsis thaliana']; // Or whatever species name(s) you would like to scrape. Currently, it is set to scrape all species within the species.json file.

If your internet connection's download speed is not greater than 7mb/s, you will need to change all the Axios request timeouts in data-retrieval.js to a value of at least half of your speed (e.g. for a download speed of 10mb/s, set the timeout to 5s).
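The rule of thumb above can be sketched as a small calculation. The function name timeoutForSpeed is hypothetical, and how data-retrieval.js actually sets its timeouts is not shown here. The one detail taken from Axios itself is that its timeout request option is given in milliseconds, so 5s becomes 5000.

```javascript
// Hypothetical helper (not part of the tool): applies the README's rule of
// thumb, a timeout in seconds equal to half the download speed, and converts
// the result to milliseconds, the unit Axios expects for `timeout`.
function timeoutForSpeed(downloadSpeed) {
  if (!(downloadSpeed > 0)) {
    throw new RangeError('downloadSpeed must be a positive number');
  }
  return Math.ceil((downloadSpeed / 2) * 1000);
}

// A download speed of 10mb/s gives a 5s timeout, i.e. 5000 ms.
const requestConfig = { timeout: timeoutForSpeed(10) };
console.log(requestConfig);

// With Axios installed, the config would be passed along with the request,
// for example: axios.get(url, requestConfig)
```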

Known issues

We aim to make this tool as reliable as possible, but there may still be some unforeseen bugs. If you find one that is not listed here, feel free to create a bug report so we can fix it.

None at the moment... Help us find some!

Contributing

Please read CONTRIBUTING.md for more details.

License

GPL-2.0

Maintenance Mode

This project is currently in maintenance mode. This means that:

Only critical bug fixes and security updates will be addressed.
New feature requests are unlikely to be implemented.

Sponsorship

If you want to support my work, you can do so through the following methods:

BTC - 3Lp4pwF5nXqwFA62BYx4DSvDswyYpskBog
ETH - 0xc6EB17BD7cbe5976Bfc4f845669cD66Ff340a1A2
PayPal - paypal.me/alexjsully

Authors

Alexander Sullivan - GitHub, Twitter, ORCiD, LinkedIn, Website
