xiongzhp/paper_spider

Disclaimer: This project is intended only for studying the Scrapy spider framework and the MongoDB database; it must not be used for commercial or other personal purposes. Responsibility for any improper use lies with the individual user.

  • This is a Python Scrapy project for crawling the Physical Review journal sites.
  • The project can crawl up to 5 million papers a day, depending on your bandwidth.
  • The crawler runs 18 concurrent requests at a time. You can raise the number of concurrent requests to crawl more papers per day; for the specific settings, see Pre-boot configuration below.

Environment, Architecture

Language: Python 2.7

Environment: macOS, 8 GB RAM

Database: MongoDB
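
Scraped items reach MongoDB through a Scrapy item pipeline. The sketch below is a generic pymongo-backed pipeline, assuming a database named paper_spider and a papers collection; it illustrates the mechanism only and is not the project's own pipeline code.

    # Generic pymongo-backed Scrapy item pipeline (illustrative only; the
    # database and collection names are assumptions, not the project's code).
    import pymongo


    class MongoPipeline(object):

        def __init__(self, mongo_uri='mongodb://localhost:27017',
                     mongo_db='paper_spider'):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        def open_spider(self, spider):
            # Called by Scrapy when the spider starts: open one client per crawl.
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Every scraped item is stored as one MongoDB document.
            self.db['papers'].insert_one(dict(item))
            return item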

  • Built mainly on the Python Scrapy spider framework.
  • Each request is sent with a cookie and a User-Agent drawn at random from a cookie pool and a UA pool.
  • start_requests issues five Requests according to the paperSpider classification, so the five categories are crawled at the same time.
  • Supports paginated crawling: follow-up page requests are pushed back onto the request queue (see the sketch after this list).
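
A minimal sketch of the flow described in the list above, assuming hypothetical cookie/UA pool files, URL patterns, and CSS selectors; it is illustrative, not the project's actual spider.

    # Minimal sketch of the flow described above -- five category start
    # requests, a random cookie/User-Agent per request, and pagination links
    # fed back into the request queue. The pool files, URL patterns and CSS
    # selectors are assumptions for illustration, not the project's real code.
    import json
    import random

    import scrapy

    # Hypothetical cookie and User-Agent pools.
    COOKIE_POOL = json.load(open('cookies.json'))
    UA_POOL = json.load(open('user_agents.json'))


    class PaperSpider(scrapy.Spider):
        name = 'paperSpider'
        # Five example categories crawled concurrently.
        categories = ['highlights', 'letters', 'rapid', 'editors-suggestions', 'milestones']

        def start_requests(self):
            for category in self.categories:
                url = 'https://journals.aps.org/prl/%s' % category  # assumed pattern
                yield self.make_request(url, self.parse)

        def make_request(self, url, callback):
            # Attach a randomly drawn cookie and User-Agent to every request.
            return scrapy.Request(
                url,
                callback=callback,
                cookies=random.choice(COOKIE_POOL),
                headers={'User-Agent': random.choice(UA_POOL)},
            )

        def parse(self, response):
            for paper in response.css('div.article'):  # assumed selector
                yield {
                    'title': paper.css('h5 a::text').extract_first(),
                    'url': response.urljoin(paper.css('h5 a::attr(href)').extract_first()),
                }
            # Paging support: follow the "next" link and put it back in the queue.
            next_page = response.css('a.pagination-next::attr(href)').extract_first()
            if next_page:
                yield self.make_request(response.urljoin(next_page), self.parse)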

Instructions for use

Pre-boot configuration

  • Install MongoDB and start it (no extra configuration needed)
  • Install Scrapy
  • Install the Python dependencies: pymongo and requests (json is part of the standard library)
  • Adjust the configuration as needed, e.g. the request interval and the number of concurrent requests; see the settings excerpt below
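
These knobs live in the project's Scrapy settings. The excerpt below uses the standard Scrapy setting names for concurrency and request interval; the 18 concurrent requests match the figure mentioned earlier, but the other values, the pipeline path, and the MONGO_* names are examples rather than the project's defaults.

    # settings.py (excerpt) -- standard Scrapy knobs for crawl rate.
    # Values are examples; tune them to your bandwidth. The pipeline path and
    # the MONGO_* names are assumptions, not necessarily the project's own.
    CONCURRENT_REQUESTS = 18     # parallel requests ("threads") per crawl
    DOWNLOAD_DELAY = 0.5         # interval in seconds between requests to a site
    ROBOTSTXT_OBEY = False

    ITEM_PIPELINES = {
        'paper_spider.pipelines.MongoPipeline': 300,
    }
    MONGO_URI = 'mongodb://localhost:27017'
    MONGO_DATABASE = 'paper_spider'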

Start up

  • cd paper_spider
  • python quickstart.py
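
The two commands above launch the crawl from quickstart.py. Its contents are not shown here, but a common way to start a Scrapy crawl programmatically is via CrawlerProcess, sketched below under the assumption that the spider class and module path are those of the PaperSpider sketch above.

    # Hypothetical quickstart.py: launch the spider programmatically with the
    # project settings. The import path of PaperSpider is assumed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from paper_spider.spiders.paper import PaperSpider

    if __name__ == '__main__':
        process = CrawlerProcess(get_project_settings())
        process.crawl(PaperSpider)
        process.start()  # blocks until the crawl finishes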

More configurations

  • The default configuration only crawls "highlights" papers from the journals Physical Review A, B, and D; you can include more journals in paperspdier_type.py.
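
The contents of paperspdier_type.py are not reproduced here; a plausible shape for such a journal table is a simple mapping from journal codes to listing URLs, for example:

    # Hypothetical layout of a journal/type table such as paperspdier_type.py;
    # the real file's keys and URL patterns may differ.
    JOURNALS = {
        'PRA': 'https://journals.aps.org/pra/highlights',
        'PRB': 'https://journals.aps.org/prb/highlights',
        'PRD': 'https://journals.aps.org/prd/highlights',
        # Add entries here to crawl more journals, e.g.:
        # 'PRL': 'https://journals.aps.org/prl/highlights',
    }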
