HEPcrawl

HEPcrawl is a harvesting library based on Scrapy (http://scrapy.org) for INSPIRE-HEP (http://inspirehep.net). It focuses on the automatic and semi-automatic retrieval of new content from all the sources the site aggregates, in particular content from major and minor publishers in the field of High-Energy Physics.

The project is currently in an early stage of development.

Installation for developers

We start by creating a virtual environment for our Python packages (the commands below assume virtualenvwrapper is installed):

mkvirtualenv hepcrawl
cdvirtualenv
mkdir src && cd src

Now we grab the code and install it in development mode:

git clone https://github.com/inspirehep/hepcrawl.git
cd hepcrawl
pip install -e .

Development mode ensures that any changes you make to the sources are automatically taken into account, so there is no need to reinstall the package after every change.

Finally, run the tests to make sure everything is set up correctly:

python setup.py test

Run example crawler

Thanks to the command line tools provided by Scrapy, we can easily test the spiders as we develop them. Here is an example using the simple Sample spider:

cdvirtualenv src/hepcrawl
scrapy crawl Sample -a source_file=file://`pwd`/tests/responses/world_scientific/sample_ws_record.xml
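
Under the hood, a spider of this kind is simply a Scrapy spider whose constructor accepts the source_file argument and yields parsed records. Below is a minimal Python sketch of that pattern; the XPath expressions and the yielded fields are illustrative placeholders, not the actual Sample spider code:

# Minimal sketch of a file-based spider; the parsing logic below is
# illustrative and does not reflect the real Sample spider.
import scrapy


class SampleSpider(scrapy.Spider):
    """Parse records from a local XML file passed via -a source_file=..."""

    name = "Sample"

    def __init__(self, source_file=None, *args, **kwargs):
        super(SampleSpider, self).__init__(*args, **kwargs)
        self.source_file = source_file

    def start_requests(self):
        # source_file is expected to be a file:// URL, as in the command above.
        yield scrapy.Request(self.source_file)

    def parse(self, response):
        # Yield one item per record node; the XPath is a placeholder.
        for node in response.xpath('//article'):
            yield {'title': node.xpath('string(./title)').extract_first()}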

Thanks for contributing!
