HEPcrawl

HEPcrawl is a harvesting library based on Scrapy (http://scrapy.org) for INSPIRE-HEP (http://inspirehep.net). It focuses on the automatic and semi-automatic retrieval of new content from all the sources the site aggregates, in particular content from major and minor publishers in the field of High-Energy Physics.

The project is currently in an early stage of development.

Installation for developers

We start by creating a virtual environment for our Python packages (the commands below assume virtualenvwrapper is installed):

mkvirtualenv hepcrawl
cdvirtualenv
mkdir src && cd src

Now we grab the code and install it in development mode:

git clone https://github.com/inspirehep/hepcrawl.git
cd hepcrawl
pip install -e .

Development mode ensures that any changes you make to the sources are automatically taken into account, so there is no need to reinstall the package after every change.

Finally, run the tests to make sure everything is set up correctly:

python setup.py test

Run example crawler

Thanks to the command line tools provided by Scrapy, we can easily test the spiders as we develop them. Here is an example using the simple Sample spider:

cdvirtualenv src/hepcrawl
scrapy crawl Sample -a source_file=file://`pwd`/tests/responses/world_scientific/sample_ws_record.xml
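
Under the hood, a spider of this kind is simply a Scrapy spider whose constructor accepts the source_file argument and yields parsed records. Below is a minimal Python sketch of that pattern; the XPath expressions and the yielded fields are illustrative placeholders, not the actual Sample spider code:

# Minimal sketch of a file-based spider; the parsing logic below is
# illustrative and does not reflect the real Sample spider.
import scrapy


class SampleSpider(scrapy.Spider):
    """Parse records from a local XML file passed via -a source_file=..."""

    name = "Sample"

    def __init__(self, source_file=None, *args, **kwargs):
        super(SampleSpider, self).__init__(*args, **kwargs)
        self.source_file = source_file

    def start_requests(self):
        # source_file is expected to be a file:// URL, as in the command above.
        yield scrapy.Request(self.source_file)

    def parse(self, response):
        # Yield one item per record node; the XPath is a placeholder.
        for node in response.xpath('//article'):
            yield {'title': node.xpath('string(./title)').extract_first()}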

Thanks for contributing!
