Apache Nutch and Apache Solr PoC on Docker Swarm

An ultra-small PoC that shows how to combine Apache Nutch and Apache Solr: Nutch crawls web pages and stores the results in Solr for querying.

Why?

This is a very, very, very simple implementation with which any website can be crawled. Anyone who can install Docker can run this.

Yeah, but why?

This example can be used for many purposes:

(Security) testing what the search tree of a published website looks like
Searching for text on unindexed (local?) websites to find content
As a starting point for building your own Google (please don't, just don't, this stuff doesn't scale, use storm-crawler + Hadoop for that)
Etc.

Prerequisites

Docker Desktop (enable sharing of the C: drive in its settings!)

How to use

docker-compose up
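
Once the stack is up and the crawl has indexed some pages, the results can be queried from Solr. The following is a minimal sketch, not part of this repository. It assumes that Solr's default port 8983 is published on localhost (check docker-compose.yml for the actual mapping) and that the core is the "mycore" core described under "How it works". The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumptions: Solr's default port 8983 is reachable on localhost and the
# core is named "mycore". Adjust both to match docker-compose.yml.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "mycore"


def build_select_url(query="*:*", rows=10, base=SOLR_BASE, core=CORE):
    """Build a URL for Solr's standard /select request handler."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"


def extract_docs(payload):
    """Return (numFound, docs) from a Solr JSON response body."""
    response = json.loads(payload)["response"]
    return response["numFound"], response["docs"]


def search(query="*:*", rows=10):
    """Send the query to Solr and return (numFound, docs)."""
    with urllib.request.urlopen(build_select_url(query, rows)) as resp:
        return extract_docs(resp.read().decode("utf-8"))
```

Calling search() with the default query "*:*" returns the total number of indexed documents and the first ten of them. A field query such as search("content:nutch") only works if the index has a field with that name; which fields exist depends on the mapping in index-writers.xml.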

How it works

A default Solr instance with the default "mycore" core is used to store the Nutch crawl results. The approach of this PoC is to use as little custom configuration as possible, so it can serve as a starting point for other uses. Some important files:

seed.txt: The seed URLs from which the crawl starts
regex-urlfilter.txt: The filter which restricts the crawl to the targeted URLs
index-writers.xml: The Nutch config which is used to link Nutch and Solr
docker-compose.yml: The infrastructure configuration, including the Nutch start command
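
To give a feel for how the rules in regex-urlfilter.txt select URLs, here is a small sketch of the rule semantics as I understand them from the Nutch documentation: each rule is a regular expression prefixed with "+" (accept) or "-" (reject), the first rule that matches decides, and a URL that matches no rule is dropped. This is an illustration in Python, not Nutch's own code. Nutch uses Java regular expressions, so edge cases may differ, and the example rules and the example.com host are hypothetical.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for keep, pattern in rules:
        if pattern.search(url):
            return keep
    return False


# Hypothetical rules: skip images, then keep only pages under example.com.
EXAMPLE_RULES = parse_rules(r"""
# skip common image suffixes
-\.(gif|jpg|png)$
# accept only the target site
+^https?://(www\.)?example\.com/
""")
```

With these rules, accepts(EXAMPLE_RULES, "https://example.com/docs/") is accepted, while an image on the same site or any page on another host is rejected. Rule order matters: placing the "+" rule first would let the image URLs through.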

Adjusting this demo

The /nutch folder contains all Nutch configuration.

Author

Sebastiaan Raven

basraven/nutch-solr-integration
