simpleNutchSolrSetup

Hemendra Kumar edited this page Mar 14, 2015 · 8 revisions

This tutorial will create a simple Nutch 2.2.1 + Solr 4.3.1 setup.

Since version 2.x, Nutch uses Apache Gora for storage, so you have to choose a specific Gora datastore backend. This tutorial uses HBase 0.90.4.

Be careful: only certain versions of these tools work together seamlessly, so don't automatically pick the latest version of each program.

Download

Create a new directory, download these files and extract them. We will call this directory trynutch in this tutorial.

  • Nutch. This tutorial uses Nutch 2.2.1.
  • HBase. This tutorial uses HBase 0.90.4.
  • Solr. This tutorial uses Solr 4.3.1.

Configure HBase

You need to set the HBase and ZooKeeper storage directories. Edit trynutch/hbase-0.90.4/conf/hbase-site.xml:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///path/to/trynutch/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/path/to/trynutch/zookeeper</value>
  </property>
</configuration>
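Because both values must be absolute paths, it can be convenient to generate the file instead of editing it by hand. The following is a minimal sketch, not part of the original setup: the function name `write_hbase_site` is hypothetical, and it assumes the first argument is the absolute path of your trynutch directory.

```shell
# Write an hbase-site.xml with the two storage directories used in this tutorial.
# Usage: write_hbase_site ABSOLUTE_TRYNUTCH_DIR OUTPUT_FILE
# (hypothetical helper; ABSOLUTE_TRYNUTCH_DIR must start with "/")
write_hbase_site() {
  base="$1"
  out="$2"
  cat > "$out" <<EOF
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file://$base/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>$base/zookeeper</value>
  </property>
</configuration>
EOF
}
```

For example, run from inside trynutch: `write_hbase_site "$(pwd)" hbase-0.90.4/conf/hbase-site.xml`.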

After this you should be able to start HBase with the following command:

$ ./trynutch/hbase-0.90.4/bin/start-hbase.sh

Once HBase is running, you can also open the HBase command-line shell:

$ ./trynutch/hbase-0.90.4/bin/hbase shell

You can stop HBase again with this command:

$ ./trynutch/hbase-0.90.4/bin/stop-hbase.sh

(On my machine stop-hbase.sh sometimes takes forever. Deleting trynutch/hbase and trynutch/zookeeper, clearing /tmp, and restarting a couple of times seems to fix this. Note that deleting these directories also deletes any data already stored in HBase.)

If you have trouble running HBase on an Ubuntu system, look at /etc/hosts and check whether your hostname and localhost map to the same IP address (127.0.0.1). On Ubuntu systems the hostname is nowadays usually mapped to 127.0.1.1 instead.
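A quick way to compare the two mappings is to look them up in the hosts file. This is only a sketch: the helper name `hosts_ip_for` is made up for illustration, and it reads the file directly rather than asking the system resolver.

```shell
# Print the first IP address that a hosts file maps NAME to.
# Usage: hosts_ip_for NAME [HOSTS_FILE]   (HOSTS_FILE defaults to /etc/hosts)
hosts_ip_for() {
  name="$1"
  file="${2:-/etc/hosts}"
  awk -v n="$name" '
    /^[[:space:]]*#/ { next }        # skip comment lines
    {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break         # ignore trailing comments
        if ($i == n) { print $1; exit }
      }
    }
  ' "$file"
}
```

Running `hosts_ip_for localhost` and `hosts_ip_for "$(hostname)"` then shows whether the two names resolve to different addresses.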

Configure Nutch

We need to set up a name for our web crawler. We also need to tell Nutch that we use HBase as the Gora datastore backend. Edit trynutch/apache-nutch-2.2.1/conf/nutch-site.xml:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>your-crawler-name</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>

Change this line in trynutch/apache-nutch-2.2.1/conf/gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Open trynutch/apache-nutch-2.2.1/ivy/ivy.xml. Scroll down to the section "Gora artifacts" and uncomment this line:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

Now we need to compile Nutch (since 2.x only source archives are available).

$ cd trynutch/apache-nutch-2.2.1/
$ ant runtime

(This might take a long time on the first run. On my machine it took 25 minutes.)

Configure Solr

The Solr schema that comes with Nutch is outdated.

Download this schema and save it as trynutch/solr-4.3.1/example/solr/collection1/conf/schema.xml.

Start Solr.

$ cd trynutch/solr-4.3.1/example/
$ java -jar start.jar

If Solr is running you should be able to access the following site:

http://localhost:8983/solr/admin/

Running Nutch

Make sure HBase and Solr are running.

To limit the crawl range for this tutorial, edit trynutch/apache-nutch-2.2.1/runtime/local/conf/regex-urlfilter.txt and change the last line to:

+^http://work-at-google.com
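In regex-urlfilter.txt each rule starts with + (accept) or - (reject), the first rule whose pattern matches decides, and a URL that matches no rule is rejected. The sketch below imitates that logic with grep -E so you can try rules before crawling. It is only an approximation: Nutch uses Java regular expressions, which differ from grep's extended syntax, and the helper name `url_passes` is hypothetical.

```shell
# Decide whether URL passes the rules in RULES_FILE (first matching rule wins).
# Returns 0 if accepted, 1 if rejected or if no rule matches.
# Usage: url_passes URL RULES_FILE
url_passes() {
  url="$1"
  rules="$2"
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;                  # skip blank lines and comments
    esac
    sign=$(printf '%s' "$line" | cut -c1)   # leading + or -
    pattern="${line#?}"                     # rest of the line is the regex
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      if [ "$sign" = "+" ]; then return 0; else return 1; fi
    fi
  done < "$rules"
  return 1
}
```

Note that the dots in the rule above are unescaped, so they match any character. Writing `+^http://work-at-google\.com` would be stricter.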

Crawl with Nutch.

$ cd trynutch/apache-nutch-2.2.1/runtime/local/
$ mkdir urls
$ echo "http://work-at-google.com" > urls/seed.txt
$ bin/nutch inject urls
$ bin/nutch generate -topN 5
$ bin/nutch fetch -all
$ bin/nutch parse -all
$ bin/nutch updatedb
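The generate, fetch, parse and updatedb steps form one crawl round, and repeating them lets the crawler follow links discovered in earlier rounds. A minimal sketch of such a loop, using only the commands shown above, could look like this. The function names and the `NUTCH` variable are assumptions made for illustration, and the script is meant to be run from runtime/local/ after the inject step.

```shell
# Path to the nutch launcher; override NUTCH to point elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# Run one generate/fetch/parse/updatedb round, stopping at the first failure.
# Usage: crawl_round TOPN
crawl_round() {
  "$NUTCH" generate -topN "$1" &&
  "$NUTCH" fetch -all &&
  "$NUTCH" parse -all &&
  "$NUTCH" updatedb
}

# Run several rounds in a row.
# Usage: run_rounds ROUNDS TOPN
run_rounds() {
  rounds="$1"
  topn="$2"
  i=1
  while [ "$i" -le "$rounds" ]; do
    crawl_round "$topn" || return 1
    i=$((i + 1))
  done
}
```

For example, `run_rounds 3 5` performs three rounds with at most 5 URLs generated per round.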

Now feed this data to Solr.

$ bin/nutch solrindex http://localhost:8983/solr/ -all

You can now search your data in Solr at http://localhost:8983/solr/#/collection1/query.
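To check from the command line that documents were indexed, you can query the select handler of the collection1 core directly. The sketch below only builds the request URL. The helper name `solr_select_url` is hypothetical, and the query string is passed through as-is without URL encoding, so it is suitable only for simple queries such as `*:*`.

```shell
# Build a Solr select URL that returns only the hit count (rows=0) as JSON.
# Usage: solr_select_url BASE_URL CORE QUERY
solr_select_url() {
  base="${1%/}"     # drop one trailing slash, if present
  core="$2"
  query="$3"        # not URL-encoded; keep it simple
  printf '%s/%s/select?q=%s&rows=0&wt=json\n' "$base" "$core" "$query"
}

# Example (requires a running Solr):
#   curl -s "$(solr_select_url http://localhost:8983/solr/ collection1 '*:*')"
```

The numFound field in the JSON response should be greater than zero if the indexing step succeeded.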

Sources
