renepickhardt edited this page Aug 13, 2013 · 13 revisions

See simpleNutchSolrSetup for a sample setup of Nutch.

See setupZookeeperHadoopHbaseTomcatSolrNutch for an advanced setup.

A general overview can be found in: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.105.5978&rep=rep1&type=pdf

Configuration

Default configuration resides in conf/nutch-default.xml, but you shouldn't change that file. Instead, copy the relevant settings to conf/nutch-site.xml and override them there.
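As a sketch, a minimal conf/nutch-site.xml override could look like this (http.agent.name is the property Nutch requires before crawling; the value shown is just a placeholder):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Only settings copied from nutch-default.xml and changed belong here. -->
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value> <!-- placeholder crawler name -->
  </property>
</configuration>
```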

conf/regex-urlfilter.txt filters URLs based on regular expressions. Allow URLs with +<regex> and disallow them with -<regex>. Careful: the default configuration allows anything that isn't explicitly disallowed (+.).
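A sketch of what such rules could look like (the domain is only an example, not part of any real configuration):

```
# disallow URLs containing characters that usually indicate dynamic pages
-[?*!@=]
# allow everything under example.org
+^https?://([a-z0-9-]+\.)*example\.org/
# disallow everything else (inverts the permissive +. default)
-.
```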

Since Nutch 2.x is only provided as a source distribution, configuration can be done either in nutchdir/conf or in nutchdir/runtime/local/conf. I'd recommend the former, because otherwise every recompile overwrites your settings. In turn, we have to recompile every time the configuration changes:

$ cd nutchdir/
$ ant runtime

Operating Nutch

There is a list of Nutch command-line commands in the official Nutch wiki:
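As a rough sketch, one crawl cycle in Nutch 2.x typically chains commands like these (exact options vary between versions, so treat this as an assumption to verify against the wiki):

```shell
bin/nutch inject urls/          # seed the web table with start URLs
bin/nutch generate -topN 1000   # select a batch of URLs to fetch
bin/nutch fetch -all            # fetch the generated batch(es)
bin/nutch parse -all            # parse the fetched content
bin/nutch updatedb              # update the web table with newly found links
```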

Metalcon-specific crawling

Sites we don't crawl:

  • wikipedia
  • facebook
  • myspace
  • youtube
  • last.fm
  • reverbnation
  • bandcamp
  • bandzone.cz
  • soundcloud
  • tape.tv

Tutorial

There is a nice tutorial at:

Glossary

  • batchId

    When generating URLs to be fetched later, a batchId can be assigned to a batch of generated URLs. This allows you to first generate multiple batches of URLs, and then fetch them later one after another without having to wait for one big fetch to finish.
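For example (hedged: flag spellings differ between Nutch versions, and the batch names here are made up), generating two batches first and fetching them one after another might look like:

```shell
bin/nutch generate -topN 500 -batchId batch-1
bin/nutch generate -topN 500 -batchId batch-2
bin/nutch fetch batch-1
bin/nutch fetch batch-2
```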

  • crawlId

Identifier that describes a crawl. Might it be useful to simply use timestamps to generate crawlIds?
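One way that idea could be sketched in shell: derive the crawlId from the current UTC timestamp (the crawl- prefix and naming scheme are just assumptions, not an established convention):

```shell
# Hypothetical scheme: crawlId = fixed prefix + UTC timestamp
CRAWL_ID="crawl-$(date -u +%Y%m%d%H%M%S)"
echo "$CRAWL_ID"
# e.g. crawl-20130813120000
```

Such an id is unique per second and sorts chronologically, which would make successive crawls easy to tell apart.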

Sources
