jorgelbg/indexer-links

links-extractor

Nutch 1.x plugin that allows the inlinks and outlinks of a webpage to be indexed. By default, the plugin ignores outlinks whose host matches the host of the webpage being indexed. To index those outlinks as well, add the following to your nutch-site.xml:

<property>
  <name>outlinks.host.ignore</name>
  <value>false</value>
</property>

The same applies to inlinks: by default, only inlinks coming from a host different from the host of the webpage are indexed. To index all inlinks, add the following to your nutch-site.xml configuration file:

<property>
  <name>inlinks.host.ignore</name>
  <value>false</value>
</property>

If you are only interested in the host portion of the inlinks and outlinks, you can configure the plugin to index only the host part of each URL. By default, the full URL is stored.

<property>
  <name>links.hosts.only</name>
  <value>true</value>
</property>
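
The combined effect of the three properties can be illustrated with a small, self-contained Java sketch. This is not the plugin's source code: the class `LinkFilterSketch`, the methods `filterLinks` and `hostOf`, and the example URLs are all hypothetical, and the sketch only mirrors the behaviour described above (skip same-host links when the corresponding `*.host.ignore` option is on, and keep only the host when `links.hosts.only` is on). Details such as case handling or de-duplication in the real plugin may differ.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkFilterSketch {

    /**
     * Hypothetical helper (not part of the plugin) that mimics the documented options.
     *
     * @param pageUrl        URL of the page being indexed
     * @param links          inlinks or outlinks of that page
     * @param ignoreSameHost stands in for outlinks.host.ignore / inlinks.host.ignore
     * @param hostsOnly      stands in for links.hosts.only
     */
    static List<String> filterLinks(String pageUrl, List<String> links,
                                    boolean ignoreSameHost, boolean hostsOnly) {
        String pageHost = hostOf(pageUrl);
        List<String> result = new ArrayList<>();
        for (String link : links) {
            String linkHost = hostOf(link);
            if (linkHost == null) {
                continue; // skip links without a parseable host
            }
            if (ignoreSameHost && linkHost.equalsIgnoreCase(pageHost)) {
                continue; // same host as the page: dropped when ignoring is on
            }
            // Store either just the host or the full URL.
            result.add(hostsOnly ? linkHost : link);
        }
        return result;
    }

    /** Returns the host of a URL, or null if it cannot be determined. */
    static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String page = "http://example.com/index.html";
        List<String> outlinks = Arrays.asList(
                "http://example.com/about",
                "http://other.org/page");

        // Documented defaults: same-host links ignored, full URL stored.
        System.out.println(filterLinks(page, outlinks, true, false));
        // Same-host links kept (host.ignore set to false).
        System.out.println(filterLinks(page, outlinks, false, false));
        // Same-host links ignored, only hosts stored (links.hosts.only set to true).
        System.out.println(filterLinks(page, outlinks, true, true));
    }
}
```

With the documented defaults, the first call keeps only `http://other.org/page`. Passing `false` for the ignore flag keeps both links, and turning on the hosts-only flag reduces the surviving link to `other.org`.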
