healthonnet/hon-lucene-siterank

Lucene/Solr Site Rank Function

Developer

Nolan Lawson

Health On the Net Foundation

License

Apache 2.0

Summary

Custom Solr Function Query that returns the Alexa site rank of an input URL, host, or domain. This can be useful for boosting web documents based on their Alexa rank.

In other words, it's a poor man's PageRank.

This module also performs some light caching, in order to show good etiquette to Alexa, who probably don't want folks hammering their servers.

Setup

First off, put the following JARs into your Solr's lib/ directory:

(Yes, I'm requiring Google Guava for this. It helps protect my sanity when I code in Java these days.)

Next, add the following definition to your solrconfig.xml:

<valueSourceParser name="siterank" class="org.healthonnet.lucene.siterank.SiteRankSourceParser">
    <bool name="doCache">true</bool>
    <str name="cacheSpec">concurrencyLevel=16,maximumSize=8192,softValues</str>
    <bool name="extractDomainFromUrl">true</bool>
</valueSourceParser>

Shown above are all the configuration parameters with their default values. You can leave them out if you're okay with the defaults.

Parameters

  • doCache: True if caching should be enabled.
  • cacheSpec: Configuration for the cache, in Guava CacheBuilderSpec format.
  • extractDomainFromUrl: Are you inputting full URLs, like http://www.google.com/mail? Then set this to true. Otherwise, if you're inputting stripped-down domain or host names, such as google.com, then set it to false. This is used as a performance improvement at the caching level, so we don't have to look up the same domain over and over again just because the URL is different.
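To illustrate why extractDomainFromUrl helps at the caching level, here is a minimal sketch of a cache keyed by host rather than by full URL. It is not the module's actual code: the class HostKeyedRankCache, the method hostOf, and the rankLookup callback are hypothetical names, and a plain ConcurrentHashMap stands in for the Guava cache the module uses. It also keeps the full host (www.google.com) as the key; whether the module trims that further is not shown here.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

// Illustrative sketch only; this is NOT the module's actual code.
final class HostKeyedRankCache {
    private final Map<String, Long> cache = new ConcurrentHashMap<>();
    // Hypothetical stand-in for the remote rank lookup.
    private final ToLongFunction<String> rankLookup;
    private final boolean extractDomainFromUrl;

    HostKeyedRankCache(ToLongFunction<String> rankLookup, boolean extractDomainFromUrl) {
        this.rankLookup = rankLookup;
        this.extractDomainFromUrl = extractDomainFromUrl;
    }

    // Reduces a full URL to its lower-cased host; input without a
    // parseable host (e.g. "google.com") is returned unchanged.
    static String hostOf(String url) {
        try {
            String host = new URI(url.trim()).getHost();
            return host == null ? url : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return url;
        }
    }

    // Looks up the rank, calling the remote source at most once per cache key.
    long rank(String input) {
        String key = extractDomainFromUrl ? hostOf(input) : input;
        return cache.computeIfAbsent(key, k -> rankLookup.applyAsLong(k));
    }

    public static void main(String[] args) {
        int[] lookups = {0};
        HostKeyedRankCache ranks =
                new HostKeyedRankCache(host -> { lookups[0]++; return 1L; }, true);
        ranks.rank("http://www.google.com/mail");
        ranks.rank("http://www.google.com/calendar");
        System.out.println(lookups[0]); // prints 1: both URLs share the key www.google.com
    }
}
```

With extractDomainFromUrl set to false, the same two URLs would be cached under two different keys and trigger two lookups.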

Usage

This module defines a new function called siterank().

The function takes in a string (either a full URL or a domain/host, depending on extractDomainFromUrl above) and outputs the reciprocal rank of the site, which is a double between 0.0 and 1.0. If the site is not found in the ranking, it returns 0.0.

The reciprocal rank is simply:

1.0 / rank

...so e.g. Google will probably have a reciprocal rank of 1.0 (1.0 / 1.0), WebMD might have 0.00239234 (1.0 / 418) and MyCoolHipsterSiteNobodyKnowsAbout.com might have 0.0000000198867735 (1.0 / 50284678).
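The arithmetic can be checked with a few lines of Java. This is only a sketch: the class name ReciprocalRank is made up for illustration, and treating a rank of zero or less as "not found" is an assumption, not something the module documents.

```java
import java.util.Locale;

// Illustrative sketch of the reciprocal-rank arithmetic; not the module's code.
final class ReciprocalRank {
    // Returns 1.0 / rank. A rank <= 0 is treated as "not found" (an
    // assumption made for this sketch) and maps to 0.0.
    static double of(long rank) {
        return rank > 0 ? 1.0 / rank : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(of(1));                                      // prints 1.0
        System.out.println(String.format(Locale.ROOT, "%.8f", of(418))); // prints 0.00239234
        System.out.println(of(0));                                      // prints 0.0
    }
}
```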

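Most likely you will want to wrap this function in something like exp() to smooth the values and to handle cases where the function returns 0.0: a boost multiplies the document's score, so a raw 0.0 would wipe the score out, whereas exp(0.0) is 1.0 and leaves it unchanged. So the recommended usage is: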

exp(siterank(myUrlOrHostField))

...which you can use as a boost function, e.g.

http://mySite:8983/solr/select?q={!boost b=exp(siterank(myUrlOrHostField))}*:*

So for instance, in the above examples, Google would have a score of 2.71828, WebMD would get 1.00239520844, and the hipster site would get 1.00000001989. Tweak the formula as you see fit.
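The figures above can be reproduced with Math.exp. The sketch below uses a hypothetical ExpBoost class and simply shows that an unranked site (reciprocal rank 0.0) ends up with a neutral multiplier of 1.0 rather than 0.0.

```java
// Illustrative sketch of the exp() wrapping; not the module's code.
final class ExpBoost {
    // Maps a reciprocal rank in [0.0, 1.0] to a multiplier in [1.0, e].
    static double boost(double reciprocalRank) {
        return Math.exp(reciprocalRank);
    }

    public static void main(String[] args) {
        System.out.println(boost(1.0));               // top-ranked site, roughly 2.71828
        System.out.println(boost(1.0 / 418));         // roughly 1.0023952
        System.out.println(boost(0.0));               // unranked site: neutral multiplier
    }
}
```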

See my blog post on boosting for more details about boosting in Solr.
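The boost request contains characters such as {, ! and spaces that usually need URL-encoding when sent from code or a command line. Here is a rough sketch of building the encoded URL with the standard library. The class BoostQueryUrl is hypothetical, and the trailing *:* (Solr's match-all query) is an assumption about the main query; substitute your own.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch; the class and method names are made up for this example.
final class BoostQueryUrl {
    // Builds a select URL whose q parameter wraps a match-all query (*:*, an
    // assumption) in a siterank-based boost. Requires Java 10+ for
    // URLEncoder.encode(String, Charset).
    static String build(String solrBase, String field) {
        String q = "{!boost b=exp(siterank(" + field + "))}*:*";
        return solrBase + "/select?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("http://mySite:8983/solr", "myUrlOrHostField"));
    }
}
```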

Future work

In the future, I'd like to expand this module to output rankings from sources other than Alexa, including custom config files.

Compile it yourself

Download the code and do:

mvn install
