An implementation of the Hyperlink-Induced Topic Search (HITS) algorithm in Python 3.
HITS ranks web pages in response to a user's query. Like PageRank, it scores pages by analyzing the links between them.
HITS operates on a web graph. In this implementation, the graph is expressed as three files:
- inlinks.json: a JSON file that maps a URL to its inlink URLs.
- outlinks.json: a JSON file that maps a URL to its outlink URLs.
- docno_list.json: a JSON file that contains all URLs.
The web graph I crawled with my crawler is here.
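As a rough illustration of how the graph files relate to each other, the sketch below derives an inlink map from an outlink map. It assumes each map is a JSON object of the form URL -> list of URLs and that docno_list.json is a plain JSON array; the exact layout of the real files may differ. The URLs and the `invert_links` helper are hypothetical.

```python
import json
from collections import defaultdict


def invert_links(outlinks):
    """Build an inlink map from an outlink map (URL -> list of URLs)."""
    inlinks = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].add(source)
    # Sort the sources so the output is deterministic.
    return {url: sorted(sources) for url, sources in inlinks.items()}


# Toy graph with made-up URLs, shaped the way the files are assumed to look.
outlinks = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example"],
    "http://c.example": [],
}
inlinks = invert_links(outlinks)
docno_list = sorted(outlinks)  # assumed: a flat list of every URL

print(json.dumps(inlinks, indent=2))
```

A check like this can be handy for confirming that inlinks.json and outlinks.json describe the same set of edges before running the algorithm.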
To compute HITS, we also need a Root set, root_set.json, and a Base set, base_set.json.
If you use Elasticsearch, you may modify hits_get_root_base.py and config.yaml, then compute the Root set and Base set with
$ python hits_get_root_base.py
The Root set and Base set I obtained from my web graph are here.
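For readers who want a feel for how a Base set grows out of a Root set, here is a minimal sketch of the expansion step described by Kleinberg: add every page a root page links to, plus a capped number of pages that link to it. This is not necessarily what hits_get_root_base.py does. The function name, the `max_inlinks` parameter, and its default are illustrative assumptions, and the Root set is taken as given (for example, the top results of a search query).

```python
def expand_root_set(root_set, inlinks, outlinks, max_inlinks=50):
    """Expand a root set into a base set using the link maps."""
    base = set(root_set)
    for url in root_set:
        # Add every page that the root page links to.
        base.update(outlinks.get(url, []))
        # Add at most max_inlinks pages that link to the root page,
        # so one very popular page cannot flood the base set.
        base.update(list(inlinks.get(url, []))[:max_inlinks])
    return base


# Toy graph with made-up page names: p -> r, q -> r, r -> x.
toy_outlinks = {"p": ["r"], "q": ["r"], "r": ["x"], "x": []}
toy_inlinks = {"r": ["p", "q"], "x": ["r"]}

base_all = expand_root_set(["r"], toy_inlinks, toy_outlinks)
base_capped = expand_root_set(["r"], toy_inlinks, toy_outlinks, max_inlinks=1)
print(sorted(base_all))
print(sorted(base_capped))
```

With the cap set to 1, only the first inlink of `r` is kept, which is why the capped Base set is smaller.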
Once inlinks.json, outlinks.json, docno_list.json, root_set.json, and base_set.json are ready, put them under the info/ directory and run
$ python hits.py
The results are written to the result/ directory, where authority.json contains authority pages and hub.json contains hub pages.
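For orientation, the textbook HITS iteration can be sketched as follows. This is a minimal sketch under assumptions, not a copy of hits.py: it restricts the link maps to the Base set, runs a fixed number of iterations (the `iterations` parameter is a hypothetical choice; a convergence test would also work), and normalizes both score vectors to unit length after each round.

```python
import math


def normalize(scores):
    """Scale scores so the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return scores
    return {page: v / norm for page, v in scores.items()}


def hits(base_set, inlinks, outlinks, iterations=50):
    """Run HITS on the subgraph induced by base_set."""
    pages = set(base_set)
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_authority = {
            p: sum(hub[q] for q in inlinks.get(p, []) if q in pages)
            for p in pages
        }
        # Hub update: sum the updated authority scores of linked pages.
        new_hub = {
            p: sum(new_authority[q] for q in outlinks.get(p, []) if q in pages)
            for p in pages
        }
        authority = normalize(new_authority)
        hub = normalize(new_hub)
    return authority, hub


# Toy graph with made-up page names: a -> c and b -> c.
toy_out = {"a": ["c"], "b": ["c"], "c": []}
toy_in = {"c": ["a", "b"]}

authority, hub = hits({"a", "b", "c"}, toy_in, toy_out)
ranked = sorted(authority, key=authority.get, reverse=True)
print(ranked[0], authority[ranked[0]])
```

In the toy graph, `c` is the only page with inlinks, so it ends up as the sole authority, while `a` and `b` share the hub score equally.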
- Hubs, Authorities, and Communities, Kleinberg, Jon (1999).
- HITS algorithm - Wikipedia.