Fork of partitioned_elias_fano

This is a fork of partitioned_elias_fano. Modifications in this repository fix build breaks on Ubuntu 16.04 LTS and update the build system to avoid generating executables and libraries in the source tree.

Building on Ubuntu 16.04 LTS

Please note that this repository depends on submodules, so it helps to clone with the --recursive flag.

sudo apt-get install libboost-all-dev
git clone --recursive git@github.com:BitFunnel/partitioned_elias_fano.git
cmake . -DCMAKE_BUILD_TYPE=Release
make

Running Performance Test

First generate an index using create_freq_index. One way to create the collection file used as input is to use the 'IndexExporter' class in the mg4j-workbench project. One can use mg4j to generate an index, from say the gov2 corpus, and then export this index and an accompanying query log to format usable by create_freq_index and Runner.

Usage:
  create_freq_index type input output --check

  type     Index type. Choices are ef, single, uniform, opt
           block_optpfor, block_varint, and block_interpolative.
  input    Collection file name. Format described below.
  output   Path to output file.

Use Runner to perform performance measurements. The index parameter to Runneris the output file generated by create_freq_index. Runner runs performance trials at various thread counts and outputs summary data along with detailed, query-by-query statistics.

Usage:
  Runner type index queries maxthreads [output]

  type     Index type. Choices are ef, single, uniform, opt
           block_optpfor, block_varint, and block_interpolative.
  index    Index file name.
  queries  Query log file name. One query per line.
           Terms expressed as non-negative integers.
  threads  Run performance trials with thread counts in [1..threads].
  output   Base name for output files containing performance
           measurements for various thread counts. If output is
           omitted, summary data will be printed to stdout.

Original README.md follows.

partitioned_elias_fano

NOTE: This repository is maintained only for historical reasons. This code is now part of ds2i.

This repository contains the code used for the experiments in the paper

Giuseppe Ottaviano and Rossano Venturini, Partitioned Elias-Fano Indexes, ACM SIGIR 2014.

Building the code

The code is tested on Linux with GCC 4.9 and OSX Mavericks with Clang.

The following dependencies are needed for the build.

CMake >= 2.8, for the build system
Boost >= 1.42

The code depends on several git submodules. If you have cloned the repository without --recursive, you will need to perform the following commands before building:

$ git submodule init
$ git submodule update

To build, it should be sufficient to do:

$ cmake . -DCMAKE_BUILD_TYPE=Release
$ make

It is also preferable to perform a make test, which runs the unit tests.

Running the experiments

The directory test/test_data contains a small document collection used in the unit tests. The binary format of the collection is described in the next section.

To create an index use the command create_freq_index. The available index types are listed in index_types.hpp. For example, to create an index using the optimal partitioning algorithm using the test collection, execute the command:

$ ./create_freq_index opt test/test_data/test_collection test_collection.index.opt --check

where test/test_data/test_collection is the basename of the collection, that is the name without the .{docs,freqs,sizes} extensions, and test_collection.index.opt is the filename of the output index. --check perform a verification step to check the correctness of the index.

To perform BM25 queries it is necessary to build an additional file containing the parameters needed to compute the score, such as the document lengths. The file can be built with the following command:

$ ./create_wand_data test/test_data/test_collection test_collection.wand

Now it is possible to query the index. The command queries parses each line of the standard input as a tab-separated collection of term-ids, where the i-th term is the i-th list in the input collection. An example set of queries is again in test/test_data.

$ ./queries opt test_collection.index.opt test_collection.wand < test/test_data/queries

Collection input format

A binary sequence is a sequence of integers prefixed by its length, where both the sequence integers and the length are written as 32-bit little-endian unsigned integers.

A collection consists of 3 files, <basename>.docs, <basename>.freqs, <basename>.sizes.

<basename>.docs starts with a singleton binary sequence where its only integer is the number of documents in the collection. It is then followed by one binary sequence for each posting list, in order of term-ids. Each posting list contains the sequence of document-ids containing the term.
basename.freqs is composed of a one binary sequence per posting list, where each sequence contains the occurrence counts of the postings, aligned with the previous file (note however that this file does not have an additional singleton list at its beginning).
basename.sizes is composed of a single binary sequence whose length is the same as the number of documents in the collection, and the i-th element of the sequence is the size (number of terms) of the i-th document.

Authors

Giuseppe Ottaviano giuott@gmail.com
Rossano Venturini rossano@di.unipi.it

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
FastPFor @ b48e23d		FastPFor @ b48e23d
Runner		Runner
integer_encoding_library @ 985cded		integer_encoding_library @ 985cded
succinct @ ea74747		succinct @ ea74747
test		test
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
all_ones_sequence.hpp		all_ones_sequence.hpp
binary_collection.hpp		binary_collection.hpp
binary_freq_collection.hpp		binary_freq_collection.hpp
bitvector_collection.hpp		bitvector_collection.hpp
block_codecs.cpp		block_codecs.cpp
block_codecs.hpp		block_codecs.hpp
block_freq_index.hpp		block_freq_index.hpp
block_posting_list.hpp		block_posting_list.hpp
bm25.hpp		bm25.hpp
compact_elias_fano.hpp		compact_elias_fano.hpp
compact_ranked_bitvector.hpp		compact_ranked_bitvector.hpp
configuration.hpp		configuration.hpp
create_freq_index.cpp		create_freq_index.cpp
create_wand_data.cpp		create_wand_data.cpp
freq_index.hpp		freq_index.hpp
global_parameters.hpp		global_parameters.hpp
index_types.hpp		index_types.hpp
indexed_sequence.hpp		indexed_sequence.hpp
integer_codes.hpp		integer_codes.hpp
optimal_partition.hpp		optimal_partition.hpp
partitioned_sequence.hpp		partitioned_sequence.hpp
positive_sequence.hpp		positive_sequence.hpp
queries.cpp		queries.cpp
queries.hpp		queries.hpp
query_log_runner.cpp		query_log_runner.cpp
query_log_runner.h		query_log_runner.h
semiasync_queue.hpp		semiasync_queue.hpp
sequence_collection.hpp		sequence_collection.hpp
strict_elias_fano.hpp		strict_elias_fano.hpp
strict_sequence.hpp		strict_sequence.hpp
uniform_partitioned_sequence.hpp		uniform_partitioned_sequence.hpp
util.hpp		util.hpp
wand_data.hpp		wand_data.hpp

License

BitFunnel/partitioned_elias_fano

Folders and files

Latest commit

History

Repository files navigation

Fork of partitioned_elias_fano

Building on Ubuntu 16.04 LTS

Running Performance Test

partitioned_elias_fano

Building the code

Running the experiments

Collection input format

Authors

About

Resources

License

Stars

Watchers

Forks

Languages