miniaturist

This project provides implementations of a Latent Dirichlet Allocation algorithm found here. Latent Dirichlet Allocation is colloquially called "topic modeling". This project offers topic modeling capabilities that work as a sequential program, a parallel (threaded) program, and a distributed parallel program that can be executed on High Performance Computing (HPC)/Supercomputing systems or a Cloud.

The following implementations are provided:

lda - the sequential implementation
parlda - the parallel implementation for multi-core systems
distparlda - distributed, parallel, implementation for High Performance Computers (HPC)/Supercomputers
distparldahdfs - distributed, parallel, implementation for Clouds (Hadoop filesystem, HDFS)

Optional extensions provided:

A plugin for the Phylanx distributed array toolkit
Python bindings (modules: pylda, pyparlda, pydistparlda)

The following tools are provided:

vocab - a sequential program to compute the vocabulary set found in all documents of a corpus
distvocab - a distributed memory program to compute the vocabulary set found in all documents of a corpus
distvocabhdfs - a distributed memory program to compute the vocabulary set found in all documents of a corpus on HDFS

The implementation uses a modified version of the collapsed gibbs sampler as defined by Newman, Asuncion, Smyth, and Welling.

Modifications have been made to the Newman, et al treatment in an effort to utilize term-document-matrices as the storage structure for document histograms.

This implementation only scales in the direction of documents. Users are required to provide a vocabulary list in order to make use of this implementation. Vocabulary lists can be populated programmatically or loaded into a modeling program with a file containing a 'new-line' delimited list of words.

The vocabulary building tools print out a set of words encountered during 1 linear traversal of the documents. All vocabulary building tools print results to stdout (the terminal).

The distributed vocabulary building tool prints to the stdout of each machine it is running upon; it is suggested that users pipe the output of the distributed vocabulary building tool to a distributed filesystem using a filename that is: unique to the locality identifier (integer) of the program instance, or into a remote /tmp directory that is accessible for a file copy (scp).

How To Build The Container

Container users should use the following options. The container is Ubuntu 20.04 based and uses some packages from 'universe'.

sudo singularity build miniaturist.sif miniaturist.def
singularity build --fakeroot miniaturist.sif miniaturist.def

The container uses wget to download a binary build of cmake from kitware's website and the OTF2 source from VI-HPS. Other software dependencies are git cloned from their respective source repositories and compiled into the container.

How To Build From Source

This project requires using cmake. cmake requires creating a directory called 'build'. Users will change directory into 'build'. At this point the user will need to type something like the following:

cmake -Dblaze_DIR=<PATH_TO_BLAZE_CMAKEFILE> -DHPX_DIR=<PATH_TO_HPX_CMAKEFILE> ..

These are possible directories where the blaze and hpx cmakefiles can be found:

PATH_TO_BLAZE_CMAKEFILE=/usr/share/blaze/cmake
PATH_TO_HPX_CMAKEFILE=/usr/lib/cmake/HPX

Add the following for Cloud (Hadoop File System - HDFS) support:

cmake -Dblaze_DIR=<PATH_TO_BLAZE_CMAKEFILE> -DHPX_DIR=<PATH_TO_HPX_CMAKEFILE> -Dlibhdfs3_DIR=<PATH_TO_LIBHDFS3_INSTALL> ..

This is a possible location for libhdfs3.so and hdfs/hdfs.h:

PATH_TO_LIBHDFS3_INSTALL=/usr/

Add the following for Python bindings to be built. To inform cmake where pybind11 is installed, use the -Dpybind11_DIR=<PATH_TO_PYBIND11_CMAKEFILE> flag.

cmake -Dblaze_DIR=<PATH_TO_BLAZE_CMAKEFILE> -DHPX_DIR=<PATH_TO_HPX_CMAKEFILE> -Dpybind11_DIR=<PATH_TO_PYBIND11_CMAKEFILE>

Here is a possible directory where the pybind11 cmakefiles are located:

PATH_TO_PYBIND11_CMAKEFILE=/usr/share/cmake/pybind11

How To Use

Topic Modeling Program names:

lda, single node, sequential (no threads), implementation
parlda, single node, parallel, implementation
distparlda, distributed (multi-node), parallel, implementation
distparldahdfs, distributed (multi-node), parallel, HDFS implementation

Command line arguments for all topic modeling programs:

--num_topics=[enter an unsigned integer value for number of topics], required
--vocab_list=[enter a valid path to the file containing the vocabulary list], required
--corpus_dir=[enter a valid path to the directory containing the training corpus], required
--regex=[enter a regular expression], default [\p{L}\p{M}]+
--num_iters=[enter an unsigned integer value for iterations], default 1000
--alpha=[enter a floating point number for alpha prior], default 0.1
--beta=[enter a floating point number for beta prior], default 0.01

Additional command line arguments for parlda:

--hpx:threads=[enter an unsigned integer value for number of threads], optional
--hpx:numa-sensitive=1, runtime system's thread scheduler considers numa domains, optional

Additional command line arguments for distparlda:

--hpx:threads=[enter an unsigned integer value for number of threads], optional
--hpx:numa-sensitive=1, runtime system's thread scheduler considers numa domains, optional
--hpx:nodes=[enter an unsigned integer value for number of threads], optional

Additional command line arguments for distparldahdfs:

--hpx:threads=[enter an unsigned integer value for number of threads], optional
--hpx:numa-sensitive=1, runtime system's thread scheduler considers numa domains, optional
--hpx:nodes=[enter an unsigned integer value for number of threads], optional
--hdfs_namenode_address=[enter string], required
--hdfs_namenode_port=[unsigned integer for hdfs namenode port], required
--hdfs_buffer_size=[unsigned integer buffer size for file reads from hdfs], default 1024
--hdfs_block_size=[unsigned integer buffer size for file writes to hdfs], default 1024

Vocabulary Building Program names:

vocab, single node, sequential (no thread), vocabulary builder
distvocab, distributed, sequential (no thread), vocabulary builder
distvocabhdfs, distributed, sequential (no thread), vocabulary builder for HDFS

Command line arguments for all vocabulary programs:

--corpus_dir=[enter a valid path to the directory containing the training corpus], required
--regex=[enter a regular expression], default [\p{L}\p{M}]+, optional
--filter=[unsigned integer frequency count above which vocabulary words are printed out], optional
--histogram, print out the global count of each word (default off), optional

Additional command line arguments for distvocabhdfs:

--hdfs_namenode_address=[enter string], required
--hdfs_namenode_port=[unsigned integer for hdfs namenode port], required
--hdfs_buffer_size=[unsigned integer buffer size for file reads from hdfs], default 1024
--hdfs_block_size=[unsigned integer buffer size for file writes to hdfs], default 1024

Topic Modeling Libraries:

libldalib.a, ldalib.hpp single node, sequential (no threads), implementation
libparldalib.a, parldalib.hpp single node, parallel, implementation
libdistparldalib.a, distparldalib.hpp distributed (multi-node), parallel, implementation

The Python bindings requires users to type the following in python3.8:

from pylda import lda

How To Use Container

Use the following command line arguments to print out the command line arguments enumerated above for each program in the container.

singularity help miniaturist.sif
singularity help --app vocab miniaturist.sif
singularity help --app distvocab miniaturist.sif
singularity help --app distvocabhdfs miniaturist.sif
singularity help --app lda miniaturist.sif
singularity help --app parlda miniaturist.sif
singularity help --app distparlda miniaturist.sif
singularity help --app distparldahdfs miniaturist.sif

Implementation Notes

This implementation loads the corpus into an inverted index. The inverted index is converted into a sparse matrix that is transposed into a document-term matrix. This step is required to permit the parallel processing of the corpus by document. The implementation processes subsets of the document-term matrix in parallel chunks. A sparse matrix is used to store the document-term matrix to minimize a required O(N^2) algorithmic cost spent traversing the matrix.

The most time-consuming portion of each implementation is the creation of the sparse matrix which stores the document-term matrix. To accelerate the creaton of the matrix, it is strongly encouraged that a user spends a fair amount of time studying the corpus to identify a reasonably sized vocabulary set. The larger the vocabulary set the sparser the document-term matrix becomes. There is a direct correlation between each implementation's performance and the size of the vocabulary set.

Usage Notes

If you use the vocabulary building tools with a specific regular expression in mind, make sure to use the same regular expression when invoking lda, parlda, or distparlda. Consistent use of regular expressions parsing text when building the vocabulary and when modeling is important for successful program execution.

HPX Compilation Flags

Take time to review the following build options for HPX here. Below are a short list of recommended options.

For Cloud Environments:

HPX_WITH_PARCELPORT_TCP=ON
HPX_WITH_FAULT_TOLERANCE=ON

For HPC Environments w/MPI:

HPX_WITH_PARCELPORT_MPI=ON

For HPC Environments w/libfabric:

HPX_WITH_PARCELPORT_LIBFABRIC=ON
HPX_WITH_PARCELPORT_LIBFABRIC_PROVIDER=[gni,verbs,psm2,etc]

Dynamic runtime instrumentation for Cloud and/or HPC:

HPX_WITH_APEX=ON (requires APEX installation)

For OpenMP enabled LAPACK, BLAS, Blaze implementations:

hpxMP - provides HPX/APEX the ability to manage OpenMP

Licenses

The Jump Consistency Hash source code was taken from a blog post with no license provided. The blog post and author are referenced in a comment at the top of the file 'jch.hpp'.
The drand48 logic comes from musl-libc. The source code is MIT Licensed.
The remainder of the source code in this project is Boost Licensed and the license terms can be found in the file 'LICENSE'.

Dependencies

C++17
STE||AR HPX
Blaze
ICU
LAPACK
BLAS
OpenSSL
pkg-config
cmake >= 3.17

Optional Dependencies

APEX
hpxMP
Phylanx
pybind11 (Python support)
libhdfs3 (Hadoop Filesystem/HDFS support)
singularity

Special Thanks

STE||AR Group
Phylanx
Blaze C++ Library (Blaze)
Erlend Hamberg (Jump Consistency Hash)
Erik Muttersbach (libhdfs3)

References

D. Newman, A. Asuncion, P. Smyth, M. Welling. "Distributed Algorithms for Topic Models." JMLR 2009.
H. Kaiser, M. Brodowicz and T. Sterling: ParalleX: An Advanced Parallel Execution Model for Scaling-Impaired Applications, International Conference on Parallel Processing Workshops (2009 – Los Alamos, California).
Kevin Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler. An Autonomic Performance Environment for Exascale. Supercomputing frontiers and innovations, 2.3 (2015).
Jeremy Kemp; Tianyi Zhang; Shahrzad Shirzad; Bryce Adelstein Lelbach aka wash; Hartmut Kaiser; Bibek Wagle; Parsa Amini; Alireza Kheirkhahan. "hpxMP v0.3.0: An OpenMP runtime implemented using HPX".

Author

Christopher Taylor

Date

07/03/2021

Name		Name	Last commit message	Last commit date
Latest commit History 130 Commits
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
distparlda.cpp		distparlda.cpp
distparldahdfs.cpp		distparldahdfs.cpp
distparldalib.cpp		distparldalib.cpp
distparldalib.hpp		distparldalib.hpp
distvocab.cpp		distvocab.cpp
distvocabhdfs.cpp		distvocabhdfs.cpp
documents.cpp		documents.cpp
documents.hpp		documents.hpp
drand.hpp		drand.hpp
gibbs.cpp		gibbs.cpp
gibbs.hpp		gibbs.hpp
hdfs_support.cpp		hdfs_support.cpp
hdfs_support.hpp		hdfs_support.hpp
inverted_index.hpp		inverted_index.hpp
inverted_index_serialize.hpp		inverted_index_serialize.hpp
jch.cpp		jch.cpp
jch.hpp		jch.hpp
lda.cpp		lda.cpp
ldalib.cpp		ldalib.cpp
ldalib.hpp		ldalib.hpp
miniaturist.def		miniaturist.def
miniaturist.spdx		miniaturist.spdx
parlda.cpp		parlda.cpp
parldalib.cpp		parldalib.cpp
parldalib.hpp		parldalib.hpp
pylda.cpp		pylda.cpp
results.cpp		results.cpp
results.hpp		results.hpp
serialize.hpp		serialize.hpp
vocab.cpp		vocab.cpp

License

ct-clmsn/miniaturist

Folders and files

Latest commit

History

Repository files navigation

How To Build The Container

How To Build From Source

How To Use

How To Use Container

Implementation Notes

Usage Notes

HPX Compilation Flags

Licenses

Dependencies

Optional Dependencies

Special Thanks

References

Author

Date

About

Topics

Resources

License

Stars

Watchers

Forks

Languages