Skip to content

latent dirichlet allocation (topic modeling) implementations for hpc and cloud systems

License

Notifications You must be signed in to change notification settings

ct-clmsn/miniaturist

Repository files navigation

This project provides implementations of a Latent Dirichlet Allocation algorithm found here. Latent Dirichlet Allocation is colloquially called "topic modeling". This project offers topic modeling capabilities that work as a sequential program, a parallel (threaded) program, and a distributed parallel program that can be executed on High Performance Computing (HPC)/Supercomputing systems or a Cloud.

The following implementations are provided:

  • lda - the sequential implementation
  • parlda - the parallel implementation for multi-core systems
  • distparlda - distributed, parallel, implementation for High Performance Computers (HPC)/Supercomputers
  • distparldahdfs - distributed, parallel, implementation for Clouds (Hadoop filesystem, HDFS)

Optional extensions provided:

  • A plugin for the Phylanx distributed array toolkit
  • Python bindings (modules: pylda, pyparlda, pydistparlda)

The following tools are provided:

  • vocab - a sequential program to compute the vocabulary set found in all documents of a corpus
  • distvocab - a distributed memory program to compute the vocabulary set found in all documents of a corpus
  • distvocabhdfs - a distributed memory program to compute the vocabulary set found in all documents of a corpus on HDFS

The implementation uses a modified version of the collapsed gibbs sampler as defined by Newman, Asuncion, Smyth, and Welling.

Modifications have been made to the Newman, et al treatment in an effort to utilize term-document-matrices as the storage structure for document histograms.

This implementation only scales in the direction of documents. Users are required to provide a vocabulary list in order to make use of this implementation. Vocabulary lists can be populated programmatically or loaded into a modeling program with a file containing a 'new-line' delimited list of words.

The vocabulary building tools print out a set of words encountered during 1 linear traversal of the documents. All vocabulary building tools print results to stdout (the terminal).

The distributed vocabulary building tool prints to the stdout of each machine it is running upon; it is suggested that users pipe the output of the distributed vocabulary building tool to a distributed filesystem using a filename that is: unique to the locality identifier (integer) of the program instance, or into a remote /tmp directory that is accessible for a file copy (scp).

How To Build The Container

Container users should use the following options. The container is Ubuntu 20.04 based and uses some packages from 'universe'.

  • sudo singularity build miniaturist.sif miniaturist.def
  • singularity build --fakeroot miniaturist.sif miniaturist.def

The container uses wget to download a binary build of cmake from kitware's website and the OTF2 source from VI-HPS. Other software dependencies are git cloned from their respective source repositories and compiled into the container.

How To Build From Source

This project requires using cmake. cmake requires creating a directory called 'build'. Users will change directory into 'build'. At this point the user will need to type something like the following:

  • cmake -Dblaze_DIR=<PATH_TO_BLAZE_CMAKEFILE> -DHPX_DIR=<PATH_TO_HPX_CMAKEFILE> ..

These are possible directories where the blaze and hpx cmakefiles can be found:

  • PATH_TO_BLAZE_CMAKEFILE=/usr/share/blaze/cmake
  • PATH_TO_HPX_CMAKEFILE=/usr/lib/cmake/HPX

Add the following for Cloud (Hadoop File System - HDFS) support:

  • cmake -Dblaze_DIR=<PATH_TO_BLAZE_CMAKEFILE> -DHPX_DIR=<PATH_TO_HPX_CMAKEFILE> -Dlibhdfs3_DIR=<PATH_TO_LIBHDFS3_INSTALL> ..

This is a possible location for libhdfs3.so and hdfs/hdfs.h:

  • PATH_TO_LIBHDFS3_INSTALL=/usr/

Add the following for Python bindings to be built. To inform cmake where pybind11 is installed, use the -Dpybind11_DIR=<PATH_TO_PYBIND11_CMAKEFILE> flag.

  • cmake -Dblaze_DIR=<PATH_TO_BLAZE_CMAKEFILE> -DHPX_DIR=<PATH_TO_HPX_CMAKEFILE> -Dpybind11_DIR=<PATH_TO_PYBIND11_CMAKEFILE>

Here is a possible directory where the pybind11 cmakefiles are located:

  • PATH_TO_PYBIND11_CMAKEFILE=/usr/share/cmake/pybind11

How To Use

Topic Modeling Program names:

  • lda, single node, sequential (no threads), implementation
  • parlda, single node, parallel, implementation
  • distparlda, distributed (multi-node), parallel, implementation
  • distparldahdfs, distributed (multi-node), parallel, HDFS implementation

Command line arguments for all topic modeling programs:

  • --num_topics=[enter an unsigned integer value for number of topics], required
  • --vocab_list=[enter a valid path to the file containing the vocabulary list], required
  • --corpus_dir=[enter a valid path to the directory containing the training corpus], required
  • --regex=[enter a regular expression], default [\p{L}\p{M}]+
  • --num_iters=[enter an unsigned integer value for iterations], default 1000
  • --alpha=[enter a floating point number for alpha prior], default 0.1
  • --beta=[enter a floating point number for beta prior], default 0.01

Additional command line arguments for parlda:

  • --hpx:threads=[enter an unsigned integer value for number of threads], optional
  • --hpx:numa-sensitive=1, runtime system's thread scheduler considers numa domains, optional

Additional command line arguments for distparlda:

  • --hpx:threads=[enter an unsigned integer value for number of threads], optional
  • --hpx:numa-sensitive=1, runtime system's thread scheduler considers numa domains, optional
  • --hpx:nodes=[enter an unsigned integer value for number of threads], optional

Additional command line arguments for distparldahdfs:

  • --hpx:threads=[enter an unsigned integer value for number of threads], optional
  • --hpx:numa-sensitive=1, runtime system's thread scheduler considers numa domains, optional
  • --hpx:nodes=[enter an unsigned integer value for number of threads], optional
  • --hdfs_namenode_address=[enter string], required
  • --hdfs_namenode_port=[unsigned integer for hdfs namenode port], required
  • --hdfs_buffer_size=[unsigned integer buffer size for file reads from hdfs], default 1024
  • --hdfs_block_size=[unsigned integer buffer size for file writes to hdfs], default 1024

Vocabulary Building Program names:

  • vocab, single node, sequential (no thread), vocabulary builder
  • distvocab, distributed, sequential (no thread), vocabulary builder
  • distvocabhdfs, distributed, sequential (no thread), vocabulary builder for HDFS

Command line arguments for all vocabulary programs:

  • --corpus_dir=[enter a valid path to the directory containing the training corpus], required
  • --regex=[enter a regular expression], default [\p{L}\p{M}]+, optional
  • --filter=[unsigned integer frequency count above which vocabulary words are printed out], optional
  • --histogram, print out the global count of each word (default off), optional

Additional command line arguments for distvocabhdfs:

  • --hdfs_namenode_address=[enter string], required
  • --hdfs_namenode_port=[unsigned integer for hdfs namenode port], required
  • --hdfs_buffer_size=[unsigned integer buffer size for file reads from hdfs], default 1024
  • --hdfs_block_size=[unsigned integer buffer size for file writes to hdfs], default 1024

Topic Modeling Libraries:

  • libldalib.a, ldalib.hpp single node, sequential (no threads), implementation
  • libparldalib.a, parldalib.hpp single node, parallel, implementation
  • libdistparldalib.a, distparldalib.hpp distributed (multi-node), parallel, implementation

The Python bindings requires users to type the following in python3.8:

from pylda import lda

How To Use Container

Use the following command line arguments to print out the command line arguments enumerated above for each program in the container.

  • singularity help miniaturist.sif
  • singularity help --app vocab miniaturist.sif
  • singularity help --app distvocab miniaturist.sif
  • singularity help --app distvocabhdfs miniaturist.sif
  • singularity help --app lda miniaturist.sif
  • singularity help --app parlda miniaturist.sif
  • singularity help --app distparlda miniaturist.sif
  • singularity help --app distparldahdfs miniaturist.sif

Implementation Notes

This implementation loads the corpus into an inverted index. The inverted index is converted into a sparse matrix that is transposed into a document-term matrix. This step is required to permit the parallel processing of the corpus by document. The implementation processes subsets of the document-term matrix in parallel chunks. A sparse matrix is used to store the document-term matrix to minimize a required O(N^2) algorithmic cost spent traversing the matrix.

The most time-consuming portion of each implementation is the creation of the sparse matrix which stores the document-term matrix. To accelerate the creaton of the matrix, it is strongly encouraged that a user spends a fair amount of time studying the corpus to identify a reasonably sized vocabulary set. The larger the vocabulary set the sparser the document-term matrix becomes. There is a direct correlation between each implementation's performance and the size of the vocabulary set.

Usage Notes

If you use the vocabulary building tools with a specific regular expression in mind, make sure to use the same regular expression when invoking lda, parlda, or distparlda. Consistent use of regular expressions parsing text when building the vocabulary and when modeling is important for successful program execution.

HPX Compilation Flags

Take time to review the following build options for HPX here. Below are a short list of recommended options.

For Cloud Environments:

  • HPX_WITH_PARCELPORT_TCP=ON
  • HPX_WITH_FAULT_TOLERANCE=ON

For HPC Environments w/MPI:

  • HPX_WITH_PARCELPORT_MPI=ON

For HPC Environments w/libfabric:

  • HPX_WITH_PARCELPORT_LIBFABRIC=ON
  • HPX_WITH_PARCELPORT_LIBFABRIC_PROVIDER=[gni,verbs,psm2,etc]

Dynamic runtime instrumentation for Cloud and/or HPC:

  • HPX_WITH_APEX=ON (requires APEX installation)

For OpenMP enabled LAPACK, BLAS, Blaze implementations:

  • hpxMP - provides HPX/APEX the ability to manage OpenMP

Licenses

  • The Jump Consistency Hash source code was taken from a blog post with no license provided. The blog post and author are referenced in a comment at the top of the file 'jch.hpp'.

  • The drand48 logic comes from musl-libc. The source code is MIT Licensed.

  • The remainder of the source code in this project is Boost Licensed and the license terms can be found in the file 'LICENSE'.

Dependencies

  • C++17
  • STE||AR HPX
  • Blaze
  • ICU
  • LAPACK
  • BLAS
  • OpenSSL
  • pkg-config
  • cmake >= 3.17

Optional Dependencies

  • APEX
  • hpxMP
  • Phylanx
  • pybind11 (Python support)
  • libhdfs3 (Hadoop Filesystem/HDFS support)
  • singularity

Special Thanks

References

  • D. Newman, A. Asuncion, P. Smyth, M. Welling. "Distributed Algorithms for Topic Models." JMLR 2009.
  • H. Kaiser, M. Brodowicz and T. Sterling: ParalleX: An Advanced Parallel Execution Model for Scaling-Impaired Applications, International Conference on Parallel Processing Workshops (2009 – Los Alamos, California).
  • Kevin Huck, Allan Porterfield, Nick Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, Rob Fowler. An Autonomic Performance Environment for Exascale. Supercomputing frontiers and innovations, 2.3 (2015).
  • Jeremy Kemp; Tianyi Zhang; Shahrzad Shirzad; Bryce Adelstein Lelbach aka wash; Hartmut Kaiser; Bibek Wagle; Parsa Amini; Alireza Kheirkhahan. "hpxMP v0.3.0: An OpenMP runtime implemented using HPX".

Author

Christopher Taylor

Date

07/03/2021