Skip to content

Latest commit

 

History

History

markov-chain-generator

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 

Markov Chain Generator

Markov Chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

This application calculates word transition probabilities from the classical books and stores those in a Hazelcast IMap.

Then they are used to create sentences of specified length.

To learn more about Markov Chains, see: Wikipedia Article

Package Level Structure

The main class (MarkovChainGenerator) with the main method resides in the default package.

Data Pipeline

There is almost 100 MB of classical literature on the input, see the resources. Jet pipeline builds the Markov Chain based on this data set. The Markov Chain is then used to generate random sentences such as

Very tall masts went on any kind for doing that they transmit by an oversight.

As you can see, the sentence is meaningless. It's structure is however similar to an English sentence as the Markov chain was "trained" on real English texts.

Jet makes the computation parallel.

In the first step, Jet reads the books stored in text files and converts single words to pairs. So for The Picture of Dorian Gray we get following pairs representing the word to word transitions:

(The, Picture), (Picture, of), (of, Dorian), (Dorian, Gray)

To compute the probability of finding word B after A, we have to know how many pairs contain word A as a fist entry and how many of them contain B as a second entry. When we compute this for all (A,B) pairs appearing in the source text we get the transition matrix representing the Markov Chain.

The aggregation features of Jet are used to do the heavy lifting. Jet runs the computation in parallel to make use of all available processor cores.

The computed Markov Chain is stored to an IMap (distributed map implementation available in Jet) and printed to standard output. See the probabilities for the word organizations:

Transitions for: organizations
/-------------+-------------\
| Probability | Word        |
|-------------+-------------|
|  0,2500     | with        |
|  0,5000     | can         |
|  0,7500     | so          |
|  1,0000     | did         |
\-------------+-------------/

This Markov Chain is then used to generate a sentence. Starting from . word, next word is iteratively selected based on the probability of appearing after a previous one. The sentence is printed to standard output.

Building the Application

To build and package the application, run:

mvn clean package

Running the Application

After building the application, run it with:

mvn exec:java