Sam Taieb edited this page Jun 13, 2019 · 6 revisions

Theoretical Foundation

LSTM (Long Short-Term Memory), first introduced in 1997, is a special type of RNN that remedies the vanishing gradient problem. It replaces plain neurons with memory cells to capture long-term dependencies. LSTMs have a structure similar to RNNs, but instead of a single activation layer in the repeating unit, they have four neural network layers interacting with each other.

Layers in the LSTM architecture consist of recurrent computational units called memory blocks or memory cells. Each memory cell has four components: an input gate, a forget gate, an output gate, and a cell state. This structure lets the activation from the previous timestamp serve as context for formulating a more accurate output, and it is this design that allows LSTM to overcome the vanishing gradient problem.

LSTM cell

LSTM gates:

The idea behind a gate is to learn when to open and close access to the error flow within each memory cell. The sigmoid function squashes its input to values between 0 and 1, which makes it possible to decide whether to keep or discard a feature. By multiplying the gate output point-wise with a state, information is deleted (gate output = 0) or kept (gate output = 1). When the output of the gate is zero, the gate blocks everything; an output of one lets everything pass through.

1. Forget gate (reset):

Has a one-layer neural network and decides what information to discard from the cell. Its inputs are h(t-1) (the output of the previous LSTM block) and x(t) (the current input). x(t) and h(t-1) are concatenated into one vector and squashed by a sigmoid function, so every entry is mapped to a value between 0 and 1. The resulting gate vector is then multiplied point-wise with the old memory state C(t-1), scaling down (forgetting) the parts of the old memory that are no longer needed.

2. Input gate (write):

Also called the ignoring gate, it decides which values from the input will update the memory state. The concatenated vector of the previous hidden state h(t-1) and the new input x(t) goes through sigmoid and hyperbolic tangent layers: the sigmoid decides which values to update, the tanh produces the candidate values, and their point-wise product is added to the memory state.

3. Output gate (read):

Decides the cell output based on the input and the memory of the cell. This gate is responsible for generating the final output of the LSTM unit: the concatenated vector of x(t) and h(t-1) is squashed again by a sigmoid function, and the result is called the "selected value." It is multiplied point-wise with the new memory state, which has been squashed through a tanh function (the new state candidate value). The final product of this process is the new hidden state h(t).
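The gate equations described above can be sketched as a single LSTM time-step in plain numpy. This is a toy illustration, not the project's implementation; the weight layout and all names here are assumptions made for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time-step. W maps the concatenated [h_prev, x_t] vector
    to the four stacked gate pre-activations (forget, input, candidate, output)."""
    z = np.concatenate([h_prev, x_t])   # concatenate h(t-1) and x(t)
    n = h_prev.shape[0]
    a = W @ z + b
    f = sigmoid(a[0:n])        # forget gate: 0 = erase, 1 = keep
    i = sigmoid(a[n:2*n])      # input gate: which candidates to write
    g = np.tanh(a[2*n:3*n])    # new state candidate values
    o = sigmoid(a[3*n:4*n])    # output gate: "selected value"
    c = f * c_prev + i * g     # point-wise merge of old and new memory
    h = o * np.tanh(c)         # new hidden state h(t)
    return h, c

# toy example: hidden size 2, one input feature per time-step
rng = np.random.default_rng(0)
h, c = lstm_step(rng.standard_normal(1), np.zeros(2), np.zeros(2),
                 rng.standard_normal((8, 3)), np.zeros(8))
```

Because h(t) is the point-wise product of a sigmoid output and a tanh output, every entry of the hidden state stays strictly between -1 and 1.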


Implementation

1. Data Base conversion:

The database was imported and transformed into .csv files using DB Browser for SQLite (https://sqlitebrowser.org/). The database can be found here: https://nextcloud.os.in.tum.de/s/panda_v3


2. Import the project from Github:


Copy the extracted .csv files into the same directory where the project is located.

Indexing will take a couple of minutes. After it is done, install these libraries by typing pip install &lt;library&gt;==&lt;version&gt; into the terminal:

  • cuDNN 7.5 (only if using GPU)
  • Keras 2.2.4
  • Pandas 0.24.2
  • Scikit-learn 0.20.3
  • tqdm 4.31.01
  • CUDA 10.1 (only if using GPU)
  • Tensorflow-gpu 1.13.1 (when using CPU, install Tensorflow 1.13.1)
  • graphviz 2.40
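Equivalently, the pip-installable entries above can be pinned in a requirements.txt file (this file is not part of the project; it is just a sketch of the same version list):

```
# GPU users also need CUDA 10.1 and cuDNN 7.5, installed
# outside pip via NVIDIA's installers
Keras==2.2.4
pandas==0.24.2
scikit-learn==0.20.3
tqdm==4.31.1
tensorflow-gpu==1.13.1   # use tensorflow==1.13.1 on CPU-only machines
graphviz                 # Python binding; Graphviz 2.40 itself is a system package
```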

3. Extract both features and labels:

Run the Data_preparation.py file to extract both features and labels. This process takes on average 90 to 130 hours, depending on the vector length and the CPU specifications. In the end, two files (56_features and 56_labels) will be saved. Note that the length of the feature and label vectors can be changed by editing the value of maxlen in line 165.

4. Network definition

LSTM’s input shape argument expects a three-dimensional array in this order: samples, time-steps, and features. That is why a new dimension must be added to the numpy array: the two-dimensional features array is expanded by assigning the time-steps to the first dimension and the features to the second. In this context, fifty-six time-steps with one feature at each time-step are written as (56, 1). As illustrated in the figure below, each batch consists of a predefined number (n) of task-sets. At the first time-step, the first feature of the first batch is relayed to the respective cell (cell 1) in the layer. Analogously, the 2nd, 3rd, ..., 56th features are then sent from the respective batches at the respective time-steps. The number of samples is left undefined so that the model is not constrained to the dimensions of the current data-set. The data-set was split into a training set (70%) and a test set (30%) for validation.
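The reshape and split described above can be sketched as follows; the random arrays here merely stand in for the saved 56_features and 56_labels files:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical stand-ins for the extracted feature/label files
features = np.random.rand(100, 56)            # (samples, 56 time-steps)
labels = np.random.randint(0, 2, size=100)    # binary labels

# LSTM expects (samples, time-steps, features): add a trailing
# feature axis so each time-step carries exactly one feature
X = np.expand_dims(features, axis=-1)         # -> (100, 56, 1)

# 70/30 training/test split, as in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42)
```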

The graphical representation of the model

To train the model, run CuDNNLSTM.py. When using a CPU, CuDNNLSTM should be replaced by LSTM:
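A minimal sketch of such a model in the Keras Sequential API is shown below. The layer sizes are illustrative placeholders, not the tuned values from the project; only the (56, 1) input shape is taken from the text:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# illustrative model matching the (56, 1) input shape from the text;
# on a GPU, LSTM would be swapped for CuDNNLSTM
model = Sequential([
    LSTM(100, input_shape=(56, 1)),       # 56 time-steps, 1 feature each
    Dense(1, activation='sigmoid'),       # binary label per task-set
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```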


By the end of this step, both the trained model and its log file will be saved. To visualize the results using tensorboard, type tensorboard --logdir=logs/ into the terminal:


5. Plotting the Model:

By running Plotting.py, a model.jpg picture will be generated to give a better understanding of the built model.

6. Evaluation:

After training the model, its accuracy must be evaluated by running Evaluation.py. The evaluation usually happens on the test split. The Scikit-learn framework provides three APIs to evaluate the quality of the model's predictions: the estimator score method, the scoring parameter, and metric functions. Metric functions were implemented here, and several performance metrics were measured. One way to illustrate the estimates explicitly is the confusion matrix. The matrix tabulates negative vs. positive outcomes; here, positive/negative indicates whether the result of a task-set has been correctly predicted.
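A confusion matrix can be computed with scikit-learn's metric functions as sketched below; the label vectors here are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, classification_report

# hypothetical actual vs. predicted task-set labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# rows = actual class, columns = predicted class:
# cm[0][0] = true negatives, cm[1][1] = true positives
print(cm)
print(classification_report(y_true, y_pred))
```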

7. Prediction:

By running prediction.py on the saved model, a Predicion_results.csv file will be generated with three columns: TaskSet_ID, actual value, and predicted value.

8. Optimization:

Dealing with overfitting is important because overfitting decreases prediction accuracy, and multiple steps were taken to address it in this model. The success of a neural network model depends on several factors working together to tune the model:

  1. Model structure
  2. The data presented
  3. Optimization parameters and hyperparameters

Only the first and third points were considered. In the parallel_search.py file, the easy-to-implement and widely used parallel (grid) search approach is applied by generating a grid of different combinations of parameter values. Each combination yields a specific model, which is saved along with its log file containing classification reports and graph structures. Each model is then evaluated by visualizing the results in TensorBoard, and the combination that achieves the best result is selected. The following parameters were adjusted:

  • The number of LSTM layers.
  • The size of each LSTM layer.
  • The number of dense layers.
  • The size of each dense layer.
  • The number of epochs: set to 200, but early stopping (i.e., training stops if the validation loss starts to increase again) is implemented.
  • Batch size.
  • Dropout ratio.
  • Learning rate.

All combinations are saved and exported as a .csv file after each run.