
Assignment 4

Part 1 - 1D CNN and RNN for sentiment analysis

This is the CNN project by George Barker and Andre Zeromski.

How to run code:

To test accuracy:

  1. Run RNNCNNconda.py
  2. The model we trained, which reaches 89.94% test accuracy, will be evaluated

To train the model:

  1. Comment out lines 53 and 54 (where the saved model is loaded from its pickle)
  2. Uncomment line 60
  3. Run RNNCNNconda.py
  4. Wait for the model to finish training.

Data Processing:

As stated in our IMDB RNN project write-up for data preprocessing: “We imported the IMDB movie reviews dataset with 'from keras.datasets import imdb' and the function imdb.load_data(num_words=None). This returns (x_train, y_train), (x_test, y_test) for the IMDB reviews, with each word replaced by an integer indexing that word by its overall frequency in the dataset (more frequent words get smaller indices). Each review is therefore a sequence of words represented as integers, and each review has a y value of 0 or 1, corresponding to whether the review is negative or positive. The train and test data are split 50/50, so the training set and the test set each contain 25,000 reviews. The num_words parameter lets us specify how many words to keep in the vocabulary, so only the top num_words most frequent words are retained. This is beneficial because it reduces the vocabulary: imported with the assigned Keras function, the IMDB reviews contain about 90,000 unique words, and a vocabulary that large is difficult to model accurately. Zipf's Law states that the frequency of any word is inversely proportional to its rank in the frequency table, so we can drop a large part of the vocabulary by keeping only the most frequently occurring words without removing a significant portion of each review's meaning, since those words account for most of the usage. We started by limiting the vocabulary to 5,000 words, found that limiting it further to 4,000 words increased accuracy, and saw no further benefit from shrinking it beyond that.

We further manipulate the dataset by padding and truncating each review to a fixed length. The mean review length is about 235 words; we capped review length at 800 words because that covers the large majority of reviews without cutting them short. Since the raw inputs vary in length, we standardize them by appending 0s to reviews shorter than the maximum review length and truncating reviews longer than it. This ensures every input fed into the model is the same size: shorter reviews are padded out to 800 words and longer reviews are cut after 800 words.

We create our model by calling Sequential() to initialize our linear stack of layers. Our first layer is the embedding layer: we used the Keras implementation of the embedding layer to convert our integer representation of words into word embeddings. For each word in an input sequence, the embedding layer outputs that word's vector representation. The embedding layer has two required arguments. We set the first argument to the number of words in our vocabulary; since the IMDB dataset represents words as integers ordered by frequency, only indices below the vocabulary size of 4,000 are embedded. The second argument is the desired length of the embedded vector for each word, which we set to 50. We include the optional argument input_length, set to the maximum review length, which specifies the length of the input sequences.

As stated previously, we padded our input sequences of movie reviews so that all are a fixed length of 800. Therefore, a single input sequence, as represented by the embedding layer, is a 2D matrix: one dimension is the number of words in each padded review, and the other is the embedded word vector of length 50. The embedding layer is used as the first layer in our model and converts the positive integers representing words into vectors of fixed size.”
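
Below is a minimal sketch of this preprocessing pipeline, assuming the TensorFlow-bundled Keras API; the exact code and variable names in RNNCNNconda.py may differ.

    from tensorflow.keras.datasets import imdb
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding

    VOCAB_SIZE = 4000          # keep only the 4,000 most frequent words
    MAX_REVIEW_LENGTH = 800    # pad/truncate every review to 800 words

    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)

    # Shorter reviews get zeros appended; longer reviews are cut after 800 words.
    x_train = pad_sequences(x_train, maxlen=MAX_REVIEW_LENGTH,
                            padding='post', truncating='post')
    x_test = pad_sequences(x_test, maxlen=MAX_REVIEW_LENGTH,
                           padding='post', truncating='post')

    # The embedding layer maps each padded review to an 800 x 50 matrix of word vectors.
    model = Sequential()
    model.add(Embedding(input_dim=VOCAB_SIZE, output_dim=50,
                        input_length=MAX_REVIEW_LENGTH))
    print(model.output_shape)   # (None, 800, 50)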

We started with a basic 1D CNN and RNN structure: an embedding layer, a Conv1D layer, a MaxPooling1D layer, a simple RNN layer, a dropout layer, a dense layer, and a single-node, sigmoid-activated output layer. For the other layers our default activation function was ReLU, which is computationally efficient. The idea is that the 1D CNN and max pooling layer perform dimensionality reduction and feature extraction, the RNN analyzes the sequential structure of those features, and the dense layer combines the RNN's output to classify the sentiment.
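
A sketch of this starting stack is shown below. The Conv1D hyperparameters (32 filters, kernel size 7, pool size 4, 0.3 dropout) are taken from the first notable model described in the next section; the SimpleRNN and dense layer sizes are illustrative assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                         SimpleRNN, Dropout, Dense)

    model = Sequential()
    model.add(Embedding(4000, 50, input_length=800))
    model.add(Conv1D(32, 7, activation='relu'))   # local feature extraction over word windows
    model.add(MaxPooling1D(pool_size=4))          # dimensionality reduction
    model.add(Dropout(0.3))
    model.add(SimpleRNN(32))                      # sequential analysis of the pooled features (size assumed)
    model.add(Dense(16, activation='relu'))       # combines the RNN features (size assumed)
    model.add(Dense(1, activation='sigmoid'))     # binary sentiment prediction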

Model Implementation

Our first notable model reached an accuracy of 87%. For this model we settled on a max pooling size of 4, a kernel size of 7, 32 filters, and a dropout rate of 0.3 between the CNN and RNN layers. Compared to previous models, the dropout layer and the increased filter count were the most important changes for improving accuracy. The dropout layer was especially helpful for closing the gap between training set and testing set accuracy.

We then added a bidirectional LSTM layer with 64 cells and ran the model. We got a testing set accuracy of 86.5% but a training set accuracy of 99%. It was clear we had overtrained, but the model had a lot of potential with accuracy still at 86.5%. We added dropout between the embedding layer and the first convolutional layer, reduced the number of epochs, and got 88% on the testing set with 90% on the training data. We further increased the number of filters in the CNN to 64 and reached 88.6% accuracy.

We would like to note that we also tried adding L2 kernel and activity regularization to the Conv1D layer with a decay rate of 0.00005, to see if we could close the discrepancy between training set and testing set accuracy. This did not significantly improve accuracy, so we raised the decay rate of the L2 kernel_regularizer in our Conv1D layer to 0.001. This did not improve the model's accuracy either. We then removed all regularization except the dropout layer to see what accuracy we could obtain, and remained at a similar accuracy to our other models. Since the L2 regularization did not affect accuracy, we dropped it and used only dropout layers.

We wondered if our dense layer was limiting classification, so we increased its number of nodes. This did not increase accuracy; we were still stuck at no more than 89%. Since the added nodes had no effect, we kept the models with fewer parameters, following Occam's Razor. We removed the dense layer after the LSTM so that the LSTM fed directly into the output prediction node. Our output layer applies a standard sigmoid activation to a single node because the model makes a binary classification.

We decided that our CNN should perform further feature extraction and dimensionality reduction, so we added an identical convolution and max pooling layer after the first 1D CNN. We believed this extra step gave the bidirectional LSTM adequately preprocessed input for capturing short- and long-range dependencies. We added dropout between each max pooling and convolution layer and after the LSTM. We tried using average pooling instead of max pooling, but saw no improvement in accuracy. We also experimented with padding. Interestingly, 'same' padding performed much better than 'valid' padding in the convolutional layers. We believe this is because valid padding does not let the convolution filter capture enough information at the beginning and end of the review. With same padding, the filter passes over the words at the edges of the review as many times as it passes over the words in the middle, and thus captures this peripheral information.
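
A quick shape check illustrates the padding difference: with the 800-word, 50-dimensional embedded reviews used here, 'valid' padding shortens the sequence while 'same' padding preserves its length by zero-padding the edges.

    import tensorflow as tf
    from tensorflow.keras.layers import Conv1D

    x = tf.random.normal((1, 800, 50))            # (batch, words, embedding dim)
    y_valid = Conv1D(32, 7, padding='valid')(x)   # (1, 794, 32): edge words are covered by fewer filter positions
    y_same = Conv1D(32, 7, padding='same')(x)     # (1, 800, 32): zero-padded ends keep the full length
    print(y_valid.shape, y_same.shape)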

Our biggest breakthrough, from 88% to 89%, came from decreasing the vocabulary to 4,000 words, increasing the number of filters and the LSTM layer size to 128, and using batch normalization. Decreasing the vocabulary is explained above under data preprocessing; we determined empirically that it consistently yielded higher accuracy. Increasing the filter count and LSTM nodes most likely helped because the additional Conv1D and max pooling layer extract more abstract features, and capturing those relationships accurately requires more capacity.

Finally, we found that adding batch normalization helped, and we added it after every convolutional layer and max pooling layer. Batch normalization normalizes each hidden layer's activations using the statistics of the current mini-batch and then scales and shifts them with learned parameters, much like normalizing a distribution. Because the batch statistics vary from batch to batch, this adds a small amount of noise to each layer's activations, which has a regularizing effect and lets each layer learn somewhat independently of the layers around it.
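
Putting these pieces together, the sketch below reflects the architecture as described: two same-padded Conv1D/MaxPooling1D blocks with 128 filters, batch normalization after each convolution and pooling layer, dropout, and a 128-cell bidirectional LSTM feeding a single sigmoid output node. The dropout rates and exact dropout placement are assumptions where the text does not pin them down.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D, Dropout,
                                         BatchNormalization, Bidirectional, LSTM, Dense)

    model = Sequential()
    model.add(Embedding(4000, 50, input_length=800))
    model.add(Dropout(0.3))                                          # assumed rate
    for _ in range(2):                                               # two identical conv/pool blocks
        model.add(Conv1D(128, 7, padding='same', activation='relu'))
        model.add(BatchNormalization())
        model.add(MaxPooling1D(pool_size=4))
        model.add(BatchNormalization())
        model.add(Dropout(0.3))                                      # assumed rate
    model.add(Bidirectional(LSTM(128)))
    model.add(Dropout(0.3))                                          # assumed rate
    model.add(Dense(1, activation='sigmoid'))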

We also experimented with the batch size, thinking that a smaller batch size would yield finer-grained weight updates and therefore higher accuracy. We found that decreasing the batch size from 64 to 32 in model.fit consistently provided a 1-2% increase in testing set accuracy, and at this point we were consistently breaking 89.5% accuracy on the testing set. For the loss we used binary cross entropy, which measures classification performance when the output is a probability between 0 and 1: loss = -(y·log(p) + (1-y)·log(1-p)). We chose it because we are categorizing only two classes, positive or negative sentiment, and because cross entropy heavily penalizes predictions that are confident and wrong.
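
A tiny numeric check shows how binary cross entropy punishes a confident wrong prediction far more than a confident correct one.

    import math

    def bce(y, p):
        # Binary cross entropy: -(y*log(p) + (1-y)*log(1-p))
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))

    print(bce(1, 0.9))   # ~0.105: confident and correct, small loss
    print(bce(1, 0.1))   # ~2.303: confident and wrong, much larger loss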

We tried different optimizers. Adagrad and SGD gave no improvement beyond 88% accuracy, compared with 89.94% with the Adam optimizer, so we used Adam. We did change the learning rate from its default value: a learning rate of 0.0006 gave 89.94% accuracy on our test set, compared to 89.5% with the default. This lower learning rate was especially helpful for reaching the top accuracy because, at the default learning rate of 0.001, the Adam optimizer overtrained on the data between training epochs. By lowering the learning rate we avoided that overtraining and gained a bit more accuracy.
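
The compile and fit settings described above can be sketched as follows; the epoch count shown is an assumption, since the write-up does not state the final value.

    from tensorflow.keras.optimizers import Adam

    model.compile(optimizer=Adam(learning_rate=0.0006),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    model.fit(x_train, y_train,
              validation_data=(x_test, y_test),
              epochs=5,            # assumed epoch count
              batch_size=32)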

Finally, our metric was accuracy, since we simply wanted the model to get as many test set predictions correct as possible; accuracy was adequate for this task.

Comparison of RNN model to CNN-RNN model:

Our RNN model tested at 87.7% accuracy on the testing set, while our CNN-RNN model tested at 89.94%, so the CNN-RNN model outperformed the RNN-only model. Since placing a 1D CNN first increased accuracy, we believe the CNN is extracting more important features for the RNN to use. Without the CNN preprocessing the data, the RNN may focus on features that are not the most useful. An RNN is limited by the sequential nature of its feature extraction: an LSTM keeps track of the global order of features, but movie reviews often jump around when describing a film's attributes, so global order may not be the most important signal. The CNN is better at extracting features in a global context for the RNN to then use for its prediction, which explains the increase in accuracy with a CNN-RNN model.

Part 2 - CNN for CIFAR

How to run code:

To test accuracy:

  1. Run CNNCIFAR.py
  2. The model we trained, which reaches 80.93% test accuracy, will be evaluated

To train the model:

  1. Comment out lines 68 and 69 (where the saved model is loaded from its pickle)
  2. Uncomment line 72
  3. Run CNNCIFAR.py
  4. Wait for the model to finish training.

Dataset and Data Manipulation

The aim of this CNN is to classify 32x32 images from the CIFAR-10 dataset into 10 categories. The CIFAR-10 set consists of 60,000 samples, where each image falls into one of ten mutually exclusive classes (airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks), with 6,000 images per class. The dataset is split into 50,000 training and 10,000 testing images. Each pixel has 3 color channels, so each image is composed of 32x32x3 = 3072 unique values, and since these are RGB images each channel value ranges from 0 to 255. To normalize each image, we cast its pixel array to float32 and divide by 255, so each image becomes a tensor of values between 0 and 1. Keeping the values between 0 and 1 avoids large values propagating forward through the network and growing out of control. Each class is associated with a number from 0 to 9, representing the 10 categories; these are the labels for each image. We convert each label into a one-hot encoded vector using the built-in Keras function to_categorical, so that the model's output layer (an array of size 10) has a matching target to compare against.
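
A sketch of this preprocessing, assuming the standard Keras CIFAR-10 loader (the exact code in CNNCIFAR.py may differ):

    from tensorflow.keras.datasets import cifar10
    from tensorflow.keras.utils import to_categorical

    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    # Scale pixel values from [0, 255] down to [0, 1].
    x_train = x_train.astype('float32') / 255.0
    x_test = x_test.astype('float32') / 255.0

    # One-hot encode the integer class labels (0-9) into length-10 vectors.
    y_train = to_categorical(y_train, 10)
    y_test = to_categorical(y_test, 10)

    print(x_train.shape, y_train.shape)   # (50000, 32, 32, 3) (50000, 10)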

Model Implementation

We started with a vanilla base CNN composed of a Conv2D layer with a 3x3 filter followed by a 2D max pooling layer with a 2x2 pooling window, repeated three times in total. The final max pooling layer is connected to two dense layers, one with 500 nodes and another with 10 nodes (our output layer), so the model can classify the features extracted by the convolution and pooling layers. The output layer uses a softmax activation so it can produce a proper prediction over the ten classes. The idea behind this model is that basic features are extracted in the first convolution layer, more advanced features in the second, and the most abstract features in the last.
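
A sketch of this base architecture follows. The filter counts (32, 64, 128) are taken from the Final Remarks section below; the Flatten layer and 'same' padding are assumptions needed to make the stack concrete.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', padding='same',
                     input_shape=(32, 32, 3)))                        # basic features
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))  # more advanced features
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(128, (3, 3), activation='relu', padding='same')) # most abstract features
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(500, activation='relu'))
    model.add(Dense(10, activation='softmax'))                        # one probability per CIFAR-10 class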

With no dropout or regularization applied, the model's accuracy after 20 epochs was 96.64% on the training set but only 71.24% on the test set. The model was clearly overfitting, so we reduced the number of epochs to 10 and saw 85.26% accuracy on the training set and 73.83% on the test set.

Next, we added dropout after each convolution and max pooling layer. We began with a dropout rate of 0.2 and ran the model for 50 epochs, resulting in 86% accuracy on the training set and 75.6% on the test set (results are pickled in "CNN-CIFAR"). Since training accuracy consistently exceeded testing accuracy, we knew more regularization would make the model more generalizable and therefore perform better on the test set. We increased the dropout rate from 0.2 to 0.4 in the Conv2D and max pooling layers, and increased the number of epochs to 100 to compensate for the extra training that stronger regularization requires. Running for 100 epochs gave 77.18% accuracy on the training set and 75.97% on the test set (results are pickled in "CNN-CIFAR-1-ANDRE"), and we noticed little improvement in training accuracy after 50 epochs. Suspecting that 0.4 dropout was too much and 100 epochs too many, we ran a model with 0.3 dropout for 20 epochs, this time applying dropout only after the max pooling layers. With these parameters we got 81.9% accuracy on the training set and 78.6% on the test set (results are pickled in "CNN-CIFAR-1-GEORGE").

Our model was optimized using Adam. We used categorical cross entropy because we wanted to penalize the model for making confident, incorrect predictions across our ten categorical classes, and we used accuracy as the model's metric.
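
As a sketch, the corresponding compile call (using the model object from the sketch above):

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])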

Further Optimization Techniques

Next, we experimented with the kernel size in our convolution layers. Changing the kernel size to 4x4 resulted in a training accuracy of 83.32% and a test accuracy of 77.57%, a decrease of about 1%, so we kept our kernel size at 3x3. Still worried about overfitting, we added more regularization. We added an L2 kernel regularizer to each convolutional layer with an arbitrary weight decay of 1e-5; we also tried a larger weight decay (5e-5) and a smaller one (5e-6) and saw no significant change in accuracy. A kernel regularizer is a weight regularization technique: it adds a layer-specific penalty on weight size on top of the loss determined by the model's overall loss function and optimizer. L2 penalizes each weight by its squared value multiplied by the weight decay. In the well-known ImageNet paper, Alex Krizhevsky explores the effect of weight decay on image classification and finds that it was important for the model to train, actually reducing training error in addition to mitigating overfitting (Krizhevsky, 2012). Kernel regularization rests on the assumption that smaller weights make the model simpler, which prevents overfitting and makes the kernel more robust in feature extraction because it does not rely too heavily on any one large-valued connection. We also added batch normalization after every convolutional layer and max pooling layer; as explained in Part 1, the batch-to-batch noise it introduces has a regularizing effect and lets each layer learn somewhat independently of the layers around it.
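
A sketch of one convolution/pooling block with the L2 kernel regularizer and batch normalization described above; the dropout rate shown matches the values chosen in the next paragraph, and the helper function itself is just an illustration.

    from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Dropout
    from tensorflow.keras.regularizers import l2

    def add_conv_block(model, filters):
        # One block; repeated with 32, 64, then 128 filters.
        model.add(Conv2D(filters, (3, 3), activation='relu', padding='same',
                         kernel_regularizer=l2(1e-5)))   # penalizes large squared weights
        model.add(BatchNormalization())
        model.add(MaxPooling2D((2, 2)))
        model.add(BatchNormalization())
        model.add(Dropout(0.2))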

We then changed the dropout rate to 0.2 after our CNN layers and 0.3 in the dense layer. This model reached 87.42% accuracy on the training data and 79.85% on the testing data (results are pickled in "CNN-CIFAR-2-George"). Raising the number of epochs from 50 to 100 gave 89.93% on the training data and 80.93% on the testing data (results are pickled in "CNN-CIFAR-3-George"). Increasing the batch size from 64 to 128 gave 88.07% on the training set and 80.78% on the testing set (results are pickled in "CNN-CIFAR-2-ANDRE"), and a batch size of 32 likewise gave 88.07% on the training set and 80.78% on the test set. Thus we stuck with a batch size of 64, which yielded the best results.

Final Remarks

To make our model design more explicit in terms of Occam's Razor, we will discuss how we selected the number of filters in our CNN layers. Our first layer had 32 filters; using more than this gave no increase in accuracy, so we went with the lowest number of filters that did not decrease accuracy. With more parameters we have longer training time and more opportunity for the network to go wrong during training, and with more nodes than necessary the model may find a relationship between the input data and the output classification that does not exist. We therefore kept model parameters as low as possible without jeopardizing accuracy. In each subsequent convolutional layer we doubled the number of filters relative to the previous one (32 in the first layer, 64 in the second, and 128 in the third), so that the later layers had enough filters to capture the more abstract features they extract. Finally, our dense layer had 500 nodes, which was likewise the smallest size that did not decrease accuracy.

To see if we could push accuracy even higher, we tried adding data augmentation to our model. We created an ImageDataGenerator and ran the model for 50 epochs, reaching a training set accuracy of 80.4%. We dropped this model since it provided no significant benefit over our previous one, but data augmentation remains worth considering; future work could explore different augmentation strategies to see whether additional accuracy can be obtained.
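
A sketch of the data-augmentation experiment; the specific transformations are not recorded in this write-up, so the settings below are illustrative.

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Illustrative augmentation settings (not necessarily the exact ones we used).
    datagen = ImageDataGenerator(width_shift_range=0.1,
                                 height_shift_range=0.1,
                                 horizontal_flip=True)

    model.fit(datagen.flow(x_train, y_train, batch_size=64),
              validation_data=(x_test, y_test),
              epochs=50)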

Sources:

https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c
https://arxiv.org/abs/1502.03167
https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/
https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e
https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
https://keras.io/optimizers/
https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/
https://machinelearningmastery.com/weight-regularization-to-reduce-overfitting-of-deep-learning-models/
https://www.cs.toronto.edu/~kriz/cifar.html
