You-sha/Stock-Prediction-LSTM

Stock Prediction: LSTM & SA

Author: Mohammad Yousha

Predicting Kanoria Chemicals stock price using Long Short-Term Memory and Sentiment Analysis.

Progress:

  • Study LSTM, SA and learn their application.
  • Prepare data.
  • Build model and make predictions.
  • Documentation.

Resources

Multi-Variate LSTM model: https://www.kaggle.com/code/amarsharma768/stock-price-prediction-using-lstm/notebook

Data Preparation

The news data had more than 3 million data points, while the stock data had only a little over 3,500.

Here is what I did in this step:

  • Dropped news data that was from before the company's origin.
  • Removed the data that was from days when the market was closed or the stocks weren't traded.
  • There were multiple news headlines from different papers for each day, including ones useless for this purpose (entertainment, horoscopes, sports, etc.). I kept only the useful headlines and dropped the rest.
  • I randomly selected one headline for each day (since there were still multiple), and finally merged the news and stock data into one dataset.
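The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the code used in this repository: the `category` column, the `build_dataset` helper, its parameters, and the toy rows are all assumptions made for the example.

```python
import pandas as pd

def build_dataset(news, stock, first_day, keep_categories, seed=0):
    """Filter headlines, keep one per day, and join them with the price data.

    Hypothetical helper: column names other than `date` and `headline_text`
    are assumptions, not taken from the original notebook.
    """
    news = news.copy()
    stock = stock.copy()
    news["date"] = pd.to_datetime(news["date"])
    stock["date"] = pd.to_datetime(stock["date"])

    # 1. Drop headlines published before the chosen start date.
    news = news[news["date"] >= pd.Timestamp(first_day)]
    # 2. Keep only the headline categories judged useful.
    news = news[news["category"].isin(keep_categories)]
    # 3. Shuffle (seeded), then keep the first headline left for each day,
    #    which amounts to picking one headline per day at random.
    news = news.sample(frac=1, random_state=seed).drop_duplicates("date")
    # 4. An inner join keeps only days present in both tables, so days
    #    without trading data drop out here.
    merged = news[["date", "headline_text"]].merge(stock, on="date", how="inner")
    return merged.sort_values("date").reset_index(drop=True)

# Toy data (placeholder values, not the real dataset).
news = pd.DataFrame({
    "date": ["2006-12-30", "2007-01-08", "2007-01-08", "2007-01-08", "2007-01-13"],
    "category": ["business", "india", "business", "sports", "india"],
    "headline_text": ["old", "a", "b", "c", "weekend"],
})
stock = pd.DataFrame({"date": ["2007-01-08", "2007-01-09"], "high": [28.67, 28.60]})

dataset = build_dataset(news, stock, "2007-01-01", {"india", "business"})
print(dataset)
```

Note that the inner join also drops trading days that have no remaining headline, which may or may not be what you want.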

Final dataset sample:

   date        headline_text                                     open       high       low        close      adj close  volume
0  2007-01-08  ULFA strikes again in Assam; kills nine people    28.666666  28.666666  28.666666  28.666666  16.249998  3600.0
1  2007-01-09  Marry-and-dump NRIs may face Indian law           28.100000  28.600000  28.000000  28.083332  15.919325  2490.0
2  2007-01-10  Kalam sets tone for engagement of global Indians  27.566666  29.033333  27.333332  27.566666  15.626451  32694.0
3  2007-01-11  Plan panel may cut SSA budget                     27.700001  28.416666  27.666666  28.000000  15.872088  4800.0
4  2007-01-12  Bangladesh president resigns as chief advisor     28.299999  28.600000  28.116667  28.433332  16.117727  13122.0

3754 rows × 8 columns

Model Building

I found a notebook explaining how to use an LSTM on stock data and modified its code to fit my use case. The original notebook is linked under Resources.

Here are the changes I made:

  • Changed the code to fit four features instead of two.
  • Fixed the inverse transform parts.
  • Reduced the number of epochs.
  • Converted it into a function and made it reproducible.
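To illustrate what fitting several features and fixing the inverse transform can involve, here is a minimal NumPy sketch. It is not the notebook's code: the helper names, the hand-rolled min-max scaling, the lookback of 3, and the choice of target column are assumptions for the example. The idea shown is that the scaler is fitted on all feature columns, so a single-column prediction has to be un-scaled with that column's own minimum and range.

```python
import numpy as np

def minmax_fit(data):
    """Return the per-column minimum and range used for min-max scaling."""
    lo = data.min(axis=0)
    span = data.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid dividing by zero for constant columns
    return lo, span

def make_windows(scaled, lookback, target_col):
    """Slice a (time, features) array into LSTM inputs and next-step targets."""
    X = np.stack([scaled[i:i + lookback] for i in range(len(scaled) - lookback)])
    y = scaled[lookback:, target_col]
    return X, y

def inverse_target(y_scaled, lo, span, target_col):
    """Undo the scaling for the target column only."""
    return y_scaled * span[target_col] + lo[target_col]

# Toy series: 10 time steps, 4 features (placeholder numbers).
data = np.arange(40, dtype=float).reshape(10, 4)
lo, span = minmax_fit(data)
scaled = (data - lo) / span

X, y = make_windows(scaled, lookback=3, target_col=1)
print(X.shape, y.shape)  # (7, 3, 4) (7,)

# Scaled targets map back to the original values of column 1.
restored = inverse_target(y, lo, span, target_col=1)
```

`X` has the `(samples, time steps, features)` shape that recurrent layers typically expect, and `restored` is in the original price units, which is what the RMSE should be computed on.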

LSTM Prediction

  • Train RMSE: 9.31
  • Test RMSE: 4.76

[Figure: High prediction vs actual]

  • The data before the break around 2019 is the training set; the data after it is the test set.

Next 10 days prediction:

[Figure: Upcoming 10 days stock price]

The dataset ends on 30 March 2022 because the news headlines data stops there. Since I had access to the actual stock prices for the days after that, I decided to compare my predictions with the actual prices.

Next 10 days - Predicted vs Actual:

[Figure: Actual vs Predicted 10 days stock price]

  • From these results, it can be concluded that you should not use my model for actual investment.

Random Forest Prediction

Since the project also has to use the sentiment scores I computed for the headlines, I also built a Random Forest Regressor model. I evaluated it with 3-fold cross-validation and tuned it using randomized search.

I used open, close, low, adj close, and volume from the stock data, plus neg, neu, and pos from the sentiment scoring, as the independent variables, and high as the target variable.
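A setup like the one described above could look roughly like this with scikit-learn's `RandomizedSearchCV`. This is a sketch under assumptions: the data is synthetic, and the search space, `n_iter`, and scoring choice are illustrative rather than the values used for the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
feature_names = ["open", "close", "low", "adj close", "volume", "neg", "neu", "pos"]

# Synthetic stand-ins for the real feature matrix and the "high" target.
X = rng.random((120, len(feature_names)))
y = 2.0 * X[:, 0] + rng.normal(0, 0.05, 120)

# Illustrative search space (an assumption, not the author's grid).
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                                # number of sampled configurations
    cv=3,                                    # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",   # higher (less negative) is better
    random_state=0,
)
search.fit(X, y)

# best_score_ is the negated RMSE, so flip the sign to report it.
print(search.best_params_, -search.best_score_)
```

The random seeds are fixed so that repeated runs sample the same configurations.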

  • RMSE: 30.8
  • R2 Score: 0.69
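For reference, the two metrics reported above can be computed as in the sketch below. The helper names and the toy numbers are made up for illustration and are unrelated to the reported results.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values (placeholders).
y_true = [10, 12, 14, 16]
y_pred = [11, 12, 13, 17]

print(round(rmse(y_true, y_pred), 3))      # 0.866
print(round(r2_score(y_true, y_pred), 3))  # 0.85
```

Because RMSE is expressed in price units, two RMSE values are only directly comparable when they are computed for the same target over the same test period.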

Test set's Actual vs Predicted:

[Figure: Actual vs Predicted stock price (RF)]

  • The model's predictions seem to flatline around March-April 2021.

The Random Forest model does not seem to perform as well as the LSTM model. That may make sense: I read a scientific article stating that LSTM is currently one of the best models for stock prediction.

Still, it is best to try out different methods and find the best for yourself (as long as you have the time, of course).

Conclusion

I have made two tools to predict stock price:

  • One that uses time-series data and an LSTM model.
  • Another that uses sentiment scores from news headlines combined with the stock data and a Random Forest model.