Stock Market Price Predictor using Supervised Learning

Aim

To examine a number of forecasting techniques for predicting future stock returns from past returns and numerical news indicators, with the goal of constructing a portfolio of multiple stocks that diversifies risk. We apply supervised learning methods to stock price forecasting in order to interpret seemingly chaotic market data. Stock market fluctuations are violent, and there are many complicated financial indicators. However, advances in technology offer an opportunity to profit from the stock market more consistently and can help experts identify the most informative indicators for better prediction. Predicting market value is important for maximizing the profit of stock option purchases while keeping risk low. We used historical stock datasets and news headlines for the forecasting.

Prerequisites

Install the following software and libraries on your machine before running this project.

Python 3 (Anaconda): Installs IPython Notebook and most of the required libraries, such as sklearn, pandas, seaborn, matplotlib, numpy, scipy, and streamlit.

Libraries used

Pandas: For creating and manipulating dataframes.

Scikit Learn: For importing k-means clustering.

JSON: Library to handle JSON files.

XML: For handling XML data, which keeps data separate from presentation and stores it in plain-text format.

Beautiful Soup and Requests: For web scraping and for handling HTTP requests.

Matplotlib: Python Plotting Module.

DATA

The dataset we used was scraped from the web through APIs. The historical dataset came from the NASDAQ API, and the news articles are from Yahoo Finance.

HistoricalData_APPLE.csv

Data Overview

Data Source --> Dataset/

Data points --> 2517 rows

Dataset date range --> October 2011 to September 2021

Dataset Attributes:

Close/Last - Close/Last Prices
Volume - Volume of Stocks
Open - Opening Prices of Stocks
High - Highest Prices of Stocks
Low - Lowest Prices of Stocks

Data Preprocessing

DATA CLEANING

Deleted the "Unnamed:7" column because it contained NaN values.
Parsed the Date attribute into the "datetime64" data type.
Checked for duplicate rows (none found).
Dropped features that are of no use to the model.
Removed outliers to make the data cleaner for further use.
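The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the sample rows are made up, the `MM/DD/YYYY` date format is an assumption, the duplicate row is included only to exercise the check, and the 1.5 × IQR rule is one common outlier filter chosen here because the README does not say which rule was used. `clean_prices` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for HistoricalData_APPLE.csv (values are not real data).
raw = pd.DataFrame(
    {
        "Date": ["09/01/2021", "09/02/2021", "09/02/2021",
                 "09/03/2021", "09/07/2021", "09/08/2021"],
        "Close/Last": [152.5, 153.6, 153.6, 154.3, 900.0, 155.1],
        "Volume": [80, 71, 71, 57, 82, 74],
        "Open": [152.8, 153.9, 153.9, 153.8, 154.9, 156.9],
        "High": [154.9, 154.7, 154.7, 154.6, 157.2, 157.0],
        "Low": [152.4, 152.4, 152.4, 153.1, 154.4, 153.9],
        "Unnamed:7": [np.nan] * 6,
    }
)


def clean_prices(df, price_col="Close/Last"):
    """Hypothetical helper mirroring the cleaning steps listed above."""
    out = df.dropna(axis=1, how="all")  # drops all-NaN columns such as "Unnamed:7"
    # Assumption: dates are stored as MM/DD/YYYY strings.
    out = out.assign(Date=pd.to_datetime(out["Date"], format="%m/%d/%Y"))
    out = out.drop_duplicates()
    # Assumption: outliers are rows outside 1.5 * IQR of the closing price.
    q1, q3 = out[price_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out[price_col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[keep].sort_values("Date").reset_index(drop=True)


cleaned = clean_prices(raw)
print(cleaned)
```

On a long, trending price series a global IQR filter can discard legitimate prices, so the outlier rule would likely need to be adapted to the real data.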

EDA (Exploratory Data Analysis)

Exploratory Data Analysis is a process of examining or understanding the data and extracting insights or main characteristics of the data. EDA is generally classified into two methods, i.e. graphical analysis and non-graphical analysis.

The primary purposes of EDA are:

Examining the data distribution
Handling missing values in the dataset (one of the most common issues with any dataset)
Handling outliers
Removing duplicate data
Encoding categorical variables
Normalizing and scaling

Here are some examples of the analysis we performed while exploring the data:

Yearly visualization of all columns

Monthly visualization of all columns

Quarterly visualization of all columns

Scatter plots between each pair of attributes (trend)

Heat map of the correlation between each pair of attributes (linear relation)
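As a rough sketch of the data behind these plots, the snippet below aggregates a price table by year, quarter and month and computes the correlation matrix that a heat map would display. The prices are a synthetic random walk, not the project's dataset, and the grouping approach is an assumption about how the views could be produced.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices (random walk) standing in for the real dataset.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", "2021-12-31")
close = 100 + rng.normal(0, 1, len(dates)).cumsum()
open_ = close + rng.normal(0, 0.5, len(dates))
prices = pd.DataFrame(
    {
        "Close/Last": close,
        "Open": open_,
        "High": np.maximum(open_, close) + 0.5,
        "Low": np.minimum(open_, close) - 0.5,
        "Volume": rng.integers(1_000_000, 5_000_000, len(dates)),
    },
    index=dates,
)

# Mean of every column per year, quarter and month.
yearly = prices.groupby(prices.index.year).mean()
quarterly = prices.groupby(prices.index.to_period("Q")).mean()
monthly = prices.groupby(prices.index.to_period("M")).mean()

# Pearson correlation between every pair of attributes (the heat map input).
corr = prices.corr()
print(corr.round(3))
```

Each aggregated frame can be passed to `DataFrame.plot()`, and `corr` to a heat map function such as `seaborn.heatmap`, to produce the charts.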

Data Modelling

After the exploratory data analysis, we started modelling in Python. We applied machine learning algorithms to the datasets to build models that predict stock prices. In this step we split the data into training and test sets of 80% and 20% respectively. We tried several algorithms and applied hyperparameter tuning to improve their performance. The algorithms we tried are:

Linear Regression
Naïve Bayes
Neural networks

LINEAR REGRESSION

Linear Regression is a supervised learning algorithm. It models a predicted value as a function of independent variables and helps find the relationship between those variables and the forecast. In this case we used the companies' historical data from previous years to predict future stock values.

Scores of the Linear Regression model:
RMSE (Root Mean Squared Error) = 0.1459830874093662
R² (R-squared score) = 0.9998357614326422
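The workflow described above (80/20 split, fit, then RMSE and R²) might look like the following with scikit-learn. This is a sketch on synthetic data: the choice of Open, High and Low as features, the shuffled split, and the way the closing price is generated are all assumptions, so the printed scores say nothing about the figures reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic prices; the relationship between columns is invented for illustration.
rng = np.random.default_rng(42)
n = 500
open_ = 100 + rng.normal(0, 1, n).cumsum()
high = open_ + rng.uniform(0.5, 2.0, n)
low = open_ - rng.uniform(0.5, 2.0, n)
close = 0.5 * high + 0.5 * low + rng.normal(0, 0.2, n)

X = np.column_stack([open_, high, low])  # assumed feature set
y = close

# 80% training data, 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
print(f"RMSE = {rmse:.4f}, R2 = {r2:.4f}")
```

`train_test_split` shuffles rows by default; passing `shuffle=False` keeps the split chronological, which is often preferred for time series so that the test set lies strictly after the training period.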

Naïve Bayes

Naïve Bayes is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to a class. It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. It is called Bayes because it relies on Bayes' Theorem.

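Bayes' Theorem, where A is the class and B is the observed features:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$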

Predicting the impact of news articles on the closing price of Apple Inc. stock using a Naive Bayes classifier. First we merge the news articles dataset and the historical stock dataset into a single dataset on the 'Date' column, after making some necessary changes to both. We then add two columns, 'close_price_diff' and 'Impact': 'close_price_diff' contains the difference in closing price from the previous day, and 'Impact' contains 1 if that difference is positive and 0 if it is negative. Next we apply Natural Language Processing to the news headline text and convert it to vectorized form, obtaining a bag of words made up of the 20000 most common words.

We then train the Naive Bayes model (Gaussian, Multinomial or Bernoulli, each in a different file) after splitting the dataset into 80% training data and 20% test data. Finally we perform hyperparameter tuning to get the best predictions. In effect, the model classifies each news article as indicating a profit or a loss, based on the difference between the current day's closing price and the previous day's.

The accuracy score of Naïve Bayes is 51.93%.
After hyperparameter tuning it increased to 53.29%.
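The pipeline described above could be sketched as follows with the Multinomial variant. The prices and headlines are toy data in which the wording perfectly matches the price direction, so the accuracy it prints is meaningless; the 'headline' column name, the tuned parameter (`alpha`) and its grid are assumptions, since the README does not state which hyperparameters were tuned.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data: the price rises on odd days and falls on even days.
dates = pd.date_range("2021-01-04", periods=41, freq="D")
closes = [100.0]
for i in range(1, 41):
    closes.append(closes[-1] + (1.0 if i % 2 else -0.5))
prices = pd.DataFrame({"Date": dates, "close_price": closes})
news = pd.DataFrame(
    {
        "Date": dates,
        "headline": [
            "Apple shares rally on strong demand" if i % 2
            else "Apple shares slide on weak outlook"
            for i in range(41)
        ],
    }
)

# Merge on 'Date', then derive 'close_price_diff' and the binary 'Impact' label.
data = (
    prices.merge(news, on="Date", how="inner")
    .assign(close_price_diff=lambda d: d["close_price"].diff())
    .dropna(subset=["close_price_diff"])  # the first day has no previous close
    # Assumption: a zero difference is labelled 0 (the README leaves it undefined).
    .assign(Impact=lambda d: (d["close_price_diff"] > 0).astype(int))
)

X_train, X_test, y_train, y_test = train_test_split(
    data["headline"], data["Impact"],
    test_size=0.2, random_state=0, stratify=data["Impact"],
)

pipeline = Pipeline(
    [
        ("bow", CountVectorizer(max_features=20000)),  # bag of words, top 20000 terms
        ("nb", MultinomialNB()),
    ]
)

# Hyperparameter tuning over the smoothing strength (assumed grid).
search = GridSearchCV(pipeline, {"nb__alpha": [0.1, 0.5, 1.0]}, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

accuracy = search.score(X_test, y_test)
print(search.best_params_, accuracy)
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from the training folds during cross-validation, which avoids leaking test headlines into the model.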

RNN LSTM

Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms. As the name suggests, a neural network is loosely modelled on the brain, with neurons working together to produce an output. A recurrent neural network (RNN) is a type of neural network that works on sequential or time series data. Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems.

The RMSE of the LSTM model is 101.3501.
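For reference, and assuming the score was computed with the standard definition, the RMSE over $n$ test points with actual values $y_i$ and predictions $\hat{y}_i$ is:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

RMSE is expressed in the units of the target, so two RMSE values are directly comparable only when they are computed on the same target and the same scale.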

LSTM using 20 Days Data

Predicting the closing stock price of Apple Inc. from the past 20 days of stock prices using a recurrent neural network called LSTM. We combine the historical data of Apple stock prices with the news articles data, after some necessary changes, to get a combined dataset. Then we apply sentiment analysis to the news headlines to obtain 'compound', 'positive', 'negative' and 'neutral' values. After some further changes and visualizing the data in various ways, we settled on 'close_price' and 'compound' as our features and 'close_price' as our dependent variable. The model is trained on 80% of the data, after applying feature scaling (MinMaxScaler) to the features, and tested on the remaining 20%. We build the model from LSTM and Dense layers with appropriate parameter values. Finally, the model predicts the closing price of the 21st day using the closing prices and compound values of the past 20 days.
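The data preparation described above (scaling, 80/20 split, 20-day input windows) can be sketched as follows. The series are synthetic, `make_windows` is a hypothetical helper, and fitting the scaler on the training portion only is an assumption; the LSTM itself is left out, but the resulting arrays have the (samples, timesteps, features) shape that recurrent layers expect.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

WINDOW = 20  # number of past days fed to the model


def make_windows(features, target_col=0, window=WINDOW):
    """Hypothetical helper: X[i] is `window` consecutive rows, y[i] the next day's target."""
    X, y = [], []
    for end in range(window, len(features)):
        X.append(features[end - window:end])
        y.append(features[end, target_col])
    return np.array(X), np.array(y)


# Synthetic stand-ins for the 'close_price' and 'compound' columns.
rng = np.random.default_rng(1)
n_days = 120
close_price = 120 + rng.normal(0, 1, n_days).cumsum()
compound = rng.uniform(-1, 1, n_days)
features = np.column_stack([close_price, compound])

split = int(0.8 * n_days)  # first 80% of days for training

# Assumption: the scaler is fitted on training rows only to avoid look-ahead.
scaler = MinMaxScaler().fit(features[:split])
scaled = scaler.transform(features)

X_train, y_train = make_windows(scaled[:split])
# Start the test slice WINDOW rows early so the first test day has a full history.
X_test, y_test = make_windows(scaled[split - WINDOW:])

print(X_train.shape, X_test.shape)
```

Each sample in `X_train` holds 20 days of both features, and the matching entry in `y_train` is the scaled closing price of the 21st day.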

Modeling And Deployment

The model we finally chose is Linear Regression, which we deployed on Heroku and Streamlit, using the Flask framework to serve the model on the website. We also deployed the LSTM model trained on the combined data using Streamlit. It predicts the closing price of the 21st day using the closing prices of the past 20 days. For this we combined the news articles data and the historical data into a new dataset named stock_data, which we used to train a neural network built from LSTM and Dense layers. We then saved the model in .h5 format. The app.py file loads the model.h5 file to predict the result, and it also defines the Streamlit interface and controls what is shown on it. Finally, the user can interact with the index.html file to enter a date for which they want the closing price to be predicted.
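The prediction step that an app such as app.py performs might look roughly like this: scale the latest 20 days, ask the model for the next scaled closing price, then convert it back to a price. Everything here is an illustrative assumption. `predict_next_close` and `LastValueModel` are hypothetical names, and the stand-in model simply repeats the last closing price so the sketch runs without a saved model.h5; a real app would presumably load the trained network instead, for example with Keras' `load_model`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

WINDOW = 20


class LastValueModel:
    """Hypothetical stand-in with a Keras-style predict(); repeats the last scaled close."""

    def predict(self, batch):
        return batch[:, -1, :1]


def predict_next_close(model, scaler, recent_features, window=WINDOW):
    """Predict the next closing price from the latest `window` rows of [close, compound]."""
    recent = np.asarray(recent_features, dtype=float)
    if recent.shape[0] < window:
        raise ValueError(f"need at least {window} days of history")
    scaled = scaler.transform(recent[-window:])
    batch = scaled[np.newaxis, ...]  # shape (1, window, n_features)
    scaled_pred = float(np.asarray(model.predict(batch)).ravel()[0])
    # Undo the min-max scaling for the closing price column (column 0) only.
    return scaled_pred * scaler.data_range_[0] + scaler.data_min_[0]


# Made-up history: closing price climbing from 100 to 129, with a compound score.
history = np.column_stack([np.linspace(100, 129, 30), np.linspace(-0.5, 0.5, 30)])
scaler = MinMaxScaler().fit(history)

next_close = predict_next_close(LastValueModel(), scaler, history)
print(round(next_close, 2))
```

Inverting the scaling by hand for one column avoids having to build a full-width dummy array just to call `scaler.inverse_transform`.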

Here is the deployment link of the model Click Here

Here are some screenshots of the website deployed on Streamlit.

Steps that we performed:

Web scraping
Data Loading
Data Preprocessing
Exploratory data analysis
Feature engineering
Feature selection
Feature transformation
Model building
Model evaluation
Model tuning
Predictions

Tools used:

Python
PyCharm
Jupyter Notebook
Google Colab
GitHub
Git Bash
Sublime Text

Team Members

Chandrachud Singh Chundawat
V. Nanda Gopal
Rahul Amarwal
Kondapu Lavanya
Sunil Mali
Sandeep Mannam
Giduturi Namrata Sai
Bale Meghana
Sital Agrawal

Team Leader

Chandrachud Singh Chundawat

Coordinator Name

Mr. Yasin shah

