chawla201/The-Winton-Stock-Market-Challenge

The Winton Stock Market Challenge

The Winton Stock Market Challenge was a competition hosted by Winton on Kaggle in 2016. The main task of this competition was to predict the interday and intraday returns of a stock, given its history over the past few days.
NOTE:
To view the final code with the interactive graphs, click here

tl;dr

  • Developed a data pre-processing pipeline
  • Tuned and trained a multi-output multi-layer perceptron regression model to predict stock returns from the past two days' returns and a set of features

Data

In this competition the challenge is to predict the return of a stock, given the history of the past few days.

We provide 5-day windows of time, days D-2, D-1, D, D+1, and D+2. You are given returns in days D-2, D-1, and part of day D, and you are asked to predict the returns in the rest of day D, and in days D+1 and D+2.

During day D, there is intraday return data, which are the returns at different points in the day. We provide 180 minutes of data, from t=1 to t=180. In the training set you are given the full 180 minutes, in the test set just the first 120 minutes are provided.

For each 5-day window, we also provide 25 features, Feature_1 to Feature_25. These may or may not be useful in your prediction.

Each row in the dataset is an arbitrary stock in an arbitrary 5-day time window.
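The layout above can be sketched with the column-name conventions from the competition description (these names are assumptions based on that description, not read from the actual files):

```python
# Sketch of the expected training-set layout. Column names are assumptions
# based on the competition description, not verified against the files.
feature_cols = [f"Feature_{i}" for i in range(1, 26)]   # 25 features
interday_cols = ["Ret_MinusTwo", "Ret_MinusOne"]        # days D-2 and D-1
intraday_cols = [f"Ret_{t}" for t in range(2, 181)]     # minutes t=2..180

# In the test set only the first 120 minutes are observed;
# the remaining minutes and days D+1, D+2 are predicted.
observed_intraday = [c for c in intraday_cols if int(c.split("_")[1]) <= 120]
print(len(feature_cols), len(intraday_cols), len(observed_intraday))
# → 25 179 119
```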


Technologies Used

  • Python
  • Pandas
  • Numpy
  • Matplotlib
  • Seaborn
  • Plotly
  • Scikit Learn
  • Principal Component Analysis
  • Iterative Imputer
  • Random Forest Regressor
  • Multi-layer Perceptron Regressor
  • Multi Output Regressor

Exploratory Data Analysis

Exploratory Data Analysis is performed to explore the structure of the data and to identify categorical and continuous data fields, missing values, and correlations among different data columns

Correlation heatmap between different features:

Feature Engineering

As observed in the correlation heatmap above, many features are strongly correlated with each other. This means that it is possible to apply dimensionality-reduction methods such as Principal Component Analysis.
Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, often keeping only the first few principal components and ignoring the rest.
The optimal number of principal components can be found by examining the cumulative explained variance for different numbers of components; the smallest set whose cumulative explained variance approaches one is chosen.
Here we can observe that the optimal number of components is 12
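A minimal sketch of this selection rule, using scikit-learn's PCA on synthetic correlated data (the real feature matrix is not reproduced here, and the 0.999 cutoff is an illustrative threshold, not the author's):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the correlated features: 25 observed columns
# generated from 12 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 12))
mixing = rng.normal(size=(12, 25))
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 25))

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative explained variance
# is essentially one (0.999 here).
n_components = int(np.searchsorted(cum_var, 0.999) + 1)
print(n_components)
```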
To simplify the problem, the intraday returns are aggregated into a sum and a standard deviation, for both the features (Ret_2 to Ret_120) and the target labels (Ret_121 to Ret_180). The standard deviation of the interday returns is also considered to see how much the returns vary.
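A sketch of that aggregation with pandas, on a toy frame that follows the Ret_<minute> naming (the aggregate column names here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real intraday data.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(scale=0.001, size=(4, 179)),
                  columns=[f"Ret_{t}" for t in range(2, 181)])

feat_cols = [f"Ret_{t}" for t in range(2, 121)]      # observed minutes
target_cols = [f"Ret_{t}" for t in range(121, 181)]  # minutes to predict

# Collapse each group of intraday returns into a sum and a standard deviation.
df["Ret_Feat_Sum"] = df[feat_cols].sum(axis=1)
df["Ret_Feat_Std"] = df[feat_cols].std(axis=1)
df["Ret_Target_Sum"] = df[target_cols].sum(axis=1)
df["Ret_Target_Std"] = df[target_cols].std(axis=1)
print(df[["Ret_Feat_Sum", "Ret_Feat_Std",
          "Ret_Target_Sum", "Ret_Target_Std"]].shape)  # → (4, 4)
```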

Model Building

After imputing missing values and applying Principal Component Analysis to the numerical data columns, the categorical data was transformed into dummy-variable columns using Pandas' get_dummies() function.
The data was split into training (70%) and testing (30%) data.
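The pre-processing steps above can be sketched as follows. The data and column names are synthetic stand-ins, but the calls (IterativeImputer, PCA, get_dummies, a 70/30 train_test_split) match what the text describes:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Toy numerical data with injected missing values, plus one
# categorical column (names here are hypothetical).
rng = np.random.default_rng(2)
num = pd.DataFrame(rng.normal(size=(200, 6)),
                   columns=[f"Feature_{i}" for i in range(1, 7)])
num.iloc[::13, 2] = np.nan
cat = pd.Series(rng.choice(["A", "B", "C"], 200), name="Feature_7")

imputed = IterativeImputer(random_state=0).fit_transform(num)  # fill NaNs
reduced = PCA(n_components=4).fit_transform(imputed)           # reduce dims
X = np.hstack([reduced, pd.get_dummies(cat).to_numpy()])       # add dummies
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 70/30 split as in the text
print(X_train.shape, X_test.shape)
```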
I tried two different models:

  • Random Forest Regressor: as a baseline model
  • Multi-Layer Perceptron Regressor (MLPRegressor): since the data contains feature values of different ranges, I thought a multi-layer perceptron model would be robust to those variations
Since the problem statement requires predicting multiple target values, MultiOutputRegressor is used.
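A minimal sketch of wrapping an MLPRegressor in MultiOutputRegressor on synthetic data (the hyperparameters here are placeholders, not the tuned values):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
# Four linear targets stand in for the multiple returns to predict.
Y = np.column_stack([X @ rng.normal(size=10) for _ in range(4)])

model = MultiOutputRegressor(
    MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0))
model.fit(X, Y)
print(model.predict(X[:5]).shape)  # one column per target
```

MultiOutputRegressor fits one independent regressor per target column; note that MLPRegressor can also emit multiple outputs natively, so the wrapper is a design choice rather than a requirement.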
y_test (blue) vs. y_pred (orange) for first 500 data points in test data
MLP Regressor
Random Forest Regressor

Hyperparameter Tuning

As seen in the graphs above, the prediction lines for the Random Forest Regressor are mostly flat with a few sparse peaks, while the Multi-Layer Perceptron Regressor shows far better results. Thus only the Multi-Layer Perceptron Regressor underwent hyperparameter tuning. Grid Search Cross-Validation is used to fine-tune the regression model. The best model obtained after hyperparameter tuning is:
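The tuning step might look like the following sketch; the parameter grid is hypothetical, since the actual search space is not shown here:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=200)

# Hypothetical search space, for illustration only.
param_grid = {
    "hidden_layer_sizes": [(32,), (64,)],
    "alpha": [1e-4, 1e-3],
}
search = GridSearchCV(MLPRegressor(max_iter=500, random_state=0),
                      param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_)  # the winning combination from the grid
```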

Model Evaluation

Mean Absolute Error (MAE) is used as the performance metric for evaluating the regression model. MAE is easy to interpret and provides a clear view of the model's performance.

The Mean Absolute Error of the model = 0.01366
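For reference, MAE is the mean of the absolute differences between predicted and true values; a quick sketch with scikit-learn on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_test = np.array([0.010, -0.005, 0.002])
y_pred = np.array([0.012, -0.004, 0.001])

# MAE = mean(|y_test - y_pred|) = (0.002 + 0.001 + 0.001) / 3
print(mean_absolute_error(y_test, y_pred))
```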