chawla201/The-Winton-Stock-Market-Challenge

The Winton Stock Market Challenge

The Winton Stock Market Challenge was a competition hosted by Winton on Kaggle in 2016. The main task of this competition was to predict the interday and intraday returns of a stock, given its history over the past few days.
NOTE:
To view the final code with the interactive graphs, click here

tl;dr

  • Developed a data pre-processing pipeline
  • Tuned and trained a multi-output multi-layer perceptron regression model to predict stock returns from the past two days' returns and a set of features

Data

In this competition the challenge is to predict the return of a stock, given the history of the past few days.

We provide 5-day windows of time, days D-2, D-1, D, D+1, and D+2. You are given returns in days D-2, D-1, and part of day D, and you are asked to predict the returns in the rest of day D, and in days D+1 and D+2.

During day D, there is intraday return data, which are the returns at different points in the day. We provide 180 minutes of data, from t=1 to t=180. In the training set you are given the full 180 minutes, in the test set just the first 120 minutes are provided.

For each 5-day window, we also provide 25 features, Feature_1 to Feature_25. These may or may not be useful in your prediction.

Each row in the dataset is an arbitrary stock in an arbitrary 5-day time window.
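The layout above can be sketched with the column-name conventions from the competition description (these names are assumptions based on that description, not read from the actual files):

```python
# Sketch of the expected training-set layout. Column names are assumptions
# based on the competition description, not verified against the files.
feature_cols = [f"Feature_{i}" for i in range(1, 26)]   # 25 features
interday_cols = ["Ret_MinusTwo", "Ret_MinusOne"]        # days D-2 and D-1
intraday_cols = [f"Ret_{t}" for t in range(2, 181)]     # minutes t=2..180

# In the test set only the first 120 minutes are observed;
# the remaining minutes and days D+1, D+2 are predicted.
observed_intraday = [c for c in intraday_cols if int(c.split("_")[1]) <= 120]
print(len(feature_cols), len(intraday_cols), len(observed_intraday))
# → 25 179 119
```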


Technologies Used

  • Python
  • Pandas
  • Numpy
  • Matplotlib
  • Seaborn
  • Plotly
  • Scikit Learn
  • Principal Component Analysis
  • Iterative Imputer
  • Random Forest Regressor
  • Multi-layer Perceptron Regressor
  • Multi Output Regressor

Exploratory Data Analysis

Exploratory Data Analysis is performed to explore the structure of the data and to identify categorical and continuous data fields, missing values, and correlations among different data columns

Correlation heatmap between different features:

Feature Engineering

As observed in the correlation heatmap above, many features are strongly correlated with each other. This means that it is possible to apply dimensionality-reduction methods such as Principal Component Analysis.
Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, often keeping only the first few principal components and ignoring the rest.
The optimal number of principal components can be found by examining the cumulative explained variance for different numbers of components; the smallest set whose cumulative explained variance approaches one is chosen.
Here we can observe that the optimal number of components is 12
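A minimal sketch of this selection rule, using scikit-learn's PCA on synthetic correlated data (the real feature matrix is not reproduced here, and the 0.999 cutoff is an illustrative threshold, not the author's):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the correlated features: 25 observed columns
# generated from 12 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 12))
mixing = rng.normal(size=(12, 25))
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 25))

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative explained variance
# is essentially one (0.999 here).
n_components = int(np.searchsorted(cum_var, 0.999) + 1)
print(n_components)
```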
To simplify the problem, the intraday returns are aggregated into a sum and a standard deviation, for both the features (Ret_2 to Ret_120) and the target labels (Ret_121 to Ret_180). The standard deviation of the interday returns is also considered to see how much the returns vary.
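A sketch of that aggregation with pandas, on a toy frame that follows the Ret_<minute> naming (the aggregate column names here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real intraday data.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(scale=0.001, size=(4, 179)),
                  columns=[f"Ret_{t}" for t in range(2, 181)])

feat_cols = [f"Ret_{t}" for t in range(2, 121)]      # observed minutes
target_cols = [f"Ret_{t}" for t in range(121, 181)]  # minutes to predict

# Collapse each group of intraday returns into a sum and a standard deviation.
df["Ret_Feat_Sum"] = df[feat_cols].sum(axis=1)
df["Ret_Feat_Std"] = df[feat_cols].std(axis=1)
df["Ret_Target_Sum"] = df[target_cols].sum(axis=1)
df["Ret_Target_Std"] = df[target_cols].std(axis=1)
print(df[["Ret_Feat_Sum", "Ret_Feat_Std",
          "Ret_Target_Sum", "Ret_Target_Std"]].shape)  # → (4, 4)
```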

Model Building

After imputing missing values and applying Principal Component Analysis to the numerical data columns, the categorical data was transformed into dummy-variable columns using Pandas' get_dummies() function.
The data was split into training (70%) and testing (30%) data.
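The pre-processing steps above can be sketched as follows. The data and column names are synthetic stand-ins, but the calls (IterativeImputer, PCA, get_dummies, a 70/30 train_test_split) match what the text describes:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Toy numerical data with injected missing values, plus one
# categorical column (names here are hypothetical).
rng = np.random.default_rng(2)
num = pd.DataFrame(rng.normal(size=(200, 6)),
                   columns=[f"Feature_{i}" for i in range(1, 7)])
num.iloc[::13, 2] = np.nan
cat = pd.Series(rng.choice(["A", "B", "C"], 200), name="Feature_7")

imputed = IterativeImputer(random_state=0).fit_transform(num)  # fill NaNs
reduced = PCA(n_components=4).fit_transform(imputed)           # reduce dims
X = np.hstack([reduced, pd.get_dummies(cat).to_numpy()])       # add dummies
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 70/30 split as in the text
print(X_train.shape, X_test.shape)
```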
I tried two different models:

  • Random Forest Regressor: as a baseline model
  • Multi-Layer Perceptron Regressor (MLPRegressor): since the data contains feature values of different ranges, I thought a multi-layer perceptron model would be robust to those variations
Since the problem statement requires predicting multiple target values, MultiOutputRegressor is used.
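A minimal sketch of wrapping an MLPRegressor in MultiOutputRegressor on synthetic data (the hyperparameters here are placeholders, not the tuned values):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
# Four linear targets stand in for the multiple returns to predict.
Y = np.column_stack([X @ rng.normal(size=10) for _ in range(4)])

model = MultiOutputRegressor(
    MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0))
model.fit(X, Y)
print(model.predict(X[:5]).shape)  # one column per target
```

MultiOutputRegressor fits one independent regressor per target column; note that MLPRegressor can also emit multiple outputs natively, so the wrapper is a design choice rather than a requirement.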
y_test (blue) vs. y_pred (orange) for first 500 data points in test data
MLP Regressor
Random Forest Regressor

Hyperparameter Tuning

As seen in the graphs above, the prediction lines for the Random Forest Regressor are mostly flat with a few sparse peaks, while the Multi-Layer Perceptron Regressor shows far better results. Thus only the Multi-Layer Perceptron Regressor underwent hyperparameter tuning. Grid Search Cross-Validation is used to fine-tune the regression model. The best model obtained after hyperparameter tuning is:
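The tuning step might look like the following sketch; the parameter grid is hypothetical, since the actual search space is not shown here:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=200)

# Hypothetical search space, for illustration only.
param_grid = {
    "hidden_layer_sizes": [(32,), (64,)],
    "alpha": [1e-4, 1e-3],
}
search = GridSearchCV(MLPRegressor(max_iter=500, random_state=0),
                      param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_)  # the winning combination from the grid
```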

Model Evaluation

Mean Absolute Error (MAE) is used as the performance metric for evaluating the regression model. MAE is easy to interpret and provides a clear view of the model's performance.

The Mean Absolute Error of the model = 0.01366
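For reference, MAE is the mean of the absolute differences between predicted and true values; a quick sketch with scikit-learn on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_test = np.array([0.010, -0.005, 0.002])
y_pred = np.array([0.012, -0.004, 0.001])

# MAE = mean(|y_test - y_pred|) = (0.002 + 0.001 + 0.001) / 3
print(mean_absolute_error(y_test, y_pred))
```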