- Build a model to predict the number of goals in a match according to match events.
- The dataset provides a granular view of 9,074 games, totaling 941,009 events from the biggest 5 European football (soccer) leagues: England, Spain, Germany, Italy, France from 2011/2012 season to 2016/2017 season.
-
events.csv contains event data about each game.
-
ginf.csv contains metadata and market odds about each game
-
dictionary.txt contains a dictionary with the textual description of each categorical variable coded with integers.
-
Event type: Corner, Foul, Substitution, Red card, Yellow card, Hand ball, Offside, etc.
-
Location: Centre of the box, Outside the box , Left side of the six yard box, Long range, etc.
-
Shoot place: Too high, Bit Too high, Bottom left corner, Top centre of the goal , etc.
-
shoot outcome: On target, OFF target, Blocked or Hit the bar.
-
Separate each event into home or away match according to data-set.
-
Missing data: we put any missing data equal -1, because data have integers and zeroes values.
-
One hot encoding Values.
-
Labels: labels will be number of goals per match (home and away matches).
-
Vectored: 5 features have been selected (6 features have been selected but I use side as anther ID to separate matches have the same ID match), In every match there 180 events (maximum number of events per match). so input is a 2-d array size (5 X 180), by vectorized it into 1-d array (1 X 900).
-
Recurrent Neural Network (RNN) as a classifier.
-
Inputs 2d-array has size (number of matches X Vectored features events).
-
Labels: numbers of goal in a match.
-
Output: prediction number of goals in this match (Float number we round it).
-
Training data: Events of 2000 matches.
-
Test data: Events of Events100 matches.
-
Accuracy: round(prediction)== label.
-
Loss function: Mean Square Error (MSE).
-
Optimizer: Adam.
-
Device: CPU.
-
Batch size: 100
-
Learning rate: 1e-3
-
Weight decay: 1e-3
Home with one-hot | Away with one-hot | Home without one-hot | Away without one-hot | |
---|---|---|---|---|
Accurcy | 1.726 % | 2.27 % | 56.017 % | 60.937 % |
Training time (minutes) | 84.06 | 83.046 | 83.076 | 80.479 |
-
One-hot encoding increases the training time and reduces accuracy significantly because when we use one-hot encoding we increase the variance between values which change the correlation between data.
-
Recommend to use LSTM.