
Machine Learning Algorithms for Predicting the Force of Infection (FOI) of Japanese Encephalitis (JE)

Summary

This is a follow-up to previous research by Duy M. Nguyen and Quan M. Tran, conducted under the guidance of Dr Hannah Clapham at the National University of Singapore.

The project's objective is to predict the FOI of Japanese Encephalitis in various regions. The previous work used Random Forest, while I explored the following methods:

  • PCR (Principal Component Regression)
  • Random Forest
  • Gradient Boosting
  • XGBoost

The objective of my work is to provide Python code examples of the above methods for future colleagues. We also compare the methods, explain the hyperparameter tuning, and discuss prediction interval calculation, which is less commonly covered for machine learning.

Access the webpage version here: https://jiesun1990.github.io/Predicting_Japanese_Encephalitis/

Data

Data was generated by Duy M. Nguyen and Quan M. Tran's processing steps in Part 1, steps 1-7.

Input:

  • EM_Imputed_Features_Study.Rds
  • Imputed_Features_Endemic.Rds

Location: Generate/EM_DF/ and Generate/Inputed_DF/ in the linked repository.

Next, in R, use the supporting function Dataframe_To_CSV.R to convert the Rds files into CSV files. This function stores the following results in the Generate/Python_CSV/ folder:

  • EM_Imputed_Features_Study.csv
  • Imputed_Features_Endemic.csv

We will use these two files in the Jupyter notebooks.
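
As a rough sketch, loading the two converted files in Python could look like this (the relative paths and variable names here are assumptions, not taken from the notebooks):

import pandas as pd

# Read the CSV files produced by Dataframe_To_CSV.R
study_df = pd.read_csv("Generate/Python_CSV/EM_Imputed_Features_Study.csv")
endemic_df = pd.read_csv("Generate/Python_CSV/Imputed_Features_Endemic.csv")

print(study_df.shape, endemic_df.shape)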

Data description:

26 X variables:

  • The bioclimatic conditions in the world, derived from rainfall and temperature on a monthly basis: 'Bio_01', 'Bio_02', 'Bio_03', 'Bio_04', 'Bio_05', 'Bio_06', 'Bio_07', 'Bio_08', 'Bio_09', 'Bio_10', 'Bio_11', 'Bio_12', 'Bio_13', 'Bio_14', 'Bio_15', 'Bio_16', 'Bio_17', 'Bio_18', 'Bio_19'
  • The Japanese Encephalitis Vector Distribution: 'VD'
  • Rice distribution: 'Rice'
  • Livestock distribution: 'Pigs'
  • Population density: 'DG_000_014bt_dens'
  • Specific age group density: 'Adjusted_Pop_Count_2015'
  • Urban/Rural category: 'UR'
  • Elevation: 'Elv'

1 Y variable:

  • Force of infection (FOI).

Note that FOI ranges between 0 and 0.5. You can understand it as a kind of probability, but there is an upper limit to the FOI in this dataset.

A detailed data description can be found here.
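
As a quick illustration, the X and Y columns can be assembled in pandas as follows (the dataframe name Train_Non_NA matches the data-thinning code below; the target column name 'FOI' is an assumption):

# 19 bioclimatic columns: 'Bio_01' ... 'Bio_19'
bio_cols = [f"Bio_{i:02d}" for i in range(1, 20)]
x_cols = bio_cols + ["VD", "Rice", "Pigs", "DG_000_014bt_dens",
                     "Adjusted_Pop_Count_2015", "UR", "Elv"]

X = Train_Non_NA[x_cols]   # the 26 features
y = Train_Non_NA["FOI"]    # assumed target column name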

Method

I demonstrate four methods for this problem.

Please check the respective notebooks here: Notebooks/

On data thinning

The original file used the entire dataset for training. For demonstration, I selected only 10% of the training, validation and test sets for model building. You can comment out the code below to run on the whole dataset, given access to HPC.

# Keep a reproducible 10% subsample of each split (sampling without replacement)
Train_Non_NA = Train_Non_NA.sample(frac=0.1, replace=False, random_state=1)
Validate_Non_NA = Validate_Non_NA.sample(frac=0.1, replace=False, random_state=1)
Test_Non_NA = Test_Non_NA.sample(frac=0.1, replace=False, random_state=1)

Note that as the amount of data increases, the hyperparameters need to be re-tuned, and results may change.

On train/validation/test split

It is conventional to split data into train, validation and test sets. To recap:

  • Train: to train the model
  • Validation: to tune the parameters
  • Test: to evaluate the model performance on unseen data. We only use test data once.

In practice, many ML workflows split the data into train and test sets only: when tuning parameters via grid search with cross-validation, the algorithm already splits X_train and Y_train into training and validation folds. We then evaluate the tuned model on X_test.

As a result, I did not use the validation set created by the original data processing. Note that if you are tuning the parameters manually, one by one, you would need a separate validation set.

Depending on the size of the data, we can choose to merge the train and validation sets when using grid search. However, a test set should always be set aside to evaluate the final model.
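
As a minimal sketch of this workflow (the model choice and grid values here are illustrative, not the ones tuned in the notebooks):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a test set; the cross-validation inside GridSearchCV plays
# the role of the validation set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

param_grid = {"n_estimators": [200, 400], "max_depth": [2, 3, 4]}
search = GridSearchCV(GradientBoostingRegressor(random_state=1), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# The test set is used only once, with the tuned model
test_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))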

Discussion

XGBoost performed best among the four methods, with the lowest test MSE. However, it is very difficult to obtain suitable prediction intervals for XGBoost.

On correlated features

In general, boosting and bagging produce excellent predictions despite substantial correlation among variables. This is a great advantage of these advanced methods for healthcare-related data science problems, where multicollinearity is common.

In this problem, a number of features are correlated, as seen in 0Modeling/JE_PCA.ipynb.

(Figure: pca.png)

On feature importance

See Notebooks/JE_GradientBoosting.ipynb and Notebooks/JE_XGBoost.ipynb for a very detailed discussion on this.

In summary, decision tree models are very robust to correlated features when it comes to prediction. However, when looking at the feature importance scores:

  • in RF: some important but correlated features may receive low importance scores
  • in boosting: one of a pair of important but correlated features will be picked up, but not both

In addition, there are various types of feature importance: some are MDI-based (mean decrease in impurity), others are permutation-based. There is no consensus on which provides the more reliable indication, though some argue that permutation-based feature importance should be adopted for RF; this is noted for the reader's reference.
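
For reference, a sketch of computing both types of importance with scikit-learn (the RF settings are illustrative; X_train and X_test follow the split sketched above):

from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rf = RandomForestRegressor(n_estimators=400, random_state=1).fit(X_train, y_train)

# MDI-based importance, computed from impurity decreases at fit time
mdi_importance = rf.feature_importances_

# Permutation-based importance, computed on held-out data
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=1)
perm_importance = perm.importances_mean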

On partial dependence plots and individual conditional expectation (ICE) plots

These are tools for improving the interpretability of the model. See the notebooks for more details.
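
A minimal sketch of drawing both plot types with scikit-learn (requires scikit-learn >= 1.0; the chosen features are illustrative):

import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# kind="both" overlays the individual ICE curves on the average PD line
PartialDependenceDisplay.from_estimator(rf, X_test, features=["Bio_01", "Elv"],
                                        kind="both")
plt.show()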

On prediction intervals

Uncertainty quantification has traditionally been a statistical and mathematical focus, while machine learning is more interested in prediction. In recent years there have been increasing studies on the uncertainty of ensemble models, to understand why and how ML models work. I believe this is an important topic in the development of data science. At the moment, prediction intervals for random forests are more thoroughly studied than for other methods. For Gradient Boosting and XGBoost, quantile regression is used. See the respective notebooks for more discussion of the methods used.
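
For Gradient Boosting, a quantile-regression prediction interval can be sketched as follows (the hyperparameters are illustrative; fitting one model per quantile gives a 95% interval from the 2.5% and 97.5% quantiles):

from sklearn.ensemble import GradientBoostingRegressor

params = dict(n_estimators=400, max_depth=3, random_state=1)

# One model per quantile: lower and upper interval bounds
lower = GradientBoostingRegressor(loss="quantile", alpha=0.025, **params).fit(X_train, y_train)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.975, **params).fit(X_train, y_train)

pred_lower = lower.predict(X_test)   # 2.5% quantile
pred_upper = upper.predict(X_test)   # 97.5% quantile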

How to run

  1. Download all data & code from this Google Drive.
  • This folder contains all original data & code from Duy & Quan, with my adaptations and generated data/charts.
  2. In 0Modeling, I provide the code for all modeling.
  • Check JE_GradientBoosting.ipynb, JE_XGBoost.ipynb, JE_RandomForest.ipynb and JE_PCA.ipynb for a demonstration using 10% of the data. There are explanations inside for each step. You can run these on your laptop.
  • To run on the entire dataset, use JE_GradientBoosting_Tuned10.py, JE_XGBoost_Tuned10.py, JE_RandomForest_Tuned10.py and JE_PCA_Full.py.
  • The hyperparameters were tuned on 10% of the data and then applied to the whole dataset. They were not tuned on the whole training dataset because that took too much time and the outcomes were often less ideal (e.g. higher test MSE - counterintuitive, but true in this case). The tuned parameters appear to give stable performance for the full model.
  • The output of this step is in this folder: 0Modeling - Generate - Python_Export, separated by model.
  • The most important outputs for future use: Endemic_FOI_RF_Quantile Regression_Full_Cov_400.csv, Endemic_FOI_GB_Quantile Regression_Full_Cov_400.csv, Endemic_FOI_XGB_Full_Cov_400.csv (no prediction intervals) and Endemic_FOI_PCR_Full_Cov_400.csv.
  3. In 1Generate_Cases, I combined Duy's original code into a single file for each model: Combined_Generate_Cases_Jie.R. Users simply change the directory and run.
  • Before running, remember to put the endemic FOI files above into this folder: 1Generate_Cases - Data.
  • Output of this step: Cases_Age_xx.Rds files in the folder 1Generate_Cases - Generate - Cases. Shape files in 1Generate_Cases - Generate - Cases_SHP.
  • Visualize the map of cases in each country with shp_tmap_visualizer_Ryan.R in the folder 1Generate_Cases - Generate - Cases_SHP. Note that this map does not plot within-country variation.
  4. In 2Comparing_WHOIG, I combined Duy's original code into a single file for each model: Combine_Extract_Jie.R. When running this part, note that two directories need to be specified:
  • Set your working directory to the 2Comparing_WHOIG folder.
  • Specify the data folder to be 1Generate_Cases - Data. This step takes data entirely from the previous step.
  • Output of this step is saved in 2Comparing_WHOIG - Generate.
  • When checking the original code, Extract_Pop_Country.R did not seem to take in the results (FOIs) we generated in step 0Modeling. Quan's description mentions that this code was to 'compare population of a country between RF and WHO-IG'; however, I was not able to understand how RF could influence the population, or some previous steps may be missing. Running this step generates exactly the same result as Quan's, so I have excluded it from Combine_Extract_Jie.R.
