Machine Learning Algorithms for Predicting the Force of Infection (FOI) of Japanese Encephalitis (JE)
This is a follow-up to previous research by Duy M. Nguyen and Quan M. Tran, conducted under the guidance of Dr Hannah Clapham at the National University of Singapore.
The project's objective is to predict the FOI of Japanese Encephalitis in various regions. The previous work used Random Forest, while I explored the following methods:
- Principal Component Regression (PCR)
- Random Forest
- Gradient Boosting
- XGBoost
The objective of my work is to provide Python code examples of the above methods for future colleagues. We also compare the methods, explain the hyperparameter tuning, and discuss prediction interval calculation, a topic less commonly covered in machine learning.
Access the webpage version here: https://jiesun1990.github.io/Predicting_Japanese_Encephalitis/
Data was generated by Duy M. Nguyen and Quan M. Tran's processing pipeline in Part 1, steps 1-7.
- EM_Imputed_Features_Study.Rds
- Imputed_Features_Endemic.Rds
Location: Generate/EM_DF/ and Generate/Inputed_DF/ in the linked repository.
Next, use the supporting function Dataframe_To_CSV.R in R to convert the Rds files into CSV files. This function stores the following results in the Generate/Python_CSV/ folder.
- EM_Imputed_Features_Study.csv
- Imputed_Features_Endemic.csv
We will use these two files in our Jupyter notebooks.
26 X variables:
- The bioclimatic conditions of the world, derived from monthly rainfall and temperature: 'Bio_01', 'Bio_02', 'Bio_03', 'Bio_04', 'Bio_05', 'Bio_06', 'Bio_07', 'Bio_08', 'Bio_09', 'Bio_10', 'Bio_11', 'Bio_12', 'Bio_13', 'Bio_14', 'Bio_15', 'Bio_16', 'Bio_17', 'Bio_18', 'Bio_19'
- The Japanese Encephalitis Vector Distribution: 'VD'
- Rice distribution: 'Rice'
- Livestock distribution: 'Pigs'
- Population density: 'DG_000_014bt_dens'
- Specific age group density: 'Adjusted_Pop_Count_2015'
- Urban/Rural category: 'UR'
- Elevation: 'Elv'
1 Y variable:
- Force of infection (FOI).
Note that FOI ranges from 0 to 0.5. It can be understood as a kind of probability, but with an upper limit of 0.5 in this dataset.
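As a minimal sketch of how the feature matrix and bounded target can be assembled: the snippet below uses a one-row synthetic stand-in for the real CSV (the commented `read_csv` path reflects the folder described above), and clips example predictions back into the valid FOI range.

```python
import io
import numpy as np
import pandas as pd

# The 26 predictors listed above
FEATURES = [f"Bio_{i:02d}" for i in range(1, 20)] + [
    "VD", "Rice", "Pigs", "DG_000_014bt_dens",
    "Adjusted_Pop_Count_2015", "UR", "Elv",
]

# In the notebooks this would be:
#   df = pd.read_csv("Generate/Python_CSV/EM_Imputed_Features_Study.csv")
# A one-row stand-in illustrates the expected layout:
df = pd.read_csv(io.StringIO(
    ",".join(FEATURES + ["FOI"]) + "\n" + ",".join(["0.1"] * 27)
))

X, y = df[FEATURES], df["FOI"]

# FOI lies in [0, 0.5], so model output can be clipped back into that range
preds = np.clip(np.array([-0.02, 0.12, 0.61]), 0.0, 0.5)
```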
Detailed data description can be found here.
I demonstrate four methods for this problem.
Please check the respective notebook here: Notebooks/
The original file used the entire dataset for training. For demonstration, I selected only 10% of the training, test and validation sets for model building. You can comment out the code below to run on the whole dataset, given access to HPC.
Train_Non_NA = Train_Non_NA.sample(frac=0.1, replace=False, random_state=1)
Validate_Non_NA = Validate_Non_NA.sample(frac=0.1, replace=False, random_state=1)
Test_Non_NA = Test_Non_NA.sample(frac=0.1, replace=False, random_state=1)
Note that with increasing data, the hyperparameters need to be re-tuned. Results may change.
It is conventional to split data into train, validation and test sets. To recap:
- Train: to train the model
- Validation: to tune the parameters
- Test: to evaluate the model performance on unseen data. We only use test data once.
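The three-way split described above can be sketched with scikit-learn as follows (toy data; the 60/20/20 fractions are illustrative, not necessarily those used in the original processing):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = rng.rand(100, 26)     # stand-in for the 26 predictors
y = rng.rand(100) * 0.5   # stand-in FOI, in [0, 0.5]

# First carve out the test set, then split the remainder into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)
# Result: 60% train, 20% validation, 20% test
```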
In practice, many ML workflows split the data into train and test sets only: when tuning parameters via grid search with cross-validation, the algorithm already splits X_train and Y_train into train and validation folds. We then evaluate the tuned model on X_test.
As a result, I did not utilise the validation set created by the original data processing. Note that if you are tuning the parameters manually, one at a time, you would need a separate validation set.
Depending on the size of the data, we can choose to merge the train and validation set when using grid search. However, a test set should always be prepared to evaluate the final model.
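A sketch of this train/test-only workflow with scikit-learn's `GridSearchCV`, on toy data with an illustrative parameter grid: the cross-validation folds inside the grid search play the role of the validation set, and the test set is touched exactly once at the end.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.RandomState(1)
X = rng.rand(200, 5)
y = X[:, 0] * 0.4 + rng.rand(200) * 0.1  # toy target, roughly in [0, 0.5]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# 5-fold CV inside the grid search acts as the validation split
search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# The test set is used exactly once, with the tuned model
test_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))
```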
XGBoost performed the best among the four methods, with the lowest test MSE. However, it is very difficult to obtain a suitable prediction interval for XGBoost.
In general, boosting and bagging produce excellent predictions despite strong correlation among variables. This is a great advantage of these ensemble methods for healthcare-related data science problems, where multicollinearity is common.
In this problem, a number of features are correlated, as seen in 0Modeling/JE_PCA.ipynb.
See Notebooks/JE_GradientBoosting.ipynb and Notebooks/JE_XGBoost.ipynb for a very detailed discussion on this.
In summary, decision tree models are very robust to correlated features when it comes to prediction. However, when looking at the feature importance scores:
- in RF: some important but correlated features may receive a low importance score
- in boosting: only one of a pair of important but correlated features will be picked up, not both
In addition, there are various types of feature importance: some are MDI-based (mean decrease in impurity), others are permutation-based. There is no consensus on which provides the more reliable indication, but there are views that permutation-based feature importance should be adopted for RF, noted here for the reader's reference.
These measures improve the interpretability of the model. See the notebooks for more details.
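As a sketch of the two kinds of importance score, the toy example below builds two deliberately correlated informative features plus one noise feature, then compares scikit-learn's MDI-based scores with `permutation_importance`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(1)
x1 = rng.rand(300)
X = np.column_stack([
    x1,                                # informative feature
    x1 + rng.normal(0, 0.01, 300),     # near-duplicate of the first
    rng.rand(300),                     # pure noise
])
y = x1 * 0.5  # only the first feature (and its correlated copy) matters

model = RandomForestRegressor(random_state=1).fit(X, y)

# MDI-based importance, computed from impurity decreases during training
print(model.feature_importances_)

# Permutation importance: shuffle each column and measure the score drop
result = permutation_importance(model, X, y, n_repeats=5, random_state=1)
print(result.importances_mean)
```

Note how the two correlated columns share the credit between them, while the noise column scores near zero under both measures.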
Uncertainty quantification has traditionally been a focus of statistics and mathematics, while machine learning is more interested in prediction. In recent years there have been increasing studies on the uncertainty of ensemble models, to understand why and how ML models work. I believe this is an important topic in the development of data science.
At the moment, the prediction interval for random forest is more thoroughly studied than for other methods. For Gradient Boosting and XGBoost, quantile regression is used.
See respective notebooks for more discussion on methods used.
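For illustration, a minimal sketch of the quantile-loss approach with scikit-learn's `GradientBoostingRegressor` on toy data: fitting the 2.5% and 97.5% quantiles gives an approximate 95% prediction interval.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(1)
X = rng.rand(500, 1)
y = 0.3 * X[:, 0] + rng.normal(0, 0.05, 500)  # toy FOI-like target with noise

# One model per quantile: lower bound, median, upper bound
models = {
    q: GradientBoostingRegressor(
        loss="quantile", alpha=q, random_state=1
    ).fit(X, y)
    for q in (0.025, 0.5, 0.975)
}

X_new = np.array([[0.5]])
lower = models[0.025].predict(X_new)[0]
upper = models[0.975].predict(X_new)[0]
# (lower, upper) is the approximate 95% prediction interval at X_new
```

A design caveat: since each quantile is fitted independently, the estimated quantiles can occasionally cross, which is one reason prediction intervals for boosting are less settled than for random forests.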
- Download all data & code from this Google Drive.
- This folder contains all original data & code from Duy & Quan, with my adaptations and generated data/charts.
- In `0Modeling`, I provided the code for all modeling.
- Check `JE_GradientBoosting.ipynb`, `JE_XGBoost.ipynb`, `JE_RandomForest.ipynb` and `JE_PCA.ipynb` for a demonstration using 10% of the data. There are explanations inside for each step. You can run them on your laptop.
- To run on the entire dataset, use `JE_GradientBoosting_Tuned10.py`, `JE_XGBoost_Tuned10.py`, `JE_RandomForest_Tuned10.py` and `JE_PCA_Full.py`.
- The hyperparameters were tuned on 10% of the data and then applied to the whole dataset. They were not tuned on the whole training dataset because that took too much time and the outcomes were often less ideal (e.g. high test MSE; this is counterintuitive, but true in this case). The tuned parameters appear to result in stable performance for the full model.
- The output of this step is in the folder `0Modeling - Generate - Python_Export`, separated by model.
- The most important outputs for future use: `Endemic_FOI_RF_Quantile Regression_Full_Cov_400.csv`, `Endemic_FOI_GB_Quantile Regression_Full_Cov_400.csv`, `Endemic_FOI_XGB_Full_Cov_400.csv` (no prediction intervals) and `Endemic_FOI_PCR_Full_Cov_400.csv`.
- In `1Generate_Cases`, I combined Duy's original code into a single file for each model: `Combined_Generate_Cases_Jie.R`. Users simply change the directory and run.
- Before running, remember to put the above endemic FOI files in the folder `1Generate_Cases - Data`.
- Output of this step: `Cases_Age_xx.Rds` files in the folder `1Generate_Cases - Generate - Cases`. Shape files are in `1Generate_Cases - Generate - Cases_SHP`.
- Visualize the map of cases in each country with `shp_tmap_visualizer_Ryan.R` in the folder `1Generate_Cases - Generate - Cases_SHP`. Note that this map does not plot within-country variations.
- In `2Comparing_WHOIG`, I combined Duy's original code into a single file for each model: `Combine_Extract_Jie.R`. When running this part, note that two directories need to be specified:
  - Set your working directory to the folder `2Comparing_WHOIG`.
  - Specify the data folder to be `1Generate_Cases - Data`. This step takes data entirely from the previous step.
- Output of this step is saved in `2Comparing_WHOIG - Generate`.
- When checking the original code, `Extract_Pop_Country.R` did not seem to take in the results (FOIs) we generated in step `0Modeling`. Quan's description mentions that this code was to 'compare population of a country between RF and WHO-IG'; however, I wasn't able to understand how RF could influence the population, or some previous steps could be missing. Running this step generates exactly the same result as Quan's, so I have excluded it from `Combine_Extract_Jie.R`.