Predicting Malaria in Blood Cells

Important Note: The main part of this notebook, Predicting_Malaria.ipynb was run and executed in Google Colaboratory. There are instructions to recreate the notebook in that file.

Business Understanding

According to the latest World Malaria Report published by the World Health Organization, there were 247 million cases of malaria in 2021 compared to 245 million cases in 2020. The estimated number of malaria deaths stood at 619,000 in 2021 compared to 625,000 in 2020. An early diagnosis and subsequently early treatment of malaria will help doctors practicing in areas with high rates of malaria infection and malaria deaths. Four African countries accounted for over half of all malaria deaths worldwide: Nigeria (31.3%), the Democratic Republic of the Congo (12.6%), United Republic of Tanzania (4.1%) and Niger (3.9%).

Business Problem

The WHO | Regional Office for Africa recognizes that malaria is going undiagnosed and subsequently untreated in areas where the parasite is prevalent and the resources to diagnose and treat it are the lowest. The WHO wants to create a model that can accurately predict whether or not a cell from stained blood smear is infected with malaria in order to more effectively diagnose and treat malaria in the population.

(Auxiliary data exploration can be found in this notebook.)

This application can save lives. According to the CDC: in an ideal situation malaria treatment should not be initiated until the diagnosis has been established by laboratory testing. “Presumptive treatment”, i.e., without prior laboratory confirmation, should be reserved for extreme circumstances, such as strong clinical suspicion of severe disease in a setting where prompt laboratory diagnosis is not available. Doctors will still be needed to take blood and provide treatments. Histologists will still be required to prepare slides and confirm the diagnoses. This technology will simply make their operations more effiecient and allow them to dianose and treat more patients.

Diagnosis

Malaria parasites can be identified by examining under the microscope a drop of the patient’s blood, spread out as a “blood smear” on a microscope slide. Prior to examination, the specimen is stained (most often with the Giemsa stain) to give the parasites a distinctive appearance. This technique remains the gold standard for laboratory confirmation of malaria. However, it depends on the quality of the reagents, of the microscope, and on the experience of the laboratorian.

In the case of identifying cells parasitized by malaria, the Giemsa stain is particularly useful because the stain binds to the parasite's chromatin and makes it stand out under a microscope.

Cost of Errors

The CDC states that Malaria must be recognized promptly in order to treat the patient in time and to prevent further spread of infection in the community via local mosquitoes. Malaria should be considered a potential medical emergency and should be treated accordingly. Delay in diagnosis and treatment is a leading cause of death in malaria patients in the United States.

When considering the diagnosis of malaria, false negatives are more costly than false positives for a few reasons:

Treatment is relatively cheap (USD $3-6 as of 2013)
Side effects are minimal
Undiagnosed malaria can lead community transmission and eventually to death

Recall will be a very important metric when evaluating the models as the goal is minimizing false negatives.

Exploring Data

The data originally comes from the National Institute of Health's National Library of Medicine (NLM - NIH). It can be found at TensorFlow or Kaggle. The data consists of 27,558 cell images with equal instances of parasitized and uninfected cells from the thin blood smear slide images of segmented cells. Having equal samples is important in the training of this model to avoid class bias in predictions generated by the model.

Note: I have constructed smaller datasets to require less processing power while running the notebook. These datasets also have equal instances of parasitized and uninfected cells.

I have also brought in auxiliary data that is not used in the modeling process. It's used to generate statistics and visualizations about malaria cases and deaths from around the world. This data is provided by the WHO and can be found in the following places:

Data Preprocesses & Augmentation

Resizing images normalizes the input sizes which will regularize the training process while rescaling images helps the CNN to learn more effectively.

Using this data augmentation will help avoid overfitting by creating unseen training examples from the existing ones, thereby increasing the size of the training dataset.

Baseline Model

The data I use for this problem is evenly balanced. A baseline model, choosing all cells to 'Uninfected' results in an accuracy of 50%.

Convolutional Neural Network

I decided to build and train a Convolutional Neural Network (CNN) for this problem because it effectively learns from spatial features in images such as edges, corners, and textures. The CNN classifies the images based on these features and is typically very successful in image classification problems like this.

Model 1

Parameters

Optimizer: adam
Loss: binary crossentropy
Metrics: accuracy, false negatives
Total params: 6,479,873
Trainable params: 6,479,873
Non-trainable params: 0

Model 2

This model has the same structure but adds in a data augmentation layer which will peform a random flip and random rotation on the image.

Parameters

Optimizer: adam
Loss: binary crossentropy
Metrics: accuracy, false negatives
Total params: 6,479,873
Trainable params: 6,479,873
Non-trainable params: 0

Model 3

Parameters

Optimizer: adam
Loss: binary crossentropy
Metrics: accuracy, false negatives
Total params: 6,747,265
Trainable params: 6,744,897
Non-trainable params: 2,368

Model 4

Parameters

Optimizer: adam
Loss: binary crossentropy
Metrics: accuracy, false negatives
Total params: 1,246,305
Trainable params: 1,246,305
Non-trainable params: 0

Model 5

Back to structure of Model 2 but increasing number of epochs.

Compile

Parameters

optimizer: adam
loss: binary crossentropy
metrics: accuracy, false negatives
Total params: 6,479,873
Trainable params: 6,479,873
Non-trainable params: 0

Model 6

Parameters

optimizer: adam
loss: binary crossentropy
metrics: accuracy, false negatives
Total params: 67,373,441
Trainable params: 67,373,441
Non-trainable params: 0

Final Model

Using structure from Model 2, training on over 19,000 images. Validated with 5,500 images.

Parameters

optimizer: adam
loss: binary crossentropy
metrics: accuracy, false negatives
Total params: 6,479,873
Trainable params: 6,479,873
Non-trainable params: 0

Results

Tested with 2,700 unseen images with results of:

Accuracy:0.9655172228813171
Precision:0.9766213893890381
Recall:0.9536082744598389

Futher Exploration / Next Steps

I would like to collect more data and retrain the model.
Create a new feature for the application. This will allow the user to submit an image of an entire blood smear with many blood cells, split that image into separate images of individual cells that can be used as input to the model.
The model will now be able to deliver estimated parasitic burden which is used by clinicians to make decisions regarding treatment for malaria cases.

Conclusion

This new tool will rapidly and accurately diagnose potential cases of Malaria, estimate parasitic burden, and will allow for the early treatment of more malaria cases, greatly reducing community transmission and saving lives around the world.

├── application
│   ├── pages
│   │   ├── 2_Data_Summary.py
│   │   ├── 3_Model_Prediction.py
│   ├── model5.h5
│   └── requirements.txt
├── data
│   ├── Unseen Data
│   ├── confirmed_cases_malaria.csv
│   ├── estimated_cases_malaria.csv
│   └── estimated_deaths_malaria.csv
├── images
│   ├── conf_case_by_year.jpeg
│   ├── est_case_by_year.jpeg
│   ├── est_death_by_year.jpeg
│   ├── example_data.jpeg
│   ├── header.jpeg
│   ├── image_augmentation.jpeg
│   ├── jf.jpeg
│   ├── mal_cells.jpg
│   └── map_conf_cases.jpeg
├── .gitignore
├── LICENSE
├── Predicting_Malaria.ipynb
├── README.md
├── auxiliary_data.ipynb
└── predicting_malaria_slides.pdf

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
application		application
data		data
images		images
.gitignore		.gitignore
LICENSE		LICENSE
Predicting_Malaria.ipynb		Predicting_Malaria.ipynb
README.md		README.md
auxiliary_data.ipynb		auxiliary_data.ipynb
predicting_malaria_slides.pdf		predicting_malaria_slides.pdf

License

fetterollie/Malaria-Data-Exploration

Folders and files

Latest commit

History

Repository files navigation

Predicting Malaria in Blood Cells

Business Understanding

Business Problem

Diagnosis

Cost of Errors

Exploring Data

Data Preprocesses & Augmentation

Baseline Model

Convolutional Neural Network

Model 1

Model 2

Model 3

Model 4

Model 5

Compile

Model 6

Final Model

Results

Futher Exploration / Next Steps

Conclusion

About

Topics

Resources

License

Stars

Watchers

Forks

Languages