Skip to content

kanevundi/Tanzania-Wells-Prediction-Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TANZANIA WATER WELLS

1684951756320

Written by Kane Vundi

Phase 3 Project

The full project description can be found here.

Overview

Access to clean and safe drinking water is a fundamental human right and a critical aspect of sustainable development. In many regions of the world, including Tanzania, millions of people still lack access to reliable water sources, leading to numerous health, economic, and social challenges. Addressing this issue requires effective planning and resource allocation to ensure the sustainable provision of clean water to communities in need.

In the context of Tanzania, where water scarcity and inadequate infrastructure remain persistent problems, predictive modeling emerges as a promising approach to enhance water resource management and optimize the drilling of new wells. By leveraging historical data, advanced statistical techniques, and machine learning algorithms, we aim to develop a robust predictive model that can accurately estimate the potential success and yield of water wells across different regions in Tanzania.


1. Business Understanding

Problem Statement

Despite efforts to improve water access in Tanzania, significant challenges persist in ensuring sustainable and reliable access to clean water for all communities. The lack of accurate predictive models for water wells hampers efficient planning and resource allocation, resulting in suboptimal drilling locations, unreliable well yields, and inadequate maintenance strategies. As a result, communities continue to face water scarcity, health risks from contaminated water sources, and economic hardships.

The existing approach to drilling water wells in Tanzania relies heavily on experience and local knowledge, which often leads to inconsistent results and ineffective resource utilization. Moreover, the scarcity of reliable data and the complex interplay of geological, hydrological, and climatic factors pose additional barriers to identifying suitable well locations and estimating their potential yield accurately.

Objectives

  • Utilize advanced statistical techniques and machine learning algorithms to develop a robust predictive model for water wells in Tanzania. The model should accurately estimate the potential success rate and yield of water wells based on the identified influential factors.
  • Conduct a thorough analysis of the dataset to identify the key factors that significantly impact the functionality of water wells in Tanzania.

Research Questions

  • Which statistical techniques and machine learning algorithms are most suitable for building an accurate predictive model for water wells in Tanzania?
  • How can the predictive model be validated and refined to ensure its reliability and practicality?

Success Criteria

Business Success Criteria

  • The predictive modeling project should lead to a measurable increase in the success rate of water well installations in Tanzania.
  • The project should correctly identify functionality of a well and determine its viability.
  • The success of the project should be measured by the satisfaction and adoption of the predictive model by key stakeholders, including policymakers, government agencies, and humanitarian organizations.
  • The project should contribute to the long-term sustainability of water access in Tanzania. This can be assessed by evaluating the scalability and replicability of the predictive model.

Project Success Criteria

Generating a model that will be able to correctly predict the quality status of the wells in Tanzania with an accuracy of 80%.


2. Data Understanding

In this project we shall use a dataset containing information about existing water wells in Tanzania sourced from an ongoing DrivenData competition.

2.1 Data Description

The dataset contains 59,400 records and spans 40 columns. Of these columns, we identified 31 to be categorical, and 9 as numerical. We were able to further group the columns into the general features being captured. A description of all features can be found here.


3. Data Preparation

This dataset analysis and classification will help our stakeholders to improve maintenance of the present water wells or give useful information for future wells.

3.1 Data Selection

I analyzed the data to identify the most relevant features that significantly influence the functinality of water wells. I achieved this by exploratory data analysis, correlation analysis, domain knowledge and selected a subset of features that exhibit strong relationships with well outcomes and discard irrelevant or redundant variables.

3.2 Data Cleaning

A number of the features overlap, including waterpoint_type and waterpoint_type_group as well as extraction_type, extraction_type_group, and extraction_type_class. Due to the same information being represented at several levels, these would cause multicollinearity concerns, hence in order to decrease the dimensionality of the data, the most generic columns for each overlapped column were retained.

Checking for validity in the data, the dataset was checked for any duplicated values and outliers. The duplicated records were not dropped as it does not mean that they were similar wells but just built under the same project. We also chose not to drop the outliers, as it did not display erroneous data but will be further looked into in the analysis section. After cleaning the dataset, we need to bring uniformity by formatting and the columns to be readable and easily interpretable. So we defined functions to make these possible. Moreover, we verified that the values of various columns are consistent. The names of installers and funders had numerous instances of the same funder and/or installer, but with misspelt names and/or typographical errors.


4. Modeling

The "status_group" variable was my target I changed it from ternery into a binary format, typically indicating whether a water well is functional or non-functional, the categories "non- functional" and "functional needs repair" can be combined into a single category representing non-functional wells, while "functional" remains unchanged. I Label-Encoded the categorical variables and categorised the target. Our baseline model was Dummy Classifier with the StandardScaler as our scaler, just to test how our chosen attributes would perform on a basic level. We then explored other more sophisticated models like the Decision trees, K-Nearest -Neighbours, SVM and Random Forest.


5. Evaluation

I used pipelines to scale the data and then fit it to the model. We then used cross validation to ensure that our model was not overfitting. We also used a confusion matrix to evaluate the performance of our mode. Our model performed moderately well predicting at most 78% of the functionality.


6. Conclusion

The accuracy of my random forest classifier and KNearest Neighbors Classifier was 78% and 77% respectively. While it is still a good predictive model, I would like to undertake further feature engineering to boost the recall scores if I had more time. I achieved our bjectives to be able to predict the functional wells and had a conclusive accuracy score.

7. Recommendations

The Lake Rukwa basin area, where there are considerably more non-functional wells than functional ones, may be worth considering for our stakeholders when they decide to build more wells in Tanzania. There are more non-functional wells than functioning ones in the Dodoma region; this situation has to be investigated. Over the course of time, wells with operating permission typically become more viable and useful than those without. Due to possible public misuse, unpaid wells frequently become inoperable; perhaps creating a reasonable payment plan will help stop this. Because wells without permissions also have a higher likelihood of becoming inoperable, our stakeholder must verify that they have permits to ensure that they are acceptable.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published