Oil Production Prediction

This is a fictional project for study purposes. The business context and the insights are not real.

1. Description of the Business Problem

Production prediction is one of the core problems in a company. The provided dataset covers a set of nearby wells located in the United States and their 12-month cumulative production. The company needs a production prediction model as one of the tools that support its decisions. The company's data scientist therefore needs to build a model from scratch to predict production and show the manager that it performs well on unseen data.

The tools that were created:

Machine Learning Regression Model: A machine learning regression model was created from the dataset provided by the company, to be used for future predictions.

The notebook used to create the model is available here.

Streamlit App for Production Prediction: The model is deployed on the Streamlit Cloud and can be used through the Streamlit app. The app is available here.

2. Dataset Attributes

Attribute	Description
treatment company	The company that provides the treatment service.
azimuth	Well drilling direction.
md (ft)	Measured depth.
tvd (ft)	True vertical depth.
date on production	First production date.
operator	The well operator that performs the drilling service.
footage lateral length	Horizontal well section.
well spacing	Distance to the closest nearby well.
porpoise deviation	Maximum deviation (in ft) of the well from its horizontal.
porpoise count	How many times the deviations (porpoises) occurred.
shale footage	Amount of shale (in ft) encountered in a horizontal well.
acoustic impedance	The impedance of a reservoir rock (ft/s * g/cc).
log permeability	Logarithm of permeability, the rock property that indicates how easily fluids (gas or liquid) flow through the rock.
porosity	The percentage of void space in a rock.
poisson ratio	The ratio of lateral strain to axial strain in the linearly elastic region.
water saturation	The ratio of water volume to pore volume.
toc	Total Organic Carbon, indicates the organic richness (hydrocarbon generative potential) of a reservoir rock.
vcl	The amount of clay minerals in a reservoir rock.
p-velocity	The velocity of P-waves (compressional waves) through a reservoir rock (ft/s).
s-velocity	The velocity of S-waves (shear waves) through a reservoir rock (ft/s).
youngs modulus	The ratio of the applied stress to the fractional extension (or shortening) of the reservoir rock parallel to the tension (or compression) (gigapascals).
isip	Instantaneous shut-in pressure (ISIP): the pressure that remains right after the pumps are quickly stopped, when the fluids stop moving and the friction pressures disappear.
breakdown pressure	The pressure at which a hydraulic fracture is created/initiated/induced.
pump rate	The volume of liquid that travels through the pump in a given time.
total number of stages	Total stages used to fracture the horizontal section of the well.
proppant volume	The amount of proppant in pounds used in the completion of a well (lbs).
proppant fluid ratio	The ratio of proppant volume/fluid volume (lbs/gallon).
production	The 12 months cumulative gas production (mmcf).

3. Solution Strategy

Understand the Business problem.
Clean the dataset by removing outliers, NA values and unnecessary features.
Explore the data to create hypotheses, think about a few insights and validate them.
Prepare the data for the modeling algorithms by encoding variables, splitting it into train and test sets, and performing other necessary operations.
Create the models using machine learning algorithms.
Evaluate the created models to find the one that best fits the problem.
Tune the model to achieve a better performance.
Deploy the model in production so that it is available to other people.
Find possible improvements to be explored in the future.

4. The Insights

I1: Wells with a greater number of stages produce more.

True: This relationship doesn't hold for all values of total number of stages, but it tends to be true.

I2: Wells that started producing longer ago produce less.

True: Newer wells tend to produce more.

I3: Wells that are farther from the others produce more.

False: The production doesn't increase according to the distance from other wells.

I4: Wells in which more proppant was used produce more.

True: More proppant is associated with greater production.

I5: Wells in which the rocks have higher values of porosity produce more.

False: More porosity does not mean more production.
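
The README does not say how these hypotheses were validated. As one possible check, a rank correlation between a feature and production indicates whether a relationship "tends to be true" without requiring it to be linear. The sketch below computes a Spearman correlation with the standard library only; the helper names and the toy numbers are illustrative assumptions, not values from the dataset.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy, made-up values (NOT from the project dataset).
stages = [10, 20, 30, 40, 50]
production = [500, 900, 800, 1500, 2100]
rho = spearman(stages, production)
print(f"Spearman rho = {rho:.2f}")
```

A value near +1 or -1 suggests a consistent monotonic trend, while a value near 0 suggests none; it says nothing about causation.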

5. Machine Learning Modeling

The final result of this project is a regression model. Seven candidate models were created: Linear Regression, Lasso, SVM, Random Forest, XGBoost, LightGBM and CatBoost.

Boruta, a feature selection algorithm, was used to select features, and 11 features were kept for the final model. The models were evaluated on three metrics: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE). The initial performance of each model is shown in the table below.

Model Name	MAE	MAPE	RMSE
CatBoost	502.93	0.2817	781.34
LightGBM	522.03	0.2936	806.55
XGBoost	535.10	0.3094	813.48
Random Forest	564.38	0.3281	852.23
SVM	648.01	0.4468	931.77
Linear Regression	679.33		1012.51
Lasso	1018.08	0.4259	1396.98
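
For reference, the three metrics can be written out in a few lines. This is a minimal sketch using only the standard library; it assumes MAPE is reported as a fraction (0.28 meaning 28%), as the values in the table suggest, and the toy numbers are illustrative rather than taken from the dataset.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error, in the units of the target (mmcf here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error as a fraction.

    Undefined when any true value is zero (division by zero).
    """
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error; penalizes large errors more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy, made-up values (NOT from the project dataset).
y_true = [1000.0, 2000.0, 4000.0]
y_pred = [1100.0, 1800.0, 4000.0]

print(f"MAE  = {mae(y_true, y_pred):.2f}")
print(f"MAPE = {mape(y_true, y_pred):.4f}")
print(f"RMSE = {rmse(y_true, y_pred):.2f}")
```

Because RMSE squares each error before averaging, it is always at least as large as MAE, which matches the pattern in the table above.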

6. Final Model

To choose the final model, cross-validation was carried out to evaluate the algorithms more robustly. The resulting metrics are shown in the table below.

Model Name	MAE	MAPE	RMSE
Linear Regression	687.8 +/- 49.40	0.49 +/- 0.04	974.12 +/- 90.88
Lasso	1023.65 +/- 61.45	0.89 +/- 0.06	1348.19 +/- 96.97
SVM	651.62 +/- 28.27	0.51 +/- 0.06	897.34 +/- 60.87
Random Forest	521.82 +/- 26.99	0.36 +/- 0.02	768.7 +/- 74.63
XGBoost	526.78 +/- 14.36	0.35 +/- 0.02	773.11 +/- 52.73
LightGBM	525.71 +/- 31.97	0.34 +/- 0.02	767.4 +/- 58.25
CatBoost	490.18 +/- 16.5	0.32 +/- 0.02	724.79 +/- 54.17

As the table shows, the CatBoost model performed best and was chosen for deployment. A random search hyperparameter optimization was then used to improve its performance. The evaluation metrics of the final model are in the table below.

Model Name	MAE	MAPE	RMSE
CatBoost Tuned	485.66 +/- 23.01	0.32 +/- 0.02	714.4 +/- 64.6
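
The two steps described in this section, scoring with cross-validation reported as mean +/- standard deviation and tuning with random search, can be sketched as follows. This is not the project's code: a tiny k-nearest-neighbours regressor stands in for CatBoost so that the example runs with the standard library only, and the data, function names, fold count and search space are all assumptions made for illustration.

```python
import random
import statistics


def k_fold_indices(n_samples, n_splits, seed=0):
    """Shuffle the indices once and cut them into n_splits nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def knn_predict(train_x, train_y, x, k):
    """Stand-in model: mean target of the k nearest training points (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cross_val_mae(xs, ys, k, n_splits=5, seed=0):
    """Return (mean, sample std) of the MAE across the held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), n_splits, seed):
        held = set(held_out)
        train_x = [xs[i] for i in range(len(xs)) if i not in held]
        train_y = [ys[i] for i in range(len(xs)) if i not in held]
        errors = [abs(ys[i] - knn_predict(train_x, train_y, xs[i], k)) for i in held_out]
        scores.append(sum(errors) / len(errors))
    return statistics.mean(scores), statistics.stdev(scores)


def random_search(xs, ys, k_choices, n_iter=5, seed=0):
    """Try n_iter randomly drawn values of k; keep the lowest mean CV MAE."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        k = rng.choice(k_choices)
        mean_mae, std_mae = cross_val_mae(xs, ys, k)
        if best is None or mean_mae < best[1]:
            best = (k, mean_mae, std_mae)
    return best


# Synthetic data (NOT the project dataset): a noisy linear relationship.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(60)]
ys = [3.0 * x + data_rng.gauss(0, 1) for x in xs]

best_k, best_mean, best_std = random_search(xs, ys, k_choices=list(range(1, 16)))
print(f"k={best_k}: MAE {best_mean:.2f} +/- {best_std:.2f}")
```

Random search evaluates only a sample of the search space, so it trades completeness for speed. With a small dataset, the spread across folds (the +/- term) is worth reading alongside the mean, since a tuned model whose improvement is smaller than that spread may not be meaningfully better.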

7. Conclusion

Although the dataset has many features, it is small and has a significant number of missing values. The model's error was larger than expected; a larger dataset could help reduce it. Using the app, other people can easily make predictions by setting the input values and pressing the prediction button.

8. Future Work

Find a better way to replace missing values.
Find the best way of dealing with the outliers.
Search for models that could perform better with this small dataset.
Try a dimensionality reduction algorithm to improve the model's predictive capability.
Improve the Streamlit app by adding more functions.
