This notebook provides a practical walkthrough of the essential components of Exploratory Data Analysis and Predictive Modelling. The techniques showcased are fundamental for transforming raw data into actionable insights.
Fernando García Bastante
Universidade de Vigo
For Educational Purposes
- The goal of this notebook is to demonstrate how to apply Python tools to analyze, transform, and model data from scratch. It covers everything from initial data cleaning to benchmarking predictive models, including linear regression, penalized regressions (Ridge and Lasso), principal component regression (PCR), partial least squares (PLS), and random forests.
- Make sure Python 3.12 is installed, e.g. with conda:
conda create -n myeda_env python=3.12
conda activate myeda_env
- Install dependencies:
conda install -c conda-forge ipython scikit-learn pingouin pandas matplotlib cython openpyxl seaborn jupyterlab tabulate statsmodels graphviz python-graphviz pydot shap ipywidgets
- Launch JupyterLab:
jupyter lab
...and load the file: eda_dm.ipynb
- Linear/Lasso/Ridge regressions
- RFECV, PCA, PLS
- Random Forest
- Cross-Validation
- SHAP
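A typical first cell for this kind of workflow imports the libraries installed above. This is only a sketch; the exact imports used in eda_dm.ipynb may differ:

```python
# Core data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical modelling and machine learning
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split, cross_val_score

# Model interpretation
import shap
```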
The database employed to illustrate the tools and techniques presented in this Jupyter Notebook is derived from the distinguished article: Chemical Descriptors for a Large-Scale Study on Drop-Weight Impact Sensitivity of High Explosives. This study investigates the relationship between the results of the drop-weight impact test—used to evaluate the handling sensitivity of high explosives—and a compendium of molecular and chemical descriptors associated with the explosives under examination.
Frank W. Marrs, Jack V. Davis, Alexandra C. Burch, Geoffrey W. Brown, Nicholas Lease, Patricia L. Huestis, Marc J. Cawkwell, and Virginia W. Manner (2023).
Chemical Descriptors for a Large-Scale Study on Drop-Weight Impact Sensitivity of High Explosives. Journal of Chemical Information and Modeling.
https://pubs.acs.org/doi/10.1021/acs.jcim.2c01154
DISCLAIMER: This code is provided for educational and demonstrative purposes only. Its sole objective is to illustrate Python techniques for data visualisation and analysis. The datasets used in the examples serve purely as illustrative material; no comprehensive or contextual analysis of these specific datasets has been undertaken or is implied. The primary focus remains on the implementation of technical methodologies, rather than the in-depth interpretation of the data itself. For the purposes of this notebook, minor modifications have been introduced into the database in order to facilitate the illustration of certain techniques presented herein.
- Data Loading and Cleaning
- Variable Transformation
- Exploratory Data Analysis
- Linear Regression with statsmodels
- Stepwise Regression via AIC/BIC
- Regression with scikit-learn and Cross-Validation
- Model Comparison
- Principal Component Regression (PCR)
- Partial Least Squares Regression (PLS)
- Random Forest Regression
- SHAP (SHapley Additive exPlanations)
- Tree Visualization with graphviz
- Pandas Documentation – Data Manipulation: Official reference for pandas, the core library used for data loading, cleaning, and transformation.
- Seaborn: Statistical Data Visualization: Guide to Seaborn's functionality, including correlation plots, pairplots, and heatmaps for EDA.
- Scikit-Learn User Guide: Covers regression models, feature selection (SelectKBest, RFE), pipelines, cross-validation, PCA, and more.
- Statsmodels Documentation: Useful for linear models, OLS diagnostics, VIF analysis, and statistical hypothesis testing.
- Hands-On Exploratory Data Analysis with Python: Practical guide to cleaning, analyzing, and visualizing datasets using NumPy and Pandas.
- SHAP Documentation: An approach to explaining the output of any machine learning model.
This notebook exemplifies a comprehensive data science pipeline implemented using Python. It encompasses the following key stages:
- Seamless Data Ingestion: Demonstrates proficient use of the pandas library to import datasets.
- Structural Inspection: Utilizes methods such as .shape, .columns, and .dtypes to examine dataset dimensions, column names, and data types, which is crucial for identifying structural inconsistencies.
- Preliminary Data Integrity Checks: Functions like .head(), .tail(), and .sample() offer a visual snapshot to verify data quality and detect loading anomalies.
- Concise Dataset Overview: The .info() method summarizes non-null counts, types, and memory usage.
- Descriptive Statistics: The use of .describe(include='all') provides statistical summaries for both numeric and categorical features.
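A minimal sketch of these loading and inspection steps, assuming the dataset is stored in an Excel file named data.xlsx (the actual file name and format used in the notebook may differ):

```python
import pandas as pd

# Load the dataset (hypothetical file name)
df = pd.read_excel("data.xlsx")

# Structural inspection: dimensions, column names, and data types
print(df.shape)
print(df.columns.tolist())
print(df.dtypes)

# Preliminary integrity checks: first, last, and random rows
print(df.head())
print(df.tail())
print(df.sample(5))

# Concise overview and descriptive statistics
df.info()
print(df.describe(include="all"))
```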
- Missing Data Handling: Identifies and quantifies missing values using .isnull().sum(), and explores various imputation strategies (mean, median, mode, or advanced methods).
- Duplicate Detection and Removal: Employs .duplicated() and .drop_duplicates() to ensure data uniqueness.
- Data Type Correction: Uses .astype() to convert columns to appropriate types (e.g., numeric, datetime).
- Inconsistency Resolution: Standardizes categorical entries and corrects formatting issues to maintain coherence.
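A minimal sketch of these cleaning steps, using hypothetical column names (num_col, cat_col) purely for illustration:

```python
# Quantify missing values per column
print(df.isnull().sum())

# Simple imputation: median for a numeric column, mode for a categorical one
df["num_col"] = df["num_col"].fillna(df["num_col"].median())
df["cat_col"] = df["cat_col"].fillna(df["cat_col"].mode()[0])

# Detect and remove duplicate rows
print(df.duplicated().sum())
df = df.drop_duplicates()

# Correct data types
df["num_col"] = df["num_col"].astype(float)

# Standardize categorical entries (trim whitespace, unify case)
df["cat_col"] = df["cat_col"].str.strip().str.lower()
```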
- Individual Feature Examination: Analyses each variable in isolation to understand distribution and quality.
- Numerical Features: Visualized using histograms, KDE plots, and box plots to reveal distribution patterns and outliers.
- Categorical Features: Evaluated to explore class distributions.
- Descriptive Measures: Summarizes central tendency, dispersion, and shape via statistical metrics (mean, median, IQR, etc.).
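The univariate checks above can be sketched as follows (num_col and cat_col are hypothetical column names):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Numerical feature: histogram with KDE and a box plot
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["num_col"], kde=True, ax=axes[0])
sns.boxplot(x=df["num_col"], ax=axes[1])
plt.show()

# Categorical feature: class distribution
print(df["cat_col"].value_counts())
sns.countplot(data=df, x="cat_col")
plt.show()

# Descriptive measures: central tendency, dispersion, shape
print(df["num_col"].agg(["mean", "median", "std", "skew"]))
```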
- Numerical-Numerical Relationships: Uses scatter plots and correlation matrices (heatmaps) to assess linear associations.
- Categorical-Numerical Insights: Explores grouped box plots, violin plots, and aggregated bar plots.
- Multidimensional Exploration: Employs seaborn.pairplot() and variable encodings (hue, size) to visualize interactions across multiple dimensions.
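A brief illustration of these bivariate and multivariate views (column names such as num_col, cat_col, and target are hypothetical):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Numerical-numerical: scatter plot and correlation heatmap
sns.scatterplot(data=df, x="num_col", y="target")
plt.show()
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Categorical-numerical: grouped box plot
sns.boxplot(data=df, x="cat_col", y="target")
plt.show()

# Multidimensional: pairplot colored by a categorical variable
sns.pairplot(df, hue="cat_col")
plt.show()
```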
- Visual and Statistical Techniques: Identifies outliers using box plots, Z-scores, and the IQR method, enabling thoughtful exclusion or treatment.
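The Z-score and IQR rules mentioned above can be sketched as follows for a single numeric column (num_col is hypothetical, and the thresholds are conventional choices rather than the ones used in the notebook):

```python
import numpy as np

col = df["num_col"]

# Z-score rule: flag values more than 3 standard deviations from the mean
z_scores = (col - col.mean()) / col.std()
z_outliers = df[np.abs(z_scores) > 3]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]

print(len(z_outliers), len(iqr_outliers))
```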
- Derived Features: Highlights opportunities to construct new features (e.g., from date fields or binning) to enrich modeling potential.
- Scaling and Encoding: Implements standardization/normalization and categorical encoding (e.g., One-Hot, Label Encoding) as needed for downstream modeling.
- Feature Selection: Uses correlation analysis, domain knowledge, and model-based importance to select relevant predictors.
- Data Partitioning: Applies train-test splits and optionally cross-validation to ensure robust model evaluation.
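A compact sketch of encoding, partitioning, scaling, and feature selection with scikit-learn; the feature and target names, as well as parameter values such as k, are illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

# One-hot encode categorical predictors, then separate features and target
X = pd.get_dummies(df.drop(columns="target"), drop_first=True)
y = df["target"]

# Train-test split for robust evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Simple filter-based feature selection: keep the k best predictors
selector = SelectKBest(score_func=f_regression, k=10)
X_train_sel = selector.fit_transform(X_train_scaled, y_train)
X_test_sel = selector.transform(X_test_scaled)
```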
- Regression Techniques:
  - Linear Regression: Models linear dependencies for interpretability.
  - Random Forest Regressor: Captures nonlinear relationships using ensemble learning.
- Model Training and Prediction: Fits the model on training data and generates predictions for unseen instances.
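A minimal training-and-prediction sketch for the two regressors above, continuing from the hypothetical split defined earlier (hyperparameter values are illustrative):

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Interpretable linear baseline
lin_reg = LinearRegression()
lin_reg.fit(X_train_sel, y_train)
y_pred_lin = lin_reg.predict(X_test_sel)

# Ensemble model for nonlinear relationships
rf_reg = RandomForestRegressor(n_estimators=300, random_state=42)
rf_reg.fit(X_train_sel, y_train)
y_pred_rf = rf_reg.predict(X_test_sel)
```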
- Performance Metrics: Assesses model accuracy via MSE, RMSE, and R².
- Hyperparameter Tuning: Where applicable, employs GridSearch or RandomizedSearch for parameter optimization.
- Interpretation and Insight: Relates performance to domain-specific expectations and explores model behavior.
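The evaluation and tuning steps above might look like the following sketch, where the parameter grid and scoring choice are illustrative assumptions rather than the notebook's actual settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

# Performance metrics: MSE, RMSE, and R² on the held-out test set
mse = mean_squared_error(y_test, y_pred_rf)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_rf)
print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")

# Hyperparameter tuning via grid search with cross-validation
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train_sel, y_train)
print(grid.best_params_, -grid.best_score_)
```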