Residuals_Matter

This notebook provides a practical walkthrough of the essential components of Exploratory Data Analysis and Predictive Modeling. The techniques showcased are fundamental for transforming raw data into actionable insights.

Fernando García Bastante
Universidade de Vigo
For Educational Purposes


🎯 Objective

  • The goal of this notebook is to demonstrate how to apply Python tools to analyze, transform, and model data from scratch. It covers everything from initial cleaning to benchmarking predictive models, including linear regression, penalized regressions, PCA, PLS, and random forest.

🧪 Requirements

  1. Make sure Python 3.12 is installed, e.g. with conda:
    conda create -n myeda_env python=3.12
    conda activate myeda_env
  2. Install dependencies (all from the conda-forge channel):
    conda install -c conda-forge ipython scikit-learn pingouin pandas matplotlib cython openpyxl seaborn jupyterlab tabulate statsmodels graphviz python-graphviz pydot shap ipywidgets
  3. Launch JupyterLab:
    jupyter lab
    ...and open the file eda_dm.ipynb

🧰 Techniques

  • Linear/Lasso/Ridge regressions
  • RFECV, PCA, PLS
  • Random Forest
  • Cross-Validation
  • SHAP
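
To give a flavour of how some of these pieces fit together, here is a minimal scikit-learn sketch of cross-validated Lasso and Ridge regressions. The data are a synthetic stand-in generated with make_regression, not the notebook's dataset:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV, RidgeCV
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for the notebook's dataset
    X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

    # Penalized regressions; the penalty strength is chosen by internal cross-validation
    models = {
        "Lasso": make_pipeline(StandardScaler(), LassoCV(cv=5)),
        "Ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13))),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name}: mean CV R^2 = {scores.mean():.3f}")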

Database

The database used to illustrate the tools and techniques presented in this Jupyter Notebook is derived from the article Chemical Descriptors for a Large-Scale Study on Drop-Weight Impact Sensitivity of High Explosives. This study investigates the relationship between the results of the drop-weight impact test, which is used to evaluate the handling sensitivity of high explosives, and a set of molecular and chemical descriptors of the explosives under examination.

Frank W. Marrs, Jack V. Davis, Alexandra C. Burch, Geoffrey W. Brown, Nicholas Lease, Patricia L. Huestis, Marc J. Cawkwell, and Virginia W. Manner (2023). Chemical Descriptors for a Large-Scale Study on Drop-Weight Impact Sensitivity of High Explosives. Journal of Chemical Information and Modeling.
https://pubs.acs.org/doi/10.1021/acs.jcim.2c01154

DISCLAIMER: This code is provided for educational and demonstrative purposes only. Its sole objective is to illustrate Python techniques for data visualization and analysis. The datasets used in the examples serve purely as illustrative material; no comprehensive or contextual analysis of these specific datasets has been undertaken or is implied. The primary focus remains on the implementation of technical methodologies, rather than the in-depth interpretation of the data itself. For the purposes of this notebook, minor modifications have been introduced into the database in order to facilitate the illustration of certain techniques presented herein.


📑 Notebook Structure

  1. Data Loading and Cleaning
  2. Variable Transformation
  3. Exploratory Data Analysis
  4. Linear Regression with statsmodels
  5. Stepwise Regression via AIC/BIC
  6. Regression with scikit-learn and Cross-Validation
  7. Model Comparison
  8. Principal Component Regression (PCR)
  9. Partial Least Squares Regression (PLS)
  10. Random Forest Regression
  11. SHAP (SHapley Additive exPlanations)
  12. Tree Visualization with graphviz

📚 Recommended Bibliography and Resources (in English)

  1. Pandas Documentation – Data Manipulation
    Official reference for pandas, the core library used for data loading, cleaning, and transformation.

  2. Seaborn: Statistical Data Visualization
    Guide to Seaborn's functionality, including correlation plots, pairplots, and heatmaps for EDA.

  3. Scikit-Learn User Guide
    Covers regression models, feature selection (SelectKBest, RFE), pipelines, cross-validation, PCA, and more.

  4. Statsmodels Documentation
    Useful for linear models, OLS diagnostics, VIF analysis, and statistical hypothesis testing.

  5. Hands-On Exploratory Data Analysis with Python
    Practical guide to cleaning, analyzing, and visualizing datasets using NumPy and Pandas.

  6. SHAP Documentation
    A game-theoretic approach to explaining the output of any machine learning model.


📘 Summary of Workflow

This notebook exemplifies a comprehensive data science pipeline implemented using Python. It encompasses the following key stages:


1. Data Loading and Initial Appraisal

  • Seamless Data Ingestion: Demonstrates proficient use of the pandas library to import datasets.
  • Structural Inspection: Utilizes methods such as .shape, .columns, and .dtypes to examine dataset dimensions, column names, and data types—crucial for identifying structural inconsistencies.
  • Preliminary Data Integrity Checks: Functions like .head(), .tail(), and .sample() offer a visual snapshot to verify data quality and detect loading anomalies.
  • Concise Dataset Overview: The .info() method summarizes non-null counts, types, and memory usage.
  • Descriptive Statistics: The use of .describe(include='all') provides statistical summaries for both numeric and categorical features.
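
As a minimal sketch of this first appraisal, assuming the data arrive as an Excel file (the filename data.xlsx is a placeholder, not the notebook's actual file):

    import pandas as pd

    df = pd.read_excel("data.xlsx")        # placeholder filename

    print(df.shape)                        # dataset dimensions (rows, columns)
    print(df.columns)                      # column names
    print(df.dtypes)                       # data type of each column
    print(df.sample(5))                    # random rows to spot loading anomalies
    df.info()                              # non-null counts, dtypes, memory usage
    print(df.describe(include="all"))      # numeric and categorical summaries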

2. Comprehensive Data Cleansing and Preprocessing

  • Missing Data Handling: Identifies and quantifies missing values using .isnull().sum(), and explores various imputation strategies (mean, median, mode, or advanced methods).
  • Duplicate Detection and Removal: Employs .duplicated() and .drop_duplicates() to ensure data uniqueness.
  • Data Type Correction: Uses astype() to convert columns to appropriate types (e.g., numeric, datetime).
  • Inconsistency Resolution: Standardizes categorical entries and corrects formatting issues to maintain coherence.
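
A sketch of these cleaning steps, continuing with the df loaded above; the column names x and cat are hypothetical:

    print(df.isnull().sum())                          # missing values per column
    df["x"] = df["x"].fillna(df["x"].median())        # simple median imputation
    df = df.drop_duplicates()                         # keep each row only once
    df["x"] = df["x"].astype(float)                   # fix a mis-typed column
    df["cat"] = df["cat"].str.strip().str.lower()     # standardize categorical entries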

3. Univariate Analysis

  • Individual Feature Examination: Analyzes each variable in isolation to understand distribution and quality.
  • Numerical Features: Visualized using histograms, KDE plots, and box plots to reveal distribution patterns and outliers.
  • Categorical Features: Evaluated to explore class distributions.
  • Descriptive Measures: Summarizes central tendency, dispersion, and shape via statistical metrics (mean, median, IQR, etc.).
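
A sketch of typical univariate views with seaborn, again using the hypothetical columns x (numeric) and cat (categorical):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # One numerical feature in isolation
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    sns.histplot(df["x"], kde=True, ax=axes[0])       # histogram with KDE overlay
    sns.boxplot(x=df["x"], ax=axes[1])                # box plot exposes outliers
    plt.show()

    print(df["x"].describe())                         # central tendency and dispersion
    print(df["cat"].value_counts())                   # class distribution of a categorical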

4. Bivariate and Multivariate Analysis

  • Numerical-Numerical Relationships: Uses scatter plots and correlation matrices (heatmaps) to assess linear associations.
  • Categorical-Numerical Insights: Explores grouped box plots, violin plots, and aggregated bar plots.
  • Multidimensional Exploration: Employs seaborn.pairplot() and variable encodings (hue, size) to visualize interactions across multiple dimensions.
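
The same kind of exploration in code, as a sketch on the hypothetical df:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Linear associations among numerical features
    corr = df.select_dtypes("number").corr()
    sns.heatmap(corr, cmap="coolwarm", center=0)
    plt.show()

    # Pairwise scatter plots; hue encodes a (hypothetical) categorical column
    sns.pairplot(df, hue="cat")
    plt.show()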

5. Outlier Detection and Management

  • Visual and Statistical Techniques: Identifies outliers using box plots, Z-scores, and the IQR method, enabling thoughtful exclusion or treatment.
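
Both rules fit in a few lines of pandas; a sketch using the hypothetical numeric column x:

    x = df["x"]

    # IQR rule: flag points beyond 1.5 * IQR from the quartiles
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    iqr_mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

    # Z-score rule: flag points more than 3 standard deviations from the mean
    z_mask = ((x - x.mean()) / x.std()).abs() > 3

    print(iqr_mask.sum(), z_mask.sum())               # number of flagged points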

6. Feature Engineering and Transformation

  • Derived Features: Highlights opportunities to construct new features (e.g., from date fields or binning) to enrich modeling potential.
  • Scaling and Encoding: Implements standardization/normalization and categorical encoding (e.g., One-Hot, Label Encoding) as needed for downstream modeling.
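
A sketch of these transformations with pandas and scikit-learn, still using hypothetical column names:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Derived feature: discretize a numeric column into 4 bins
    df["x_bin"] = pd.cut(df["x"], bins=4, labels=False)

    # Scale numeric columns and one-hot encode categoricals in a single transformer
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["x"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
    ])
    X = preprocess.fit_transform(df)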

7. Modeling Pipeline Preparation

  • Feature Selection: Uses correlation analysis, domain knowledge, and model-based importance to select relevant predictors.
  • Data Partitioning: Applies train-test splits and optionally cross-validation to ensure robust model evaluation.
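
A sketch of the partitioning step, assuming the feature matrix X from the previous sketch and a hypothetical target column:

    from sklearn.model_selection import KFold, train_test_split

    y = df["target"]                       # hypothetical target column

    # Hold out 20% of the rows for a final, untouched evaluation set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Reusable splitter for cross-validated model selection on the training data
    cv = KFold(n_splits=5, shuffle=True, random_state=42)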

8. Predictive Modeling

  • Regression Techniques:

    • Linear Regression: Models linear dependencies for interpretability.
    • Random Forest Regressor: Captures nonlinear relationships using ensemble learning.
  • Model Training and Prediction: Fits the model on training data and generates predictions for unseen instances.
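
A sketch of fitting both models on the training split from the previous step:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    # Fit on the training data, then predict for unseen instances
    lin = LinearRegression().fit(X_train, y_train)
    rf = RandomForestRegressor(n_estimators=300, random_state=42).fit(X_train, y_train)

    y_pred_lin = lin.predict(X_test)
    y_pred_rf = rf.predict(X_test)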


9. Model Evaluation and Optimization

  • Performance Metrics: Assesses model accuracy via MSE, RMSE, and R².
  • Hyperparameter Tuning: Where applicable, employs GridSearch or RandomizedSearch for parameter optimization.
  • Interpretation and Insight: Relates performance to domain-specific expectations and explores model behavior.
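
A sketch of the evaluation and tuning steps, continuing from the fitted random forest above:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import GridSearchCV

    # Performance metrics on the held-out test set
    mse = mean_squared_error(y_test, y_pred_rf)
    print(f"RMSE = {np.sqrt(mse):.3f}  R^2 = {r2_score(y_test, y_pred_rf):.3f}")

    # Hyperparameter tuning for the random forest
    grid = GridSearchCV(
        RandomForestRegressor(random_state=42),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
        cv=5, scoring="neg_root_mean_squared_error")
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)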
