Regression Analysis - Statistics

Overview

  • Important assumptions

    • linear and additive relationship between dependent (response) variable and independent (predictor) variables
    • no autocorrelation: no correlation between residual (error) terms
    • no multicollinearity: independent variables not highly correlated with each other
    • homoscedasticity: error terms w/ constant variance
    • normally distributed error terms
  • Types of regression

    • ANOVA - Analysis of Variance: a kind of regression in which the X variable or variables are categorical rather than numeric
    • Analysis of Covariance: used when there is at least one numeric (continuous or interval) variable mixed in with one or more categorical Independent Variables
    • Logistic Regression: dichotomous (binary) dependent variable
    • Multinomial Logistic Regression/Logit Model: categorical dependent variable w/ more than two categories
    • Probit model: based on the cumulative normal distribution
    • Ordinal Logistic / Probit Regression: categorical dependent variable w/ ordered categories
    • Poisson and Negative Binomial Regression: count dependent variable
    • Quantile regression
    • Box-Cox regression
    • Truncated and censored regression
    • Hurdle regression
    • Nonparametric regression
    • Regression methods for time-series
    • Regression methods for longitudinal data

Assumption Violation and Solutions

  • Linear and Additive

    • fitting a linear model to a non-linear, non-additive data set $\to$ fails to capture the trend mathematically
    • resulting in an inefficient model
    • erroneous predictions on an unseen data set
    • Check:
      • residual vs fitted value plot: a systematic pattern signals non-linearity
      • if a pattern appears, include polynomial terms ($X, X^2, X^3, \dots$) to capture the non-linear effect
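
The check above can be sketched numerically. This is a minimal illustration on synthetic data, assuming only NumPy; the data-generating coefficients and the helper name `fit_ols` are made up for the example. It fits a straight line and a quadratic to curved data and compares what is left in the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
# Synthetic response with a genuine quadratic effect (assumed for illustration)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.5, size=x.size)

def fit_ols(X, y):
    """Return coefficients and residuals of an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

X_lin = np.column_stack([np.ones_like(x), x])          # intercept + X
X_quad = np.column_stack([np.ones_like(x), x, x**2])   # intercept + X + X^2

_, resid_lin = fit_ols(X_lin, y)
beta_quad, resid_quad = fit_ols(X_quad, y)

# Residual sum of squares for each model
rss_lin = float(resid_lin @ resid_lin)
rss_quad = float(resid_quad @ resid_quad)

# Leftover curvature: correlation between residuals and x^2
curv_lin = float(np.corrcoef(resid_lin, x**2)[0, 1])
curv_quad = float(np.corrcoef(resid_quad, x**2)[0, 1])

print(f"RSS linear = {rss_lin:.1f}, RSS quadratic = {rss_quad:.1f}")
print(f"residual curvature: linear = {curv_lin:.3f}, quadratic = {curv_quad:.3f}")
```

In a residual vs fitted value plot, the linear fit would show the same leftover curvature as a U-shaped pattern, while the quadratic fit should look like structureless noise.
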
  • Autocorrelation

    • presence of correlation in error terms drastically reduces model’s accuracy
    • time series models: the next instance depending on previous instance
    • correlated error terms: estimated standard errors tend to underestimate the true standard errors
      • narrower confidence intervals: a nominal 95% confidence interval contains the actual value of a coefficient with probability less than 0.95
      • narrower prediction intervals
    • lower standard errors $\to$ lower p-values $\to$ incorrect conclusion for a parameter as statistically significant
    • Check:
      • Durbin-Watson (DW) statistic: $DW \in [0, 4]$
        • $DW \approx 2 \implies$ no autocorrelation
        • $DW \in [0, 2) \implies$ positive autocorrelation
        • $DW \in (2, 4] \implies$ negative autocorrelation
      • residual vs time plot: observing the seasonal or correlated pattern in residual values
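
As a rough sketch of the Durbin-Watson check, the statistic can be computed directly from its definition, $DW = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$. The example assumes NumPy and uses simulated AR(1) error series; the function name `durbin_watson` and the coefficient 0.8 are choices made for this illustration.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences / sum of squares."""
    resid = np.asarray(resid, dtype=float)
    diff = np.diff(resid)
    return float(diff @ diff / (resid @ resid))

rng = np.random.default_rng(1)
n = 2000
white = rng.normal(size=n)            # uncorrelated errors

# AR(1) errors: e_t = rho * e_{t-1} + noise_t, with rho = +0.8 and rho = -0.8
ar_pos = np.zeros(n)
ar_neg = np.zeros(n)
for t in range(1, n):
    ar_pos[t] = 0.8 * ar_pos[t - 1] + white[t]
    ar_neg[t] = -0.8 * ar_neg[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_pos = durbin_watson(ar_pos)
dw_neg = durbin_watson(ar_neg)

print(f"white noise: {dw_white:.2f}, positive AR(1): {dw_pos:.2f}, negative AR(1): {dw_neg:.2f}")
```

For large samples $DW \approx 2(1 - \hat\rho)$, where $\hat\rho$ is the lag-1 autocorrelation of the residuals, which is why values near 2 indicate no autocorrelation.
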
  • Multicollinearity

    • the independent variables moderately or highly correlated
    • model with correlated variables
      • difficult to figure out the true relationship of a predictor with the response variable
      • hard to find out which variable is actually contributing to predicting the response variable
    • correlated predictors
      • larger standard errors $\to$ wider confidence interval $\to$ less precise estimates of slope parameters
      • estimated regression coefficient of a predictor: depends on which other predictors are in the model
      • incorrect conclusion: that a variable strongly / weakly affects the target variable
      • estimated regression coefficients change when a correlated variable is dropped
    • Check:
      • scatter plot to visualize correlation effect among variables
      • variance inflation factor (VIF), common rules of thumb:
        • $VIF \le 4$: no multicollinearity
        • $VIF \ge 10$: serious multicollinearity
      • correlation table
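
The VIF check can be sketched from its definition, $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The snippet below is an illustrative implementation assuming NumPy only; the helper `vif` and the synthetic predictors `x1`, `x2`, `x3` are invented for the example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all remaining predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated to x1 and x2

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

Here `x1` and `x2` should land far above the threshold of 10, while `x3` should stay close to 1.
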
  • Heteroscedasticity

    • non-constant variance in the error terms
    • often arising in the presence of outliers or extreme leverage values
    • such values disproportionately influence the model’s performance
    • confidence interval for out of sample prediction tends to be unrealistically wide or narrow
    • Check:
      • residual vs fitted values plot: a funnel-shaped pattern indicates heteroscedasticity
      • Breusch-Pagan / Cook–Weisberg test or White general test
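
A minimal sketch of the test idea, assuming NumPy only: the studentized (Koenker) form of the Breusch-Pagan statistic is $n R^2$ from an auxiliary regression of the squared residuals on the predictors, compared against a chi-square distribution with as many degrees of freedom as there are predictors. This is one variant of the test, not necessarily the one any particular package reports by default, and the helper names and synthetic data are assumptions for the illustration.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals of an ordinary least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def breusch_pagan_lm(X, resid):
    """Studentized (Koenker) LM statistic: n * R^2 of resid**2 regressed on X.

    X must already contain an intercept column.
    """
    sq = resid**2
    aux_resid = ols_residuals(X, sq)
    r2 = 1.0 - aux_resid @ aux_resid / np.sum((sq - sq.mean()) ** 2)
    return float(len(sq) * r2)

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])
noise = rng.normal(size=n)

y_homo = 2.0 + 3.0 * x + noise        # constant error variance
y_hetero = 2.0 + 3.0 * x + noise * x  # error spread grows with x

lm_homo = breusch_pagan_lm(X, ols_residuals(X, y_homo))
lm_hetero = breusch_pagan_lm(X, ols_residuals(X, y_hetero))

CHI2_1DF_5PCT = 3.841  # 5% critical value of chi-square with 1 degree of freedom
print(f"homoscedastic LM = {lm_homo:.2f}, heteroscedastic LM = {lm_hetero:.2f}")
```

An LM value above the critical value rejects constant variance at the 5% level; with one predictor the reference distribution has one degree of freedom.
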
  • Normal distribution of error terms

    • non-normally distributed:
      • confidence intervals may be too wide or too narrow
      • often caused by a few unusual data points $\to$ investigate them closely to build a better model
    • unstable confidence intervals: coefficient estimates based on least-squares minimization become harder to trust
    • Check:
      • Q-Q plot
      • statistical tests of normality, including Kolmogorov-Smirnov test, Shapiro-Wilk test
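
The Q-Q check can be approximated numerically without drawing the plot: sort the residuals, pair them with theoretical normal quantiles, and measure how close the pairs are to a straight line. The sketch below assumes NumPy plus the standard library's `statistics.NormalDist`; the function name `qq_correlation`, the plotting positions $(i - 0.5)/n$, and the simulated residuals are choices made for this example, and the correlation is an informal summary rather than a formal test such as Shapiro-Wilk.

```python
import numpy as np
from statistics import NormalDist

def qq_correlation(sample):
    """Correlation between sorted sample values and theoretical normal quantiles."""
    sample = np.sort(np.asarray(sample, dtype=float))
    n = sample.size
    probs = (np.arange(1, n + 1) - 0.5) / n            # plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return float(np.corrcoef(theo, sample)[0, 1])

rng = np.random.default_rng(4)
resid_normal = rng.normal(size=500)                # normally distributed errors
resid_skewed = rng.exponential(size=500) - 1.0     # right-skewed errors, mean zero

r_normal = qq_correlation(resid_normal)
r_skewed = qq_correlation(resid_skewed)

print(f"Q-Q correlation: normal = {r_normal:.4f}, skewed = {r_skewed:.4f}")
```

A value very close to 1 corresponds to points lying on a straight line in the Q-Q plot; a visibly lower value corresponds to the curved pattern produced by skewed or heavy-tailed residuals.
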

Regression Plots

  • Residual vs Fitted Values

    • scatter plot:
      • the distribution of residuals (errors) vs fitted values (predicted values)
      • various useful insights including outliers
      • outliers: labeled by observation number to make them easy to detect
    • key points
      • existence of any patterns:
        • signs of non-linearity in the data
        • model not capturing non-linearity
      • funnel shaped: sign of non-constant variance, i.e. heteroscedasticity
    • solution
      • for non-linearity: applying a non-linear transformation to a predictor, such as $\log(x)$, $\sqrt{x}$, or $x^2$
      • to overcome heteroscedasticity:
        • transforming the response variable, such as $\log(Y)$ or $\sqrt{Y}$
        • weighted least squares method
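
The weighted least squares remedy can be sketched as follows, assuming NumPy only. The example assumes the error standard deviation is proportional to $x$, so the weights are taken as $1/x^2$; in practice the variance structure is rarely known and has to be estimated, so both the data and the weights here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n) * x   # error sd proportional to x
X = np.column_stack([np.ones(n), x])

# Ordinary least squares
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_ols = y - X @ beta_ols

# Weighted least squares: weight = 1 / variance, assumed proportional to 1 / x**2.
# Scaling each row by sqrt(weight) turns WLS into an ordinary least squares problem.
w = 1.0 / x**2
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
resid_wls = (y - X @ beta_wls) * sw          # residuals on the weighted scale

# Funnel check: does the absolute residual grow with x?
funnel_ols = float(np.corrcoef(np.abs(resid_ols), x)[0, 1])
funnel_wls = float(np.corrcoef(np.abs(resid_wls), x)[0, 1])

print(f"slope: OLS = {beta_ols[1]:.3f}, WLS = {beta_wls[1]:.3f}")
print(f"funnel strength: OLS = {funnel_ols:.3f}, WLS = {funnel_wls:.3f}")
```

Both fits estimate the slope without bias, but only the weighted residuals should lose the funnel shape, which is what makes the WLS standard errors trustworthy.
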
  • Normal Q-Q Plot

    • a scatter plot to validate the assumption of normal distribution in a data set
    • normal distribution: points fall along a fairly straight line
    • non-normality: deviation from the straight line
    • quantile:
      • point in the data below which a certain proportion of the data falls
      • often referred to as percentiles
      • e.g., value of 50th percentile = 120 $\implies$ half of the data lies below 120
    • solution: non-linear transformation of variables (response or predictors)
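
To illustrate both the quantile definition and the transformation remedy, the sketch below (assuming NumPy, with a synthetic log-normal response and a hand-written `skewness` helper introduced for this example) computes the 50th percentile and compares skewness before and after a log transformation.

```python
import numpy as np

def skewness(v):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    v = np.asarray(v, dtype=float)
    c = v - v.mean()
    return float(np.mean(c**3) / np.mean(c**2) ** 1.5)

rng = np.random.default_rng(5)
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # strongly right-skewed values

# Quantile: the 50th percentile splits the data in half
median = np.percentile(y, 50)
share_below = float(np.mean(y < median))

skew_raw = skewness(y)
skew_log = skewness(np.log(y))   # log transformation of the variable

print(f"share of data below the 50th percentile: {share_below:.2f}")
print(f"skewness: raw = {skew_raw:.2f}, after log = {skew_log:.2f}")
```

A log transformation only makes sense for strictly positive values; for other shapes of non-normality a different transformation may fit better.
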
  • Scale Location Plot

    • used to check homoscedasticity (constant variance of the residuals)
    • shows how the residuals spread along the range of fitted values
    • similar to the residual vs fitted value plot, except using standardized residual values
    • homoscedastic: no discernible pattern in the plot
    • solution: (same as Residual vs Fitted Values for heteroscedasticity)
      • transforming the response variable, such as $\log(Y)$ or $\sqrt{Y}$
      • weighted least squares method
  • Residuals vs Leverage Plot

    • also known as Cook’s Distance plot
    • Cook’s distance: identifies the points with more influence than other points
    • influential points: have a sizable impact on the regression line
    • adding or removing such points can completely change the model statistics
    • influential points are not necessarily outliers: investigating the data is required
    • solution when an influential point is an outlier:
      • removing those observations if there are not many
      • treating them as missing values or scaling them down
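
Cook's distance can be computed from the residuals and leverages as $D_i = \frac{e_i^2}{p\,s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$, where $h_{ii}$ is the $i$-th diagonal element of the hat matrix, $p$ the number of coefficients, and $s^2$ the residual variance estimate. The following is a minimal sketch assuming NumPy; the function name `cooks_distance` and the planted influential point are invented for the illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Return (Cook's distances, leverages) for an OLS fit of y on X.

    X must already contain an intercept column and have full column rank.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverage = diagonal of the hat matrix, computed via the thin QR factor
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q**2, axis=1)
    s2 = resid @ resid / (n - p)
    cooks = resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2
    return cooks, leverage

rng = np.random.default_rng(7)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)

# Plant one point far from the other x values and far off the true line
x_all = np.append(x, 6.0)
y_all = np.append(y, -10.0)
X_all = np.column_stack([np.ones(x_all.size), x_all])

cooks, leverage = cooks_distance(X_all, y_all)
worst = int(np.argmax(cooks))

# Effect of removing the most influential point on the estimated slope
slope_all = np.linalg.lstsq(X_all, y_all, rcond=None)[0][1]
X_clean = np.delete(X_all, worst, axis=0)
y_clean = np.delete(y_all, worst)
slope_clean = np.linalg.lstsq(X_clean, y_clean, rcond=None)[0][1]

print(f"most influential index = {worst}, Cook's D = {cooks[worst]:.2f}")
print(f"slope with point = {slope_all:.2f}, without = {slope_clean:.2f}")
```

A common informal reading is that $D_i > 1$ (or, more conservatively, $D_i > 4/n$) deserves a closer look; whether the point should be removed still depends on investigating the data.
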