Tracking, notes and programming snippets while learning predictive analytics

JasonMDev/learning-python-predictive-analytics

Predictive Analytics with Python

These are my notes from working through the book Learning Predictive Analytics with Python by Ashish Kumar, published in February 2016.

General

###Chapter 1: Getting Started with Predictive Modelling

  • Installed the Anaconda distribution.
  • Python 3.5 is installed.
  • The book follows Python 2, so some of the code is modified along the way to run on Python 3.

###Chapter 2: Data Cleaning

  • Reading the data: variations and examples
  • Data frames and delimiters.

####Case 1: Reading a dataset using the read_csv method

  • File: titanicReadCSV.py
  • File: titanicReadCSV1.py
  • File: readCustomerChurn.py
  • File: readCustomerChurn2.py
  • File: changeDelimiter.py

####Case 2: Reading a dataset using the open method of Python

  • File: readDatasetByOpenMethod.py
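
As a rough illustration of Case 2, here is a minimal sketch of reading a delimited file with the built-in `open` function. It is not the contents of readDatasetByOpenMethod.py; the helper name `read_delimited` and the tiny sample file are assumptions made for this example, and the naive `split` does not handle quoted fields that contain the delimiter.

```python
import os
import tempfile


def read_delimited(path, delimiter=","):
    """Read a delimited text file into a dict of column name -> list of values."""
    with open(path, "r") as handle:
        # The first line is assumed to hold the column names.
        header = handle.readline().strip().split(delimiter)
        columns = {name: [] for name in header}
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            for name, value in zip(header, line.split(delimiter)):
                columns[name].append(value)
    return columns


# Write a tiny sample file (hypothetical data) so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("name,age\nAnn,31\nBob,45\n")
    sample_path = tmp.name

data = read_delimited(sample_path)
os.remove(sample_path)
print(data)
```

All values come back as strings; converting numeric columns is left to the caller.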

####Case 3: Reading data from a URL

  • Modified the code so that it works and prints the dataset line by line as dictionaries.
  • File: readURLLib2Iris.py
  • File: readURLMedals.py

####Case 4: Miscellaneous cases

  • File: readXLS.py
  • Created the file above to read from both .xls and .xlsx

####Basics: Summary, dimensions, and structure

  • File: basicDataCheck.py
  • Created the file above to read from both .xls and .xlsx

####Handling missing values

  • File: basicDataCheck.py
  • Treating missing data such as NaN or None
  • Deletion or imputation
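
The two options above can be sketched with pandas as follows. This is a minimal example on a made-up data frame (the column names `age` and `fare` are assumptions), not the code in basicDataCheck.py, and mean imputation is only one of several possible imputation strategies.

```python
import pandas as pd

# Hypothetical data with one missing value per column (None becomes NaN).
df = pd.DataFrame({"age": [22.0, None, 30.0], "fare": [7.5, 8.0, None]})

# Deletion: drop every row that contains at least one missing value.
deleted = df.dropna()

# Imputation: replace each missing value with the mean of its column.
imputed = df.fillna(df.mean())

print(deleted)
print(imputed)
```

Deletion keeps only complete rows, so it can discard a lot of data; imputation keeps every row at the cost of inventing values.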

####Creating dummy variables

  • File: basicDataCheck.py
  • Split into the new variables 'sex_female' and 'sex_male'
  • Remove the column 'sex'
  • Add both dummy columns created above.
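
The three steps above might look like this with `pandas.get_dummies`. This is a sketch on a made-up three-row frame, not the code in basicDataCheck.py; only the column names 'sex', 'sex_female' and 'sex_male' come from the notes.

```python
import pandas as pd

# Hypothetical data; only the 'sex' column mirrors the notes above.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"],
                   "sex": ["female", "male", "male"]})

# Step 1: one indicator column per category ('sex_female', 'sex_male').
dummies = pd.get_dummies(df["sex"], prefix="sex")

# Steps 2 and 3: remove the original column, then add both dummy columns.
df = df.drop(columns=["sex"]).join(dummies)
print(df)
```

Depending on the pandas version the indicator columns are boolean or integer typed; both work as model inputs after a cast.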

####Visualizing a dataset by basic plotting

  • File: plotData.py
  • Figure file: ScatterPlots.jpeg
  • Plot types: scatter plots, histograms, and box plots

###Chapter 3: Data Wrangling

####Subsetting a dataset

  • Selecting Columns
  • File: subsetDataset.py
  • Selecting Rows
  • File: subsetDatasetRows.py
  • Selecting a combination of rows and columns
  • File: subsetColRows.py
  • Creating new columns
  • File: subsetNewCol.py

####Generating random numbers and their usage

  • Various methods for generating random numbers
  • File: generateRandomNumbers.py
  • Seeding a random number
  • File: generateRandomNumbers.py
  • Generating random numbers following probability distributions
  • File: generateRandomProbDistr.py
  • Probability density function (PDF): f(x) gives the relative likelihood of X near x; for a discrete variable the analogous mass function is Prob(X=x)
  • Cumulative distribution function: CDF(x) = Prob(X<=x)
  • Uniform distribution: all values in the range occur with the same (uniform) frequency/probability
  • Normal distribution: the bell curve, the most ubiquitous and versatile probability distribution
  • Using the Monte-Carlo simulation to find the value of pi
  • File: calcPi.py
  • Geometry and mathematics behind the calculation of pi
  • Generating a dummy data frame
  • File: generateDummyDataFrame.py
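
The Monte-Carlo estimate of pi mentioned in the list above rests on a simple piece of geometry: a quarter circle of radius 1 inside the unit square covers pi/4 of the square's area, so the fraction of uniformly random points that fall inside it approximates pi/4. A minimal sketch using only the standard library follows; the function name `estimate_pi` and the sample size are assumptions for illustration, and this is not the code in calcPi.py.

```python
import random


def estimate_pi(n_points, seed=None):
    """Estimate pi from the share of random points inside the quarter circle."""
    rng = random.Random(seed)  # seeding makes the run reproducible
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()  # uniform point in the unit square
        if x * x + y * y <= 1.0:           # inside the quarter circle
            inside += 1
    return 4.0 * inside / n_points


pi_estimate = estimate_pi(200_000, seed=42)
print(pi_estimate)
```

The estimate improves slowly: the error shrinks roughly with the square root of the number of points.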

####Grouping the data – aggregation, filtering, and transformation

  • File: groupData.py
  • Grouping
  • Aggregation
  • Filtering
  • Transformation
  • Miscellaneous operations
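
To make the difference between aggregation, filtering, and transformation concrete, here is a small pandas sketch on made-up data (the column names `group` and `value` are assumptions; this is not the code in groupData.py).

```python
import pandas as pd

# Hypothetical data: two groups, 'a' and 'b'.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0, 30.0],
})
grouped = df.groupby("group")

# Aggregation: one summary row per group.
summary = grouped["value"].agg(["mean", "sum"])

# Filtering: keep only the rows of groups whose total exceeds 10.
large = grouped.filter(lambda g: g["value"].sum() > 10)

# Transformation: same length as the input; here each value is
# centered on the mean of its own group.
centered = grouped["value"].transform(lambda s: s - s.mean())

print(summary)
print(large)
print(centered)
```

Aggregation shrinks the data to one row per group, filtering drops whole groups, and transformation returns a result aligned with the original rows.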

####Random sampling – splitting a dataset in training and testing datasets

  • File: splitDataTrainTest.py
  • Method 1: using the Customer Churn Model
  • Method 2: using sklearn
  • Method 3: using the shuffle function
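
Method 3 can be sketched with the standard library alone: shuffle the rows, then cut the shuffled list at the desired fraction. The helper name `shuffle_split` and the 80/20 ratio are assumptions for this example, not taken from splitDataTrainTest.py.

```python
import random


def shuffle_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 observations
train, test = shuffle_split(rows)
print(len(train), len(test))
```

scikit-learn's `train_test_split` (in `sklearn.model_selection` in current releases) follows the same shuffle-and-cut idea with more options.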

####Concatenating and appending data

  • File: concatenateAndAppend.py
  • File: appendManyFiles.py

####Merging/joining datasets

  • File: mergeJoin.py
  • Inner Join
  • Left Join
  • Right Join
  • An example of the Inner Join
  • An example of the Left Join
  • An example of the Right Join
  • Summary of Joins in terms of their length
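
The three join types and their resulting lengths can be compared on two tiny made-up tables. The frames and column names below are assumptions for illustration, not the data used in mergeJoin.py.

```python
import pandas as pd

# Hypothetical tables sharing the column 'key'; keys are unique in each table.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4, 5], "right_val": ["w", "x", "y", "z"]})

# Inner join: only keys present in both tables.
inner = pd.merge(left, right, on="key", how="inner")

# Left join: every row of the left table, NaN where the right has no match.
left_join = pd.merge(left, right, on="key", how="left")

# Right join: every row of the right table, NaN where the left has no match.
right_join = pd.merge(left, right, on="key", how="right")

print(len(inner), len(left_join), len(right_join))
```

When keys are unique, the left join has as many rows as the left table, the right join as many as the right table, and the inner join at most as many as the smaller of the two. Duplicate keys can make any of these results longer.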

###Chapter 4: Statistical Concepts for Predictive Modelling

####Random sampling and central limit theorem

####Hypothesis testing

  • Null versus alternate hypothesis
  • Z-statistic and t-statistic
  • Confidence intervals, significance levels, and p-values
  • Different kinds of hypothesis test
  • A step-by-step guide to do a hypothesis test
  • An example of a hypothesis test
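
As one possible worked example of the steps listed above, here is a two-sided one-sample z-test using only the standard library. The numbers (sample mean 52, hypothesised mean 50, known standard deviation 10, n = 100) are invented for illustration and are not taken from the book.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_std, n):
    """Return the z-statistic and the two-sided p-value."""
    z = (sample_mean - pop_mean) / (pop_std / math.sqrt(n))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Null hypothesis: the population mean is 50. Alternate: it is not 50.
z, p = one_sample_z_test(sample_mean=52.0, pop_mean=50.0, pop_std=10.0, n=100)

alpha = 0.05             # significance level
reject_null = p < alpha  # reject the null hypothesis if p is below alpha
print(z, p, reject_null)
```

The z-test assumes the population standard deviation is known; with an estimated standard deviation and a small sample, the t-statistic is the usual choice instead.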

####Chi-square testing

####Correlation

  • File: linearRegression.py
  • File: linearRegressionFunction.py
  • Picture: TVSalesCorrelationPlot.png
  • Picture: RadioSalesCorrelationPlot.png
  • Picture: NewspaperSalesCorrelationPlot.png

###Chapter 5: Linear Regression with Python

####Understanding the maths behind linear regression

  • Linear regression using simulated data
  • File: linearRegression.py
  • Picture: CurrentVsPredicted1.png
  • Picture: CurrentVsPredictedVsMean1.png
  • Picture: CurrentVsPredictedVsModel1.png

####Making sense of result parameters

  • File: linearRegression.py
  • p-values
  • F-statistics
  • Residual Standard Error (RSE)
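
To show where the Residual Standard Error comes from, here is a small sketch that fits a one-variable least-squares line by hand and computes the RSE as sqrt(RSS / (n - 2)). The data points and function names are assumptions for this example, not the contents of linearRegression.py; p-values and the F-statistic are not computed here.

```python
import math


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_standard_error(x, y, intercept, slope):
    """RSE = sqrt(RSS / (n - 2)) for a model with one predictor."""
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(rss / (len(x) - 2))


# Hypothetical data lying close to the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(x, y)
rse = residual_standard_error(x, y, intercept, slope)
print(intercept, slope, rse)
```

A smaller RSE means the observed points sit closer to the fitted line, in the units of the response variable.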

####Implementing linear regression with Python

  • File: linearRegressionSMF.py
  • Linear regression using the statsmodels library
  • Multiple linear regression
  • Multi-collinearity: correlated predictor variables lead to sub-optimal performance of the model
  • Variance Inflation Factor (VIF)
  • It quantifies how much the variance of a coefficient estimate is inflated by high correlation between two or more predictor variables.
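
For reference, the usual definition of the Variance Inflation Factor for predictor j is given below, where R_j^2 is the coefficient of determination obtained by regressing predictor j on all the other predictors. The notation is my own and is not copied from the book.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

A VIF of 1 means predictor j is uncorrelated with the others. A common rule of thumb treats values above roughly 5 as a warning sign, although the exact cut-off varies between sources.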

####Model validation

  • Training and testing data split
  • File: linearRegressionSMF.py
  • Linear regression with scikit-learn
  • File: linearRegressionSKL.py
  • Feature selection with scikit-learn
  • Recursive Feature Elimination (RFE)
  • File: linearRegressionRFE.py

####Handling other issues in linear regression

  • Handling categorical variables
  • File: linearRegressionECom.py
  • Transforming a variable to fit non-linear relations
  • File: nonlinearRegression.py
  • Picture: MPGVSHorsepower.png
  • Picture: MPGVSHorsepowerVsLine.png
  • Picture: MPGVSHorsepowerModels.png
  • Handling outliers
  • Other considerations and assumptions for linear regression

###Chapter 6: Logistic Regression with Python

####Linear regression versus logistic regression

####Understanding the math behind logistic regression

  • File: logisticRegression.py
  • Contingency tables
  • Conditional probability
  • Odds ratio
  • Moving on to logistic regression from linear regression
  • Estimation using the Maximum Likelihood Method
  • Building the logistic regression model from scratch
  • File: logisticRegressionScratch.py
  • Read above again.
  • Making sense of logistic regression parameters
  • Wald test
  • Likelihood Ratio Test statistic
  • Chi-square test
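
The step from contingency tables to logistic regression can be illustrated with a few lines of standard-library Python: conditional probabilities give odds, the ratio of two odds is the odds ratio, and the log of the odds (the logit) is the quantity that logistic regression models as a linear function. The 2x2 table below is invented for this example and is not from logisticRegression.py.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


def sigmoid(z):
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical 2x2 contingency table: (event count, no-event count) per group.
table = {"exposed": (30, 70), "unexposed": (10, 90)}

# Conditional probability of the event given the group.
p_exposed = table["exposed"][0] / sum(table["exposed"])
p_unexposed = table["unexposed"][0] / sum(table["unexposed"])

odds_ratio = odds(p_exposed) / odds(p_unexposed)

# The logit and the sigmoid undo each other.
logit_exposed = math.log(odds(p_exposed))
recovered = sigmoid(logit_exposed)

print(odds_ratio, logit_exposed, recovered)
```

An odds ratio above 1 means the event is more likely in the first group than in the second.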

####Implementing logistic regression with Python

  • File: logisticRegressionImplementation.py
  • Processing the data
  • Data exploration
  • Data visualization
  • Creating dummy variables for categorical variables
  • Feature selection
  • Implementing the model

####Model validation and evaluation

  • File: logisticRegressionImplementation.py
  • Cross validation

####Model validation

  • File: logisticRegressionImplementation.py
  • The ROC curve {see terms}

###Chapter 7: Clustering with Python

####Introduction to clustering – what, why, and how?

  • What is clustering?
  • How is clustering used?
  • Why do we do clustering?

####Mathematics behind clustering

  • Distances between two observations
  • Euclidean distance
  • Manhattan distance
  • Minkowski distance
  • The distance matrix
  • Normalizing the distances
  • Linkage methods
  • Single linkage
  • Complete linkage
  • Average linkage
  • Centroid linkage
  • Ward's method: uses an ANOVA-style criterion (within-cluster variance)
  • Hierarchical clustering
  • K-means clustering
  • File: kMeanClustering.py
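
The three distances in the list above are closely related: the Minkowski distance with p = 1 is the Manhattan distance, and with p = 2 it is the Euclidean distance. A minimal sketch with plain Python follows; the function names and sample points are assumptions, not code from kMeanClustering.py.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    return minkowski(a, b, 1)


def euclidean(a, b):
    return minkowski(a, b, 2)


# Hypothetical observations with two features each.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# The distance matrix holds the pairwise distance for every pair of points.
distance_matrix = [[euclidean(p, q) for q in points] for p in points]

print(manhattan(points[0], points[1]))
print(euclidean(points[0], points[1]))
print(distance_matrix)
```

Because these distances depend on the scale of each feature, the features are usually normalized before the matrix is computed.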

####Implementing clustering using Python

  • File: clusterWine.py
  • Importing and exploring the dataset
  • Normalizing the values in the dataset
  • Hierarchical clustering using scikit-learn
  • K-Means clustering using scikit-learn
  • Interpreting the cluster

####Fine-tuning the clustering

  • The elbow method
  • Silhouette Coefficient

###Chapter 8: Trees and Random Forests with Python

####Introducing decision trees

  • A decision tree

####Understanding the mathematics behind decision trees

  • Homogeneity
  • Entropy
  • Information gain
  • ID3 algorithm to create a decision tree
  • Gini index
  • Reduction in Variance
  • Pruning a tree
  • Handling a continuous numerical variable
  • Handling a missing value of an attribute
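
Entropy, the Gini index, and information gain from the list above can be computed in a few lines. This sketch uses made-up class labels and my own function names; it illustrates the measures only and is not an implementation of ID3.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 5 'yes' and 5 'no' labels, and one candidate split.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

gain = information_gain(parent, children)
print(entropy(parent), gini(parent), gain)
```

A perfectly homogeneous node has entropy 0 and Gini index 0; a split is preferred when it yields a larger information gain.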

####Implementing a decision tree with scikit-learn

  • File: decisionTreeIris.py
  • Visualizing the tree
  • Picture: dtree2.png
  • File: dtree2.dot
  • Cross-validating and pruning the decision tree

####Understanding and implementing regression trees

  • File: regressionTree.py
  • Regression tree algorithm
  • Implementing a regression tree using Python

####Understanding and implementing random forests

  • File: randomForest.py
  • The random forest algorithm
  • Implementing a random forest using Python
  • Why do random forests work?
  • Important parameters for random forests

###Chapter 9: Best Practices for Predictive Modelling

####Best practices for coding

  • Commenting the code
  • Defining functions for substantial individual tasks
  • Example 1
  • Example 2
  • Example 3
  • Avoid hard-coding of variables as much as possible
  • Version control
  • Using standard libraries, methods, and formulas

####Best practices for data handling

####Best practices for algorithms

####Best practices for statistics

####Best practices for business contexts
