
kkm24132/Mentoring_Enablement


1. The Premise

What does it take to execute our strategy, or to pursue a passion beyond a short burst of enthusiasm? Education and learning are key to success and empowerment. Drawing on my experience in Data Science and AI (solving business problems at scale for global clients across industries, mentoring and building teams, and defining roadmaps and maturity assessments), I strongly recommend structured thinking and planning to reach goals and outcomes that create impact and value.

The content here focuses on the following:

  • a) My recommended 7 + 1 steps for Data Science and AI mentoring approach
  • b) A quick guideline / checklist of what to look for across various levels (Beginner, Intermediate, etc.)
  • c) Why we need Data and AI, and typical business use cases to get a feel for
  • d) References about some key areas

2. Data Science and AI Learning Approach - My 7+1 steps

(Step -2, -1, 0, 1, 2, 3, 4 and Infinity)

Objective

The goal is to provide some thoughts and pointers around Data Science and AI so that learners, aspirants, and practitioners can find a sense of direction. In return, I get an opportunity to learn from them, from my team, and from everyone I interact and collaborate with around this theme.

  • Target Audience wise pointers:
    • If you are a student / fresher with no experience - Focus initially on Core, Breadth and Depth of Data Science and prioritize some key areas such as Probability, Statistics, Linear Algebra, Data Visualization, Data Munging, Data Wrangling, EDA, Fundamentals of ML methods
    • If you are a working professional with no prior DS experience - All above points + Business use case focus + SDLC process + Cloud essentials + Productionization of ML solutions + Value from Data and AI etc
    • If you are a working professional with 0-10 yrs of experience and working as a DS practitioner - You would have gone through most of these. However, please continue to focus and accelerate your understanding. More focus on DS/ML best practices, end to end implementation aspects
    • If you are a working professional with 10+ yrs of experience and working as a DS practitioner - All points + Thought Leadership
    • All DS practitioners should maintain a GitHub presence to some extent (the extent varies based on roles such as Research Scientist, Principal Data Scientist, Applied Data and AI Scientist, ML Engineer, Data Analyst, Data Scientist, Data Engineer, Data Journalist, Director Data Science, VP Data Science, Chief Data Scientist, etc.)
    • For all roles, an innovation mindset is encouraged: patents, publications, and writing blogs on forums such as Medium.com

My Recommendation

Data Science is a journey, and as with anything else we want to succeed at, there are no shortcuts. Let me also be very clear with my mentees and with whoever is reading this content: the material we need depends on the objectives we intend to pursue. The foundations and their interpretation may vary depending on what we want to accomplish in a given span, whether as a Data Science Practitioner, Applied Data Scientist, Research Scientist, and so on. Each role expects a different flavour of understanding and focus.

I would like to create an outline as an initial draft approach.

3. (Step: -2): Pre-Requisites: Mindset

  • Mindset of holistic approach and experimental / iterative way of exploring methods
  • Mindset of combinatorial concepts around Mathematics + Computer Science + Statistics + Programming + Story telling
  • Mindset of Data Science Method/Approach to Success; Outcome and Impact driven objectives
  • How to think like a Computer Scientist

Focus on a business problem, understand the business KPIs and drivers required for goal formulation, and define the strategy accordingly, aligning with the CRISP-DM methodology from an end-to-end Data Science perspective. The diagram below depicts a high-level outline of the approach.

Figure: business problem solving using CRISP-DM

4. (Step: -1) Fundamentals of Building Blocks: Statistics, Linear Algebra, Programming, Need for ML
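As a minimal hands-on illustration of these building blocks (the sample numbers below are made up purely for illustration), the NumPy sketch covers descriptive statistics, solving a linear system, and a least-squares line fit, which is the simplest bridge from linear algebra to ML:

```python
import numpy as np

# Descriptive statistics on a small made-up sample
sample = np.array([2.3, 3.1, 4.8, 5.0, 6.2, 7.9])
print("mean:", sample.mean())
print("sample std:", sample.std(ddof=1))   # ddof=1 gives the unbiased sample estimate

# Linear algebra: solve Ax = b, a routine building block behind linear regression
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
print("solution of Ax = b:", np.linalg.solve(A, b))

# Least-squares fit of a line y = w0 + w1*x on made-up points
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
design = np.column_stack([np.ones_like(xs), xs])
coef, *_ = np.linalg.lstsq(design, ys, rcond=None)
print("intercept, slope:", coef)
```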

5. (Step: 0) DS/AI Ecosystem, Methodologies and DV/EDA

  • Methodology: Fundamentals of CRISP-DM. One must understand the key aspects of the CRISP-DM methodology and the applications / use cases of Machine Learning. This repository focuses on some tutorials using open datasets. The Venn diagram shows how the specific dimensions are related.

Figure: CRISP-DM process; Figure: Data Science Venn diagram

CRISP-DM stands for Cross Industry Standard Process for Data Mining. It describes the process phases / lifecycle stages of a typical data science program: 1) Business Understanding, 2) Data Understanding, 3) Data Preparation, 4) Model Development, 5) Model Evaluation, and 6) Deployment.
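To make the lifecycle concrete, here is an illustrative Python skeleton; every function below is a hypothetical placeholder (not any library's API) that you would fill in for a specific business problem:

```python
# Illustrative CRISP-DM skeleton; each function is a placeholder for real work.

def business_understanding():
    # Phase 1: frame the business KPI as a data/ML problem
    return {"objective": "reduce customer churn", "target_metric": "recall"}

def data_understanding(spec):
    # Phase 2: profile available data sources against the objective
    return {"source": "crm_export.csv", "rows": 100_000}

def data_preparation(profile):
    # Phase 3: cleaning, feature engineering, train/test split
    return "prepared_dataset"

def modeling(dataset):
    # Phase 4: train one or more candidate models
    return "candidate_model"

def evaluation(model, spec):
    # Phase 5: check the model against the business metric, not just accuracy
    return True

def deployment(model):
    # Phase 6: package, serve, and monitor the model in production
    print("deploying", model)

spec = business_understanding()
dataset = data_preparation(data_understanding(spec))
model = modeling(dataset)
if evaluation(model, spec):
    deployment(model)
```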

6. (Step: 1) Machine Learning Fundamentals

  • Go through the concepts and methodology, the end-to-end lifecycle stages, and an in-depth focus on the following key techniques
  • Core ML Methods (a minimal scikit-learn sketch follows this list)
    • Regression Techniques - Regression YouTube Ref
    • Classification (Decision Trees, Random Forest, XGBoost, CatBoost, AdaBoost, LightGBM, SVM etc.)
    • Clustering (examples: K-Means, Hierarchical - including Agglomerative, DBSCAN: Density-Based Spatial Clustering of Applications with Noise, Affinity Propagation, BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, Mean-Shift, OPTICS: Ordering Points To Identify the Clustering Structure, Spectral, Expectation-Maximization using Gaussian Mixture Models - GMM), More Algo details
    • Anomaly Detection
    • Time series Forecasting
    • Recommendation
    • Association
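Below is a minimal scikit-learn sketch exercising two of the method families above, classification with a random forest and clustering with K-Means, on a bundled toy dataset; the dataset and hyperparameters are chosen purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Supervised: random forest classification with a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Unsupervised: K-Means clustering on the same features (labels are ignored)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```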

Figure: ML concepts

Figure: ML lifecycle

All about Core Machine Learning - Supervised, Unsupervised, Reinforcement

7. (Step: 2) Deep Learning Fundamentals

  • Deep Learning by Yoshua Bengio, Ian Goodfellow and Aaron Courville (2015)
  • Deep Dive around Deep Learning with D2L.ai and Core Foundational aspects as well
  • deeplearning.ai from Andrew Ng
    • Watch DeepLearning.ai course lectures I, II, IV and V (2a)
    • Watch DeepLearning.ai course lecture III (2c)
    • Go through DeepLearning.ai assignments (2d)
  • fast.ai from Jeremy Howard and R. Thomas
    • Go through fast.ai Part 1 (2b)
  • Repeat steps 2a through 2d (2e)

The recommended learning sequence could be 2a, 2b, 2c, 2d, 2e, and so on.
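To make the moving parts of a neural network concrete alongside these courses, here is a minimal Keras sketch (assuming TensorFlow/Keras is installed; the MNIST dataset ships with Keras) of a small fully connected classifier:

```python
from tensorflow import keras

# Load the bundled MNIST digits and scale pixels to [0, 1]
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Small fully connected network: flatten -> hidden ReLU layer -> softmax over 10 digits
model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=2, validation_split=0.1)
print("test loss/accuracy:", model.evaluate(x_test, y_test, verbose=0))
```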

Other references on fundamentals:

8. (Step: 3) Delve into some depth around ML and DL

  • Understand the use case themes and use cases that potentially need to be solved with the help of ML, DL, and AI as a whole. Please refer here for a glance at some use cases
  • Go through a few use cases and try solving them, whether from forums such as Kaggle, from your existing ecosystem or firm, or something equivalent. This gives first-hand exposure to end-to-end applied Data Science.
  • Go deeper with some interview experience, questions, and references to prepare yourself for the next level - Preparation References
  • Use Kaggle and similar forums - Check here for some top forums to hone your skills
  • Understand a framework and try to use it - PyTorch, Keras, Tensorflow
  • Write effective technical blogs
  • AutoML capability focus: for example, try to explore the libraries below (a short TPOT sketch follows this list)
    • auto-sklearn
      • If your priority is a simple, clean interface and relatively quick results
      • Natural integration with sklearn
      • Works with commonly used models and methods
      • control over timing
    • TPOT (Tree based Pipeline Optimization Tool)
      • If the priority is accuracy, even at the cost of potentially long training times
      • Emphasis on advanced pre-processing methods
      • Outputs Python code for the best models
    • HyperOpt-sklearn
    • AutoKeras
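As one concrete AutoML starting point, here is a minimal TPOT sketch (assuming the tpot package is installed; the search budget is kept tiny so it finishes quickly) that also shows the "outputs Python code for the best model" behaviour noted above:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier  # assumes the `tpot` package is installed

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Very small search budget so the example runs quickly; increase for real use
tpot = TPOTClassifier(generations=3, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)

print("held-out score:", tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # writes Python code for the best pipeline found
```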

9. (Step: 4) Deep dive into ML and DL

10. (Step: Infinity) Continuous Learning, Stay Current

This is all about continuous learning - what I refer to as CD learning: Continuous Deep Learning. The sky is the limit. Keep yourself up to date and continue to learn; there is no end to it. Learn some basics around Generative AI:
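As one small, hedged starting point (assuming the Hugging Face transformers package and the publicly available gpt2 checkpoint), text generation can be tried in a few lines:

```python
from transformers import pipeline  # assumes the `transformers` package is installed

# Small, publicly available model; larger generative models follow the same interface
generator = pipeline("text-generation", model="gpt2")

prompt = "Data science mentoring works best when"
outputs = generator(prompt, max_new_tokens=30, num_return_sequences=1)
print(outputs[0]["generated_text"])
```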

Figure: sample ML pipeline using GCP

11. Additional References

Business Use Cases by Industry (Illustrative)

Industry / Domain Area | Use Case Description
[BFSI / FinTech]
Banking and Financial Services
Capital Markets
Insurance
  • 1: Customer Segmentation, Customer Micro-Segmentation
  • 2: Risk Analytics and Regulation, Compliance
  • 3: Cross Selling and Up-selling
  • 4: Predictive Maintenance
  • 5: Customer Life Time Value Analysis
  • 6: Sales and Marketing Campaign Management
  • 7: Evaluation of Credit Worthiness
[Retail and CPG]
Retail
Consumer Packaged Goods
  • 1: Predictive Inventory Planning, Predictive Maintenance
  • 2: Recommendation Engines
  • 3: Upsell and Cross Channel Marketing
  • 4: Market Segmentation and Targeting
  • 5: Market Basket Analysis with Association Rules
  • 6: Customer ROI and Life time value analysis
[Healthcare and Life Sciences]
Healthcare
Life Sciences
  • 1: Personalization of Patient Care
  • 2: Proactive Health Management
  • 3: Patient Triage Optimization
  • 4: Alerts and Diagnostics from Real Time Patient Data
  • 5: Disease Identification and Risk Stratification
  • 6: Healthcare Provider Sentiment Analysis
[Travel and Hospitality]
Travel & Logistics
Hospitality Services
  • 1: Price Optimization, Dynamic Pricing
  • 2: Aircraft Scheduling
  • 3: Social Media Consumer Feedback and Interaction Analysis
  • 4: Customer Complaint Resolution
  • 5: Traffic Patterns and Congestion Management, Route Optimization
[Manufacturing]
Manufacturing
  • 1: Predictive Maintenance
  • 2: Sales and Demand Forecasting
  • 3: Process Optimization
  • 4: Telematics
  • 5: Warranty Analytics (Warranty reserve estimation)
  • 6: Procurement and Spend Analytics
  • Some other use cases could be described as follows:
    • Improving the aftermath management of an event such as earthquake or equivalent natural disaster
    • Preventing gang and gun violence using SMA (Social Media Analytics)
    • Applying Deep Learning and AI to detect wildfires and help prevent the same

ML is most advantageous where large datasets are available. Large-scale deployments of ML are beneficial in terms of improved velocity and accuracy. ML helps capture non-linearity in the data and, from a supervised learning standpoint, learns a function mapping inputs to outputs. Many techniques across supervised, unsupervised, and reinforcement learning can be applied. By and large, this enables better profiling of customers to understand their needs, serve them better, and reduce customer attrition.

Quick Guideline points for Beginner and Intermediate levels

Level in Data and AI | Guideline or Checklist Points
[Level 1]
Beginner Level
Level 1: Beginner Stage
  • 1: Academic background in Mathematics and Statistics
  • 2: Exposure to programming skills
  • 2.1: Fundamental concepts of programming languages such as Python and R
  • 2.2: Style aspect for Python and R - mentioned above as a suggestion
  • 2.3: Readability - Has comments, indentation as per style guide
  • 2.4: Modular - code is broken into small parts, functions, sub-routines as needed
  • 2.5: Flow of control - code should perform what it is meant to do (a short example follows this table)
  • 3: Understanding of Story telling
  • 4: Understanding of Methodology to drive business problems to data problems
  • 5: Understanding of Business KPIs and Drivers
  • 6: Exposure to environment, tools and technologies at a high level
  • 7: Exposure to Python or R
  • 8: Data Visualization and EDA
[Level 2]
Intermediate Level
Level 2: Intermediate Stage
  • 1: Understanding of all that is required at a "Beginner Level"
  • 2: Ability to formulate different techniques for a problem
  • 3: Familiarity with Python and R, with a strong grip on at least one of them
  • 4: Strong applied skills in EDA
  • 5: Strong story telling and Data visualization
  • 6: Machine Learning
  • 7: Deep Learning
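For guideline points 2.3 through 2.5 above, here is a tiny illustration of what "readable, modular, and doing what it is meant to do" can look like in Python; the function itself is a made-up example:

```python
def churn_rate(customers_start: int, customers_lost: int) -> float:
    """Return the churn rate for a period as a fraction between 0 and 1.

    A small, single-purpose function: clear name, docstring, input validation,
    and one obvious flow of control.
    """
    if customers_start <= 0:
        raise ValueError("customers_start must be positive")
    if not 0 <= customers_lost <= customers_start:
        raise ValueError("customers_lost must be between 0 and customers_start")
    return customers_lost / customers_start


print(churn_rate(customers_start=200, customers_lost=14))  # 0.07
```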

Data Visualization and Storytelling

Below are some references that can be used for learning (by no means an exhaustive list):

Apache Superset - for Data exploration and visualization

Reference: https://github.com/apache/incubator-superset - Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application.
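Before reaching for a BI tool, a few lines of pandas and matplotlib already go a long way for quick EDA; a minimal sketch on a small made-up dataframe:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small made-up dataset purely for illustration
df = pd.DataFrame({
    "region": ["North", "South", "East", "West", "North", "South"],
    "revenue": [120, 95, 140, 80, 135, 100],
    "customers": [30, 25, 32, 20, 33, 27],
})

print(df.describe())                           # quick numeric summary
print(df.groupby("region")["revenue"].mean())  # univariate view by segment

# Bivariate view: does revenue scale with customer count?
df.plot.scatter(x="customers", y="revenue", title="Revenue vs customers")
plt.tight_layout()
plt.show()
```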

Miscellaneous References

Learning Reference for Probability and Stats

  • Towardsdatascience
  • Elitedatascience
  • Khan Academy
  • OpenIntro
  • Exam Solutions
  • Seeing Theory
  • OLI
  • Class Central
  • Alison
  • Guru99

Sites / References to learn Python

Datasets For Exploration and Usage

Computer Vision

Computer Vision is a sub-branch of AI whose objective is to give computers the ability to understand their surroundings primarily by seeing, rather than by hearing or feeling, much as humans do. It mimics, in a limited way, the human ability to interpret a scene by learning certain aspects of it. Some applications of Computer Vision are listed below, followed by a minimal edge-detection sketch:

  • Controlling processes
  • Navigation
  • Organizing set of information
  • Automatic inspection
  • Modeling objects or environments
  • Detecting events
  • Recognizing objects
  • Recognizing actions
  • Tracking objects in action
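As a first hands-on step (assuming the opencv-python package; the image path below is a placeholder for any local image), classic edge detection already illustrates the "seeing" part:

```python
import cv2  # assumes the `opencv-python` package is installed

# Placeholder path; substitute any local image file
image = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)
if image is None:
    raise FileNotFoundError("sample.jpg not found; point this at a local image")

# Smooth the image, then detect edges with the Canny detector
blurred = cv2.GaussianBlur(image, (5, 5), 0)
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)

cv2.imwrite("edges.jpg", edges)
print("edge map written to edges.jpg")
```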

NLP

  • Key methods to look for: word2vec, ELMo, ULMFiT, GPT, BERT, RoBERTa, GloVe, InferSent, skip-thought
  • While using AWS Comprehend, try to understand how it works. Refer to AWS Comprehend here for NLP (a minimal pre-trained model sketch follows the list below)
  • ELMo (Embeddings from Language Models): Uses a bi-directional LSTM so that a task-specific model can look at a whole sentence before encoding a word; ELMo's LSTM is trained on a huge text corpus.
  • ULMFiT (Universal Language Model Fine-Tuning): A transfer learning method for NLP tasks that demonstrated techniques key to fine-tuning a language model.
  • GPT (Generative Pre-training Transformer, from OpenAI)
  • BERT (Bi-directional Encoder Representations from Transformers)
  • GloVe (Global Vectors for word representation): An unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a text corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
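A minimal way to touch one of these pre-trained models (assuming the Hugging Face transformers package; the pipeline downloads a default BERT-family sentiment model on first use):

```python
from transformers import pipeline  # assumes the `transformers` package is installed

# Downloads a default pre-trained sentiment model on first use
classifier = pipeline("sentiment-analysis")

texts = [
    "The mentoring sessions were incredibly helpful.",
    "The documentation was confusing and outdated.",
]
for text, result in zip(texts, classifier(texts)):
    print(f"{result['label']:>8}  {result['score']:.3f}  {text}")
```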

Figure: comparison of BERT, GPT, and ELMo

12. Experiments with sample solutions

Part 1: R Programming with univariate and bivariate analysis

These are program components which are used for mentoring purposes

Part 2: Time series Forecasting in R

These are program components which are used for mentoring purposes

Predict Web Page Tags based on its content

Classification of Web page content is vital to many tasks in Web information retrieval, such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification compared to traditional text classification; however, the interconnected nature of hypertext also provides features that can assist the process.

Here the task is to classify web pages into the respective classes they belong to, in a single-label classification setup (each web page can belong to only one class).

Given the complete HTML and URL, predict which of the 9 predefined tags below a web page belongs to (a small baseline sketch follows the list):

  • People profile
  • Conferences/Congress
  • Forums
  • News article
  • Clinical trials
  • Publication
  • Thesis
  • Guidelines
  • Others
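A hedged baseline sketch for this task, using BeautifulSoup for HTML text extraction and TF-IDF with logistic regression; the tiny in-line pages and labels below are made up, and a real solution would also use URL-derived features:

```python
from bs4 import BeautifulSoup  # assumes the `beautifulsoup4` package is installed
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up examples; in practice these come from the crawled HTML corpus
pages = [
    "<html><body><h1>Dr. A. Rao</h1><p>Professor of Oncology, publications and CV</p></body></html>",
    "<html><body><h1>Annual Cardiology Congress 2024</h1><p>Registration, venue and agenda</p></body></html>",
    "<html><body><h1>Phase III trial of drug X</h1><p>Randomized, double-blind study protocol</p></body></html>",
    "<html><body><h1>New treatment guidelines released</h1><p>Updated clinical practice recommendations</p></body></html>",
]
labels = ["People profile", "Conferences/Congress", "Clinical trials", "Guidelines"]

def html_to_text(html: str) -> str:
    """Strip tags and return the visible text of a page."""
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")

texts = [html_to_text(p) for p in pages]

# Single-label baseline: TF-IDF features + logistic regression
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

new_page = "<html><body><h1>Workshop and congress on AI in medicine</h1></body></html>"
print(model.predict([html_to_text(new_page)]))
```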

Other References for Reading

Applied Machine Learning videos reference: https://www.youtube.com/playlist?list=PL_pVmAaAnxIQGzQS2oI3OWEPT-dpmwTfA

Disclaimer: The information represented here is based on my own experience, learning, and reading; it in no way represents any firm's or individual's opinion or strategy, and it is not intended for anything other than self-learning.
