What does it take to execute our strategy, and what does it take to pursue our passion? Education and learning are key to success and empowerment! With my experience in Data Science and AI (solving business problems for global clients across industries at scale, mentoring and building teams for success, defining roadmaps and maturity assessments, and more), I strongly recommend structured "thinking" and "planning" to accomplish our goals and outcomes, creating impact and value.
- a) My recommended 7 + 1 steps for Data Science and AI mentoring approach
- b) Quick guideline points, a checklist of what to look for at various levels (Beginner, Intermediate, etc.)
- c) Why do we need Data and AI, and what typical business use cases should we get a feel for?
- d) References about some key areas
(Step -2, -1, 0, 1, 2, 3, 4 and Infinity)
The goal is to provide some thoughts and pointers around Data Science and AI so that learners, data science aspirants and data science practitioners get a sense of direction. In turn, I get an opportunity to learn from my mentees, my team and everyone I interact and collaborate with around this theme.
- Target Audience wise pointers:
- If you are a student / fresher with no experience - Focus initially on the core, breadth and depth of Data Science, and prioritize key areas such as Probability, Statistics, Linear Algebra, Data Visualization, Data Munging / Wrangling, EDA and the fundamentals of ML methods
- If you are a working professional with no prior DS experience - All of the above + business use case focus + SDLC process + cloud essentials + productionization of ML solutions + value from Data and AI, etc.
- If you are a working professional with 0-10 yrs of experience working as a DS practitioner - You will have gone through most of these already; still, continue to deepen and accelerate your understanding, with more focus on DS/ML best practices and end-to-end implementation aspects
- If you are a working professional with 10+ yrs of experience and working as a DS practitioner - All points + Thought Leadership
- All DS practitioners should maintain a GitHub presence to some extent (how much varies by role: Research Scientist, Principal Data Scientist, Applied Data and AI Scientist, ML Engineer, Data Analyst, Data Scientist, Data Engineer, Data Journalist, Director of Data Science, VP of Data Science, Chief Data Scientist, etc.)
- For all roles, an innovation mindset is encouraged: patents, publications, and writing blogs on forums such as Medium.com
Data Science is a journey and, as with anything else where we want to succeed, there are no shortcuts. Let me also be very clear about some key points for my mentees and whoever is reading this content. The content we need depends on the objectives we intend to pursue. The foundations and their interpretation may vary depending on what we want to accomplish in a specific span - Data Science Practitioner, Applied Data Scientist, Research Scientist and so on. Every role expects a different flavour of understanding and focus.
I would like to create an outline as an initial draft approach.
- Mindset of holistic approach and experimental / iterative way of exploring methods
- Mindset of combinatorial concepts around Mathematics + Computer Science + Statistics + Programming + Story telling
- Mindset of Data Science Method/Approach to Success; Outcome and Impact driven objectives
- How to think like a Computer Scientist
Focus on a business problem, understand the business KPIs and drivers required for goal formulation, and define the strategy accordingly, aligning to the CRISP-DM methodology from an end-to-end Data Science perspective. The diagram below depicts a high-level outline of the approach.
- Understanding fundamental concepts around ingredients such as Statistics, Linear Algebra and Programming
- Probability for Data Science
- R for data science by Hadley W and Garret G
- Check for Statistics Fundamentals:
- Fundamentals of Python Programming - this is just an example; you can also explore R Programming, as well as another option from Stanford, Python Basics, here
- MIT Single Variable Calculus
- Khan Academy Calculus
- Linear Algebra related: MIT reference
- Need for Machine Learning
- You would always need a DATA framework, and it follows the A-C-R-A path:
- Data Must be Adequate
- Data Must be Connected
- Data Must be Relevant
- Data Must be Accurate
- Methodology: Fundamentals of CRISP-DM. One must understand the key aspects of the CRISP-DM methodology and the applications / use cases of Machine Learning. This repository focuses on some tutorials using open datasets. The Venn diagram shows how specific dimensions are related.
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It describes the process phases / steps / lifecycle stages in a typical data science program. The lifecycle stages are 1) Business Understanding, 2) Data Understanding, 3) Data Preparation, 4) Model Development, 5) Model Evaluation and 6) Deployment.
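As a sketch of how these six phases hang together, here is a minimal, purely illustrative loop in plain Python. Every function body here is a hypothetical placeholder (a trivial mean "model"), not any real library; the point is only the phase ordering and the hand-off between stages.

```python
# Illustrative CRISP-DM flow: each phase below is a hypothetical placeholder.
# Real projects iterate between phases rather than running them once.

def run_crisp_dm(raw_data, business_goal):
    understanding = {"goal": business_goal}            # 1) Business Understanding
    profile = {"rows": len(raw_data)}                  # 2) Data Understanding
    prepared = [x for x in raw_data if x is not None]  # 3) Data Preparation
    model = sum(prepared) / len(prepared)              # 4) Model Development (trivial mean model)
    error = max(abs(x - model) for x in prepared)      # 5) Model Evaluation
    return {"model": model, "error": error, **understanding, **profile}  # 6) Deployment hand-off

result = run_crisp_dm([3, None, 5, 7], "estimate average demand")
print(result["model"])  # mean of 3, 5, 7 -> 5.0
```

In practice the Evaluation phase often sends you back to Data Preparation or even Business Understanding; the linear numbering is a guide, not a one-way street.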
- The 80-20 aspect, as I call it: understanding where to spend 80% of the effort in an end-to-end DS journey and where to spend the remaining 20%
- Reference for effective and professional data science coding
- Data Visualization and EDA
- Data Visualization Concepts and Principles
- Intro and AutoEDA using Pandas Profiling here
- AutoEDA using DTale
- AutoEDA using LUX
- AutoEDA using DataPrep
- AutoEDA using SweetViz
- Missing value analysis
- Outlier treatment
- Feature transformation and creation
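The missing value and outlier steps above can be sketched in a few lines of plain Python. This is an illustrative baseline only (Tukey's IQR fences for outliers); real projects would typically use pandas or one of the AutoEDA tools listed above.

```python
import statistics

def missing_rate(values):
    """Fraction of missing (None) entries - basic missing value analysis."""
    return sum(v is None for v in values) / len(values)

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    clean = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(clean, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < lo or v > hi]

data = [10, 12, None, 11, 13, 12, 95, None, 10]
print(missing_rate(data))   # 2 of 9 entries are missing
print(iqr_outliers(data))   # the extreme value 95 is flagged
```

Whether to drop, impute or cap the flagged values is a modelling decision, which is exactly why EDA comes before feature transformation in the list above.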
- Data Science and AI Forums: some of the following Data and AI forums can be considered for learning, participating and collaborating in real-life projects or initiatives:
- Kaggle forum
- DS competitions to build a better world
- Enabling impact organizations to collaborate on projects and deliver solutions quickly
- Crowd Analytics collaboration
- Data camp to solve real world problems
- InnoCentive - Open innovation and crowdsourcing company which primarily focuses on problems dealing with life sciences
- Codalab - Accelerating reproducible computational research
- Go through the concepts and methodology and the end-to-end lifecycle stages, with in-depth focus on the following key techniques
- Core ML Methods
- Regression Techniques - Regression YouTube Ref
- Classification (Decision Trees, Random Forest, XGBoost, CatBoost, AdaBoost, LightGBM, SVM etc.)
- Clustering (examples: K-Means, Hierarchical - with Agglomerative as well, DBSCAN: Density Based Spatial Clustering of Applications with Noise, Affinity Propagation, BIRCH: Balanced Iterative Reducing and Clustering using hierarchies, Mean-Shift, OPTICS: Ordering Points To Identify the Clustering Structure, Spectral, Expectation-Maximization using Gaussian Mixture Model - GMM) , More Algo details
- Anomaly Detection
- Time series Forecasting
- Recommendation
- Association
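To get a feel for the clustering family listed above, here is a minimal from-scratch sketch of K-Means (Lloyd's algorithm) on 1-D data. It is illustrative only; in practice one would use scikit-learn's `KMeans` with proper initialization and convergence checks.

```python
def kmeans_1d(points, centers, iters=20):
    """Minimal Lloyd's algorithm on 1-D data: assign each point to its
    nearest center, then move each center to the mean of its points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Empty clusters keep their previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
print(kmeans_1d(data, centers=[0.0, 5.0]))  # converges near [1.0, 10.0]
```

The same assign-then-update loop generalizes to higher dimensions with Euclidean distance; methods like DBSCAN or GMM from the list above replace this hard assignment with density or probability based reasoning.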
All about Core Machine Learning - Supervised, Unsupervised, Reinforcement
- Feature Engineering tricks here
- Model selection tricks here - with a few visualization examples from the Yellowbrick repo for learning
- Other Learning references
- Auto ML capabilities and practical implementation strategies using various Cloud platforms such as AWS, Azure, IBM and GCP
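Model selection, whether manual or via AutoML, usually rests on cross-validation. Here is a minimal k-fold index splitter in plain Python to show the mechanics; scikit-learn's `KFold` is the production choice.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds; return (train, test)
    index lists so each sample appears in exactly one test fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    splits = []
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        splits.append((train, test))
        start += size
    return splits

for train, test in kfold_indices(6, 3):
    print(train, test)
```

Each candidate model is scored on every held-out fold and the scores are averaged, which is what makes fold-based comparisons between models fair.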
- Deep Learning by Yoshua Bengio, Ian Goodfellow and Aaron Courville (2015)
- Deep Dive around Deep Learning with D2L.ai and Core Foundational aspects as well
- deeplearning.ai from Andrew Ng
- Watch DeepLearning.ai course lectures I, II, IV and V (2a)
- Watch DeepLearning.ai course lecture III (2c)
- Go through DeepLearning.ai assignments (2d)
- fast.ai from Jeremy Howard and R. Thomas
- Go through fast.ai Part 1 (2b)
- Repeat steps 2a through 2d (2e)
The recommended learning sequence could be 2a, 2b, 2c, 2d, 2e.
- Neural Networks and Deep Learning by Michael Nielsen (Dec 2014)
- Deep Learning by MSFT Research (2014)
- CNN for visual recognition - CS231n:Part1: Setting up the architecture
- CNN for visual recognition - CS231n:Part2: Setting up the data and loss
- CNN for visual recognition - CS231n:Part3: Learning and evaluation
Other references on fundamentals:
- PyTorch Fundamentals - Course Material and Course Video
- Intro to deep learning and neural networks
- Improving neural networks with hyperparameter tuning, regularization and other techniques
- CNN from scratch
- Amazon ML University related Computer Vision GitHub reference
- Understand use case themes and use cases that potentially need to be solved with ML, DL and AI as a whole. Please refer here for a glance at some use cases
- Go through a few use cases and try solving them, whether from forums such as Kaggle, from your existing ecosystem or firm, or something equivalent. This gives first-hand exposure to end-to-end applied Data Science.
- Go deeper by gathering interview experience, questions and references to prepare yourself for the next level - Preparation References
- Use Kaggle and similar forums - Check here for some top forums to hone your skills
- Understand a framework and try to use it - PyTorch, Keras, Tensorflow
- Write effective technical blogs
- AutoML capability focus: for example, try to explore below libraries
- auto-sklearn
- If your priority is a simple, clean interface and relatively quick results
- Natural integration with sklearn
- Works with commonly used models and methods
- Control over run time
- TPOT (Tree based Pipeline Optimization Tool)
- If your priority is accuracy, regardless of potentially long training times
- Emphasis on advanced pre-processing methods
- Outputs Python code for the best models
- HyperOpt-sklearn
- AutoKeras
- Dive into Deep Learning - with D2L.ai
- Applications and Use Cases leveraging CNN, RNN, LSTM
- Area wise use cases around - Forecasting, Classification, Clustering, Association etc.
- Research based thinking / analysis to solve novel problems / methods / approaches
- Full stack Deep Learning - deploy AI solutions in real world
This is all about continuous learning - what I refer to as CD learning: Continuous Deep Learning. The sky is the limit. Keep yourself up to date and continue to learn; there is no end to it. Learn some basics around Generative AI:
- ChatGPT Prompt Engineering for Developers from DeepLearning.AI
- ML Pipeline illustrative view using GCP
- MLOps Tooling Landscape
Industry / Domain Area | Use Case Description |
---|---|
BFSI / FinTech: Banking and Financial Services, Capital Markets, Insurance | |
Retail and CPG: Retail, Consumer Packaged Goods | |
Healthcare and Life Sciences: Healthcare, Life Sciences | |
Travel and Hospitality: Travel & Logistics, Hospitality Services | |
Manufacturing | |
- Some other use cases could be described as follows:
- Improving the aftermath management of an event such as earthquake or equivalent natural disaster
- Preventing gang and gun violence using SMA (Social Media Analytics)
- Applying Deep Learning and AI to detect wildfires and help prevent the same
The advantages of ML are most apparent where a large dataset is available. Large-scale deployment of ML improves both velocity and accuracy. It helps capture non-linearity in the data and, from a supervised learning standpoint, generates a function mapping input to output. Many aspects of supervised, unsupervised and reinforcement learning can be applied. By and large, this enables better profiling of customers to understand their needs, serve them better and reduce attrition.
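The "function mapping input to output" idea can be made concrete with the simplest supervised learner, ordinary least squares on a toy dataset. This is a self-contained sketch with made-up numbers, not a production routine.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b: a minimal supervised
    learner that maps an input x to a predicted output y."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Toy supervised data: the output is 2 * input + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)  # recovers slope 2.0 and intercept 1.0 exactly on noiseless data
```

Non-linear methods (trees, boosting, neural networks) learn richer mappings, but the contract is the same: examples of inputs and outputs go in, a predictive function comes out.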
Level in Data and AI | Guideline or Checklist points |
---|---|
Level 1: Beginner | Level 1: Beginner Stage |
Level 2: Intermediate | Level 2: Intermediate Stage |
Below are some references that can be used for learning (by no means an exhaustive list)
- DV guidelines by Edward Tufte
- Storytelling with Data
- Information is Beautiful
- Junk Charts
- The Atlas
- The Pudding
- Flowing Data
- Visualising Data
Reference: Apache Superset (incubating), a modern, enterprise-ready business intelligence web application - https://github.com/apache/incubator-superset
- Towardsdatascience
- Elitedatascience
- Khan Academy
- OpenIntro
- Exam Solutions
- Seeing Theory
- OLI
- Class Central
- Alison
- Guru99
- Python for Beginners - 1
- Python for Beginners - 2
- Learn Python programming in 7 days, Guru99
- Pythonspot
- Code Academy
- TutorialsPoint
- The Python org
- Interactive Python
- Python Tutor
- Awesome Python
- Awesome Python Github Reference
- Full Stack Python
- CheckiO
- Google Datasets
- Data.world
- Kaggle datasets
- US Government Open Datasets
- FiveThirtyEight
- BuzzFeed
- Socrata OpenData
- UCI Machine Learning Repository
- Reddit or R/datasets
- Quandl
- Academic Torrents
- This is a great compilation of Awesome Public DataSets as well
Computer Vision is a sub-branch of AI whose objective is to give computers the powerful ability to understand their surroundings by seeing, more than by hearing or feeling, just as humans do; in a way, it mimics the human ability to interpret by learning certain aspects. Some applications of Computer Vision are as follows:
- Controlling processes
- Navigation
- Organizing set of information
- Automatic inspection
- Modeling objects or environments
- Detecting events
- Recognizing objects
- Recognizing actions
- Tracking objects in action
- Key methods to look for: word2vec, ELMo, ULMFiT, GPT, BERT, RoBERTa, GloVe, InferSent, skip-thought
- While using AWS Comprehend, try to understand how it works. Refer to AWS Comprehend here for NLP
Topic | Description | Remarks |
---|---|---|
ELMo | Embeddings from Language Models | Uses a bi-directional LSTM, trained on a huge text corpus, to look at a whole sentence before encoding a word |
ULMFiT | Universal Language Model Fine-Tuning | A transfer learning method for NLP tasks that demonstrated the techniques key to fine-tuning a language model |
GPT | Generative Pre-training Transformer (OpenAI) | |
BERT | Bi-directional Encoder Representations from Transformer | |
GloVe | Global Vectors for word representation | Unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a text corpus, and the resulting representations showcase interesting linear substructures of the word vector space |
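Several of the methods above, GloVe in particular, are trained on global word-word co-occurrence statistics. Here is a minimal sketch of collecting those counts over a context window; it is illustrative only and is not GloVe itself, which additionally fits weighted log-bilinear vectors to these counts.

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Count how often each ordered word pair appears within `window`
    tokens of each other - the raw statistics GloVe-style models use."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            counts[(w, tokens[j])] += 1
            counts[(tokens[j], w)] += 1
    return dict(counts)

tokens = "the cat sat on the mat".split()
counts = cooccurrence(tokens, window=2)
print(counts[("the", "cat")])  # "the" and "cat" co-occur once within the window
```

On a real corpus this matrix is huge and sparse; the embedding methods compress it into dense vectors whose geometry reflects those co-occurrence patterns.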
- Unsupervised Cross-lingual representative learning
- The State and Fate of linguistic diversity
- References to open datasets could be as follows:
These are program components used for mentoring purposes.
Classification of Web page content is vital to many tasks in Web information retrieval, such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges compared to traditional text classification; however, the interconnected nature of hypertext also provides features that can assist the process.
Here the task is to classify web pages into the classes they belong to, in a single-label classification setup (each web page can belong to only one class).
Given the complete HTML and URL, predict which of the nine predefined tags a web page belongs to:
- People profile
- Conferences/Congress
- Forums
- News article
- Clinical trials
- Publication
- Thesis
- Guidelines
- Others
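A common baseline for this kind of single-label setup is a bag-of-words multinomial Naive Bayes classifier. The sketch below uses tiny made-up training snippets (the texts are hypothetical; only the tag names come from the list above), ignoring the HTML/URL features a real solution would exploit.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, text). Returns per-label word counts and
    label document counts for a multinomial Naive Bayes baseline."""
    word_counts, label_counts = defaultdict(Counter), Counter()
    for label, text in docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict_nb(word_counts, label_counts, text):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with add-one (Laplace) smoothing."""
    vocab = {w for c in word_counts.values() for w in c}
    total_docs = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("News article", "breaking news story published today"),
        ("Clinical trials", "phase two trial patients randomized placebo")]
wc, lc = train_nb(docs)
print(predict_nb(wc, lc, "trial patients randomized"))  # -> Clinical trials
```

Stronger approaches would add URL tokens and HTML structure as features, or fine-tune a pretrained language model, but a smoothed Naive Bayes baseline is the standard first benchmark to beat.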
Applied Machine Learning videos reference: https://www.youtube.com/playlist?list=PL_pVmAaAnxIQGzQS2oI3OWEPT-dpmwTfA
Disclaimer: The information presented here is based on my own experiences, learnings and readings; it in no way represents any firm's or individual's opinion or strategy, and it is intended for self-learning only.