What does it take to execute our strategy, and what does it take to pursue our passion? Education and learning are key to success and empowerment! With my experience in Data Science and AI (solving business problems for global clients across industries at scale, mentoring and building teams for success, defining roadmaps and maturity assessments, and more), I strongly recommend structured "thinking" and "planning" to accomplish our goals and outcomes, creating impact and value.
- a) My recommended 7 + 1 steps for Data Science and AI mentoring approach
- b) Quick guideline points, a checklist of what to look for at various levels (Beginner, Intermediate, etc.)
- c) Why do we need Data and AI, and what typical business use cases should we get a feel for?
- d) References about some key areas
(Step -2, -1, 0, 1, 2, 3, 4 and Infinity)
The goal is to provide some thoughts and pointers around Data Science and AI so that learners, data science aspirants and data science practitioners get a sense of direction. In turn, I get an opportunity to learn from my mentees, my team and everyone I interact and collaborate with around this theme.
- Target Audience wise pointers:
- If you are a student / fresher with no experience - Focus initially on the core, breadth and depth of Data Science, and prioritize key areas such as Probability, Statistics, Linear Algebra, Data Visualization, Data Munging / Wrangling, EDA and the fundamentals of ML methods
- If you are a working professional with no prior DS experience - All of the above + business use case focus + SDLC process + cloud essentials + productionization of ML solutions + value from Data and AI, etc.
- If you are a working professional with 0-10 yrs of experience working as a DS practitioner - You will have gone through most of these already; still, continue to deepen and accelerate your understanding, with more focus on DS/ML best practices and end-to-end implementation aspects
- If you are a working professional with 10+ yrs of experience and working as a DS practitioner - All points + Thought Leadership
- All DS practitioners should maintain a GitHub presence to some extent (how much varies by role: Research Scientist, Principal Data Scientist, Applied Data and AI Scientist, ML Engineer, Data Analyst, Data Scientist, Data Engineer, Data Journalist, Director of Data Science, VP of Data Science, Chief Data Scientist, etc.)
- For all roles, an innovation mindset is encouraged: patents, publications, and writing blogs on forums such as Medium.com
Data Science is a journey and, as with anything else where we want to succeed, there are no shortcuts. Let me also be very clear about some key points for my mentees and whoever is reading this content. The content we need depends on the objectives we intend to pursue. The foundations and their interpretation may vary depending on what we want to accomplish in a specific span - Data Science Practitioner, Applied Data Scientist, Research Scientist and so on. Every role expects a different flavour of understanding and focus.
I would like to create an outline as an initial draft approach.
- Mindset of holistic approach and experimental / iterative way of exploring methods
- Mindset of combinatorial concepts around Mathematics + Computer Science + Statistics + Programming + Story telling
- Mindset of Data Science Method/Approach to Success; Outcome and Impact driven objectives
- How to think like a Computer Scientist
Focus on a business problem, understand the business KPIs and drivers required for goal formulation, and define the strategy accordingly, aligning to the CRISP-DM methodology from an end-to-end Data Science perspective. The diagram below depicts a high-level outline of the approach.
- Understanding fundamental concepts around ingredients such as Statistics, Linear Algebra and Programming
- Probability for Data Science
- R for data science by Hadley W and Garret G
- Check for Statistics Fundamentals:
- Fundamentals of Python Programming - this is just an example; you can also explore R Programming, as well as another option from Stanford, Python Basics, here
- MIT Single Variable Calculus
- Khan Academy Calculus
- Linear Algebra related: MIT reference
- Need for Machine Learning
- You would always need a DATA framework, and it follows the A-C-R-A path:
- Data Must be Adequate
- Data Must be Connected
- Data Must be Relevant
- Data Must be Accurate
- Methodology: Fundamentals of CRISP-DM. One must understand the key aspects of the CRISP-DM methodology and the applications / use cases of Machine Learning. This repository focuses on some tutorials using open datasets. The Venn diagram shows how specific dimensions are related.
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It describes the process phases / steps / lifecycle stages in a typical data science program. The lifecycle stages are 1) Business Understanding, 2) Data Understanding, 3) Data Preparation, 4) Model Development, 5) Model Evaluation and 6) Deployment.
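As a sketch of how these six phases hang together, here is a minimal, purely illustrative loop in plain Python. Every function body here is a hypothetical placeholder (a trivial mean "model"), not any real library; the point is only the phase ordering and the hand-off between stages.

```python
# Illustrative CRISP-DM flow: each phase below is a hypothetical placeholder.
# Real projects iterate between phases rather than running them once.

def run_crisp_dm(raw_data, business_goal):
    understanding = {"goal": business_goal}            # 1) Business Understanding
    profile = {"rows": len(raw_data)}                  # 2) Data Understanding
    prepared = [x for x in raw_data if x is not None]  # 3) Data Preparation
    model = sum(prepared) / len(prepared)              # 4) Model Development (trivial mean model)
    error = max(abs(x - model) for x in prepared)      # 5) Model Evaluation
    return {"model": model, "error": error, **understanding, **profile}  # 6) Deployment hand-off

result = run_crisp_dm([3, None, 5, 7], "estimate average demand")
print(result["model"])  # mean of 3, 5, 7 -> 5.0
```

In practice the Evaluation phase often sends you back to Data Preparation or even Business Understanding; the linear numbering is a guide, not a one-way street.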
- The 80-20 aspect, as I call it: understanding where to spend 80% of the effort in an end-to-end DS journey and where to spend the remaining 20%
- Reference for effective and professional data science coding
- Data Visualization and EDA
- Data Visualization Concepts and Principles
- Intro and AutoEDA using Pandas Profiling here
- AutoEDA using DTale
- AutoEDA using LUX
- AutoEDA using DataPrep
- AutoEDA using SweetViz
- Missing value analysis
- Outlier treatment
- Feature transformation and creation
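The missing value and outlier steps above can be sketched in a few lines of plain Python. This is an illustrative baseline only (Tukey's IQR fences for outliers); real projects would typically use pandas or one of the AutoEDA tools listed above.

```python
import statistics

def missing_rate(values):
    """Fraction of missing (None) entries - basic missing value analysis."""
    return sum(v is None for v in values) / len(values)

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    clean = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(clean, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < lo or v > hi]

data = [10, 12, None, 11, 13, 12, 95, None, 10]
print(missing_rate(data))   # 2 of 9 entries are missing
print(iqr_outliers(data))   # the extreme value 95 is flagged
```

Whether to drop, impute or cap the flagged values is a modelling decision, which is exactly why EDA comes before feature transformation in the list above.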
- Data Science and AI Forums: some of the following Data and AI forums can be considered for learning, participating and collaborating in real-life projects or initiatives:
- Kaggle forum
- DS competitions to build a better world
- Enabling impact organizations to collaborate on projects and deliver solutions quickly
- Crowd Analytics collaboration
- Data camp to solve real world problems
- InnoCentive - Open innovation and crowdsourcing company which primarily focuses on problems dealing with life sciences
- Codalab - Accelerating reproducible computational research
- Go through the concepts and methodology and the end-to-end lifecycle stages, with in-depth focus on the following key techniques
- Core ML Methods
- Regression Techniques - Regression YouTube Ref
- Classification (Decision Trees, Random Forest, XGBoost, CatBoost, AdaBoost, LightGBM, SVM etc.)
- Clustering (examples: K-Means, Hierarchical - with Agglomerative as well, DBSCAN: Density Based Spatial Clustering of Applications with Noise, Affinity Propagation, BIRCH: Balanced Iterative Reducing and Clustering using hierarchies, Mean-Shift, OPTICS: Ordering Points To Identify the Clustering Structure, Spectral, Expectation-Maximization using Gaussian Mixture Model - GMM) , More Algo details
- Anomaly Detection
- Time series Forecasting
- Recommendation
- Association
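To get a feel for the clustering family listed above, here is a minimal from-scratch sketch of K-Means (Lloyd's algorithm) on 1-D data. It is illustrative only; in practice one would use scikit-learn's `KMeans` with proper initialization and convergence checks.

```python
def kmeans_1d(points, centers, iters=20):
    """Minimal Lloyd's algorithm on 1-D data: assign each point to its
    nearest center, then move each center to the mean of its points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Empty clusters keep their previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
print(kmeans_1d(data, centers=[0.0, 5.0]))  # converges near [1.0, 10.0]
```

The same assign-then-update loop generalizes to higher dimensions with Euclidean distance; methods like DBSCAN or GMM from the list above replace this hard assignment with density or probability based reasoning.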
All about Core Machine Learning - Supervised, Unsupervised, Reinforcement
- Feature Engineering tricks here
- Model selection tricks here - with a few visualization examples from the Yellowbrick repo for learning
- Other Learning references
- Auto ML capabilities and practical implementation strategies using various Cloud platforms such as AWS, Azure, IBM and GCP
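Model selection, whether manual or via AutoML, usually rests on cross-validation. Here is a minimal k-fold index splitter in plain Python to show the mechanics; scikit-learn's `KFold` is the production choice.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds; return (train, test)
    index lists so each sample appears in exactly one test fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    splits = []
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        splits.append((train, test))
        start += size
    return splits

for train, test in kfold_indices(6, 3):
    print(train, test)
```

Each candidate model is scored on every held-out fold and the scores are averaged, which is what makes fold-based comparisons between models fair.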
- Deep Learning by Yoshua Bengio, Ian Goodfellow and Aaron Courville (2015)
- Deep Dive around Deep Learning with D2L.ai and Core Foundational aspects as well
- deeplearning.ai from Andrew Ng
- Watch DeepLearning.ai course lectures I, II, IV and V (2a)
- Watch DeepLearning.ai course lecture III (2c)
- Go through DeepLearning.ai assignments (2d)
- fast.ai from Jeremy Howard and R. Thomas
- Go through fast.ai Part 1 (2b)
- Repeat steps 2a through 2d (2e)
The recommended learning sequence could be 2a, 2b, 2c, 2d, 2e.
- Neural Networks and Deep Learning by Michael Nielsen (Dec 2014)
- Deep Learning by MSFT Research (2014)
- CNN for visual recognition - CS231n:Part1: Setting up the architecture
- CNN for visual recognition - CS231n:Part2: Setting up the data and loss
- CNN for visual recognition - CS231n:Part3: Learning and evaluation
Other references on fundamentals:
- PyTorch Fundamentals - Course Material and Course Video
- Intro to deep learning and neural networks
- Improving neural networks with hyperparameter tuning, regularization and other techniques
- CNN from scratch
- Amazon ML University related Computer Vision GitHub reference
- Understand use case themes and use cases that potentially need to be solved with ML, DL and AI as a whole. Please refer here for a glance at some use cases
- Go through a few use cases and try solving them, whether from forums such as Kaggle, from your existing ecosystem or firm, or something equivalent. This gives first-hand exposure to end-to-end applied Data Science.
- Go deeper by gathering interview experience, questions and references to prepare yourself for the next level - Preparation References
- Use Kaggle and similar forums - Check here for some top forums to hone your skills
- Understand a framework and try to use it - PyTorch, Keras, Tensorflow
- Write effective technical blogs
- AutoML capability focus: for example, try to explore below libraries
- auto-sklearn
- If your priority is a simple, clean interface and relatively quick results
- Natural integration with sklearn
- Works with commonly used models and methods
- Control over run time
- TPOT (Tree based Pipeline Optimization Tool)
- If your priority is accuracy, regardless of potentially long training times
- Emphasis on advanced pre-processing methods
- Outputs Python code for the best models
- HyperOpt-sklearn
- AutoKeras
- Dive into Deep Learning - with D2L.ai
- Applications and Use Cases leveraging CNN, RNN, LSTM
- Area wise use cases around - Forecasting, Classification, Clustering, Association etc.
- Research based thinking / analysis to solve novel problems / methods / approaches
- Full stack Deep Learning - deploy AI solutions in real world
This is all about continuous learning - what I refer to as CD learning: Continuous Deep Learning. The sky is the limit. Keep yourself up to date and continue to learn; there is no end to it. Learn some basics around Generative AI:
- ChatGPT Prompt Engineering for Developers from DeepLearning.AI
- ML Pipeline illustrative view using GCP
- MLOps Tooling Landscape
Industry / Domain Area | Use Case Description |
---|---|
BFSI / FinTech: Banking and Financial Services, Capital Markets, Insurance | |
Retail and CPG: Retail, Consumer Packaged Goods | |
Healthcare and Life Sciences: Healthcare, Life Sciences | |
Travel and Hospitality: Travel & Logistics, Hospitality Services | |
Manufacturing | |
- Some other use cases could be described as follows:
- Improving the aftermath management of an event such as earthquake or equivalent natural disaster
- Preventing gang and gun violence using SMA (Social Media Analytics)
- Applying Deep Learning and AI to detect wildfires and help prevent the same
The advantages of ML are most apparent where a large dataset is available. Large-scale deployment of ML improves both velocity and accuracy. It helps capture non-linearity in the data and, from a supervised learning standpoint, generates a function mapping input to output. Many aspects of supervised, unsupervised and reinforcement learning can be applied. By and large, this enables better profiling of customers to understand their needs, serve them better and reduce attrition.
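The "function mapping input to output" idea can be made concrete with the simplest supervised learner, ordinary least squares on a toy dataset. This is a self-contained sketch with made-up numbers, not a production routine.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b: a minimal supervised
    learner that maps an input x to a predicted output y."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Toy supervised data: the output is 2 * input + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)  # recovers slope 2.0 and intercept 1.0 exactly on noiseless data
```

Non-linear methods (trees, boosting, neural networks) learn richer mappings, but the contract is the same: examples of inputs and outputs go in, a predictive function comes out.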
Level in Data and AI | Guideline or Checklist points |
---|---|
Level 1: Beginner | Level 1: Beginner Stage |
Level 2: Intermediate | Level 2: Intermediate Stage |
Below are some references that can be used for learning (by no means an exhaustive list)
- DV guidelines by Edward Tufte
- Storytelling with Data
- Information is Beautiful
- Junk Charts
- The Atlas
- The Pudding
- Flowing Data
- Visualising Data
Reference: Apache Superset (incubating), a modern, enterprise-ready business intelligence web application - https://github.com/apache/incubator-superset
- Towardsdatascience
- Elitedatascience
- Khan Academy
- OpenIntro
- Exam Solutions
- Seeing Theory
- OLI
- Class Central
- Alison
- Guru99
- Python for Beginners - 1
- Python for Beginners - 2
- Learn Python programming in 7 days, Guru99
- Pythonspot
- Code Academy
- TutorialsPoint
- The Python org
- Interactive Python
- Python Tutor
- Awesome Python
- Awesome Python Github Reference
- Full Stack Python
- CheckiO
- Google Datasets
- Data.world
- Kaggle datasets
- US Government Open Datasets
- FiveThirtyEight
- BuzzFeed
- Socrata OpenData
- UCI Machine Learning Repository
- Reddit or R/datasets
- Quandl
- Academic Torrents
- This is a great compilation of Awesome Public DataSets as well
Computer Vision is a sub-branch of AI whose objective is to give computers the powerful ability to understand their surroundings by seeing, more than by hearing or feeling, just as humans do; in a way, it mimics the human ability to interpret by learning certain aspects. Some applications of Computer Vision are as follows:
- Controlling processes
- Navigation
- Organizing set of information
- Automatic inspection
- Modeling objects or environments
- Detecting events
- Recognizing objects
- Recognizing actions
- Tracking objects in action
- Key methods to look for: word2vec, ELMo, ULMFiT, GPT, BERT, RoBERTa, GloVe, InferSent, skip-thought
- While using AWS Comprehend, try to understand how it works. Refer to AWS Comprehend here for NLP
Topic | Description | Remarks |
---|---|---|
ELMo | Embeddings from Language Models | Uses a bi-directional LSTM, trained on a huge text corpus, to look at a whole sentence before encoding a word |
ULMFiT | Universal Language Model Fine-Tuning | A transfer learning method for NLP tasks that demonstrated the techniques key to fine-tuning a language model |
GPT | Generative Pre-training Transformer (OpenAI) | |
BERT | Bi-directional Encoder Representations from Transformer | |
GloVe | Global Vectors for word representation | Unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a text corpus, and the resulting representations showcase interesting linear substructures of the word vector space |
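Several of the methods above, GloVe in particular, are trained on global word-word co-occurrence statistics. Here is a minimal sketch of collecting those counts over a context window; it is illustrative only and is not GloVe itself, which additionally fits weighted log-bilinear vectors to these counts.

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Count how often each ordered word pair appears within `window`
    tokens of each other - the raw statistics GloVe-style models use."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            counts[(w, tokens[j])] += 1
            counts[(tokens[j], w)] += 1
    return dict(counts)

tokens = "the cat sat on the mat".split()
counts = cooccurrence(tokens, window=2)
print(counts[("the", "cat")])  # "the" and "cat" co-occur once within the window
```

On a real corpus this matrix is huge and sparse; the embedding methods compress it into dense vectors whose geometry reflects those co-occurrence patterns.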
- Unsupervised Cross-lingual representative learning
- The State and Fate of linguistic diversity
- References to open datasets could be as follows:
These are program components used for mentoring purposes.
Classification of Web page content is vital to many tasks in Web information retrieval, such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges compared to traditional text classification; however, the interconnected nature of hypertext also provides features that can assist the process.
Here the task is to classify web pages into the classes they belong to, in a single-label classification setup (each web page can belong to only one class).
Given the complete HTML and URL, predict which of the nine predefined tags a web page belongs to:
- People profile
- Conferences/Congress
- Forums
- News article
- Clinical trials
- Publication
- Thesis
- Guidelines
- Others
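A common baseline for this kind of single-label setup is a bag-of-words multinomial Naive Bayes classifier. The sketch below uses tiny made-up training snippets (the texts are hypothetical; only the tag names come from the list above), ignoring the HTML/URL features a real solution would exploit.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, text). Returns per-label word counts and
    label document counts for a multinomial Naive Bayes baseline."""
    word_counts, label_counts = defaultdict(Counter), Counter()
    for label, text in docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict_nb(word_counts, label_counts, text):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with add-one (Laplace) smoothing."""
    vocab = {w for c in word_counts.values() for w in c}
    total_docs = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("News article", "breaking news story published today"),
        ("Clinical trials", "phase two trial patients randomized placebo")]
wc, lc = train_nb(docs)
print(predict_nb(wc, lc, "trial patients randomized"))  # -> Clinical trials
```

Stronger approaches would add URL tokens and HTML structure as features, or fine-tune a pretrained language model, but a smoothed Naive Bayes baseline is the standard first benchmark to beat.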
Applied Machine Learning videos reference: https://www.youtube.com/playlist?list=PL_pVmAaAnxIQGzQS2oI3OWEPT-dpmwTfA
Disclaimer: The information presented here is based on my own experiences, learnings and readings; it in no way represents any firm's or individual's opinion or strategy, and it is intended for self-learning only.