RCS Data Science and Machine Learning section, January 2020, in conjunction with Accenture
Build a complete data analysis pipeline using the Python ecosystem
- Define the problem
- Gather the raw data
- Process (clean) the data
- Explore
- Analyze (apply models, make predictions)
- Report and visualize results in a form understandable to stakeholders
- Git version control / command line
- Jupyter / Anaconda environment for Data Science
- Text Editors
- Built in Data Types
- Control Structures
- Functions and Classes
- List/Dictionary Comprehensions
- File Manipulation
- Advanced Concepts (Generators/Decorators)
- Useful Python standard library modules: collections, functools, etc.
- NumPy/Pandas
- SciPy.Stats
- Principles, types, CAP theorem
- Key-value DB, e.g., Redis
- Columnar DB, e.g., HBase, Cassandra
- Document DB, e.g., MongoDB
- Graph DB, e.g., Neo4j [some practical tasks on each]
- Get data, transform data
- Data preparation: preprocessing, tidy data
- Training Data / Testing Data / splitting
- Supervised / Unsupervised learning
- Classification
- Clustering
- Regression
- Dimensionality reduction (curse of dimensionality)
- Post-processing
- Visualization libraries in Python: Plotly, Matplotlib
- Building your own dashboards with the Flask web microframework
- Dashboards with Tableau / Power BI
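
As a rough illustration of the pipeline steps listed at the top (gather, process, split into training and testing data, apply a regression model, report), here is a minimal sketch using only the Python standard library. The data set (hours studied vs. exam score), the function names, and the 25% test fraction are all hypothetical choices made for this example and are not part of the course material; in the course itself these steps would more likely be done with NumPy, Pandas and a machine learning library.

```python
import random
import statistics

# Hypothetical raw data: (hours_studied, exam_score) pairs.
# None marks a missing value that the cleaning step must handle.
RAW = [(1, 52), (2, 55), (3, None), (4, 61), (5, 64),
       (6, 67), (None, 70), (8, 73), (9, 76), (10, 79)]


def clean(rows):
    """Process step: drop rows that contain a missing value."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_line(rows):
    """Analysis step: ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def run_pipeline(raw):
    """Run clean -> split -> fit -> evaluate and return a small report."""
    data = clean(raw)
    train, test = split(data)
    slope, intercept = fit_line(train)
    errors = [abs(slope * x + intercept - y) for x, y in test]
    return {
        "n_clean": len(data),
        "slope": slope,
        "intercept": intercept,
        "mean_abs_error": statistics.mean(errors),
    }


if __name__ == "__main__":
    # Report step: print each figure in a readable form.
    for name, value in run_pipeline(RAW).items():
        print(f"{name}: {value:.2f}")
```

The sample data is deliberately an exact straight line, so the fitted model should recover it with near-zero error on the held-out rows; real data would not be this tidy.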