- Course: Data Science Project 1 COMP-4447-1
- Class time: Mon, Wed 07:00 PM - 08:50 PM | Engineering & Computer Science | Room 410
- Instructor: Pooran Singh Negi, pooran.negi@du.edu webpage
- GTA: Mitchell Wright
- Office: 470
- Office Hours: Tue, Thu, 2:00 p.m. - 4:00 p.m. Email for 1-on-1 help.
- Textbook: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, by Wes McKinney. It is available online through the library.
More to come
Please consult this GitHub page often for material related to this course, and check your e-mail periodically for messages. Assignments will be uploaded here and on Canvas.
The main objective of data science tools 1 is to learn various tools for performing data analysis. The focus in tools 1 is data cleanup, summarization, and visualization. It is more like a hacking skill set, but our primary focus will be on the scientific Python and Linux ecosystem. We’ll use Jupyter notebook/lab in class and for homework. This should make our learning interactive.
For the final project, students will work through individual or team projects applying coursework to the data lifecycle within a particular domain. The focus will also be on best data science/software engineering practices and reproducible work.
Please select a project of your choice by January 20th. You are allowed to work in a group of 2 to 3 students, but the project work must justify the team size. There will be a homework asking for the details of your final project, and we’ll provide feedback about its feasibility. Final projects may be based on initial capstone work. Please let us know if this is the case, as we need to go over the details.
This syllabus is subject to change at the discretion of the instructor.
- Jupyter Notebook for reproducible workflow.
- Data science and EDA.
- Git tools and workflow.
- Data science at the command prompt. Linux command line, bash, basic awk and sed.
- Data collection and ingestion (web scraping and reading datasets + Pandas).
- Data cleanup and imputation + Pandas.
- Data summarization and visualization + Pandas (groupby, apply, aggregate, etc.).
- Go over some topics as per student demand.
- more to come
The Linux command line and scientific Python (primarily numpy, matplotlib, requests, seaborn, and basic pandas) will be used throughout the course.
There will be coding/analysis homework assignments, a midterm, and a final project. We’ll drop your lowest assignment grade.
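The drop-the-lowest rule can be sketched in a few lines of Python. This is only an illustration, not the official grading script: the function name `homework_average` is hypothetical, and it assumes every homework carries equal weight and is scored on the same scale.

```python
def homework_average(scores):
    """Average homework scores after dropping the single lowest one.

    Assumption (not stated in the syllabus): all homeworks are weighted
    equally and scored on the same scale.
    """
    if not scores:
        raise ValueError("at least one homework score is required")
    if len(scores) == 1:
        # Assumption: with a single score there is nothing sensible to drop.
        return float(scores[0])
    kept = sorted(scores)[1:]  # discard exactly one lowest score
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(homework_average([90, 70, 85, 100]))
```

Sorting and slicing drops exactly one score even when the lowest value is tied, which matches "drop one" rather than "drop all copies of the minimum".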
There will be a final presentation of the final project. You will be required to submit a final project report in Jupyter notebook format.
Component | Weight |
---|---|
Coding homework | 50% |
Midterm, 13 Feb in class | 25% |
Final project presentation, 15 minutes, 13 March in class | 15% |
Final project report, due 15 March (please refer to the final report format above for submission guidelines) | 20% |
Grade range: [(‘A’, >=93), (‘A_minus’, >=89), (‘B_plus’, >=85), (‘B’, >=81), (‘B_minus’, >=77), (‘C_plus’, >=73), (‘C’, >=69), (‘C_minus’, >=65), (‘D_plus’, >=61), (‘D’, >=57), (‘D_minus’, >=53), (‘F’, <53)]
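As a rough sketch, the grade range can be read as a lookup from a numeric score to a letter grade. The names `GRADE_CUTOFFS` and `letter_grade` are hypothetical, and the code assumes the D_plus cutoff is inclusive (>= 61) like every other cutoff in the list.

```python
# Cutoffs copied from the grade range above, highest first.
# Assumption: the D_plus cutoff is inclusive (>= 61), like the others.
GRADE_CUTOFFS = [
    ("A", 93), ("A_minus", 89), ("B_plus", 85), ("B", 81),
    ("B_minus", 77), ("C_plus", 73), ("C", 69), ("C_minus", 65),
    ("D_plus", 61), ("D", 57), ("D_minus", 53),
]


def letter_grade(score):
    """Return the letter grade for a numeric course score."""
    # Walk the cutoffs from highest to lowest and stop at the first match.
    for grade, cutoff in GRADE_CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"  # anything below 53


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    for score in (95, 88.5, 61, 40):
        print(score, letter_grade(score))
```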
All members of the University of Denver community are expected to uphold the values of Integrity, Respect, and Responsibility. These values embody the standards of conduct for students, faculty, staff, and administrators as members of the University community. Our institutional values are defined as:
Integrity: acting in an honest and ethical manner;
Respect: honoring differences in people, ideas, experiences, and opinions;
Responsibility: accepting ownership for one’s own behavior and conduct.
Please respect DU’s Honor Yourself, Honor the Code.
Students with recognized disabilities will be provided reasonable accommodations, appropriate to the course, upon documentation of the disability with a Student Accommodation Form from the Disability Services Program. To receive these accommodations, you must request the specific accommodations by submitting them to the instructor in writing by the end of the first week of classes. Visit the Campus Life & Inclusive Excellence webpage for details.
Please see the registrar’s calendar for academic deadlines. We’ll strictly follow the deadlines.
- You can collect the dataset for your project yourself.
- Web scraping, web APIs (for natural language processing one can use the New York Times, Twitter, etc.)
- I am looking for noisy datasets for practice.
- See Datasets for data cleaning practice by Rachael Tatman
- Datasets for Data Mining and Data Science
- The EU Open Data Portal
- World Bank Open Data
- The home of the U.S. Government’s open data
We need to know your project and dataset before we can approve it as a final project.
More to come.
We want everybody to have the same experience using computational tools in data science tools 1. Please follow the steps for your operating system.
Please install Windows Subsystem for Linux (WSL) on Windows 10. Follow the instructions in the post Using Windows Subsystem for Linux for Data Science by Hugo Ferreira to install Linux. **Ignore the install Anaconda part.**
You can also watch this video to see the installation: Windows 10 Bash & Linux Subsystem Setup.
You can run echo $0 to check your current shell. Change to the bash shell using chsh -s /bin/bash
Once you are at the Linux/Mac bash command prompt, please follow the instructions below.
Please follow the instructions here to install python3 if it is not installed on your system. This link also covers Windows Subsystem for Linux (WSL) for Windows 10 (Creators or Anniversary Update). I am using Python 3.5.2. Hopefully any version of Python 3 should work.
Run the following commands from the command prompt.
- apt-get install python3-venv
- Using the command line (cd command), go to the folder where you want to keep the Python files and notebooks related to this course.
- run **python3 -m venv /path/to/new/virtual/environment**
  - e.g. I ran python3 -m venv dst1_env
- To activate your environment, run source /path/to/new/virtual/environment/bin/activate
  - e.g. from this course directory I run source dst1_env/bin/activate
- run python3 -m pip install --upgrade pip. Note that there are 2 dashes in the upgrade option.
- run wget https://raw.githubusercontent.com/psnegi/data_science_tools1/master/requirements.txt
- run pip install -r requirements.txt
- run jupyter notebook or jupyter lab.
- In the browser you should see your current files.
- Click on the notebook you want to run.
- Click on the RISE slideshow extension in the notebook if you want to see the notebook as a slideshow.
To deactivate the Python virtual environment, run deactivate
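If you are unsure whether the environment is currently active, a small Python check can help. This is a minimal sketch, assuming the environment was created with python3 -m venv as in the steps above; the function name `in_virtual_environment` is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True when Python is running from a venv-style environment."""
    # Inside an environment created with `python3 -m venv`, sys.prefix points
    # at the environment, while sys.base_prefix points at the base install.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtual_environment())
```

Save it to a file and run it with python3 before and after source dst1_env/bin/activate; the reported interpreter path should change to one inside dst1_env once the environment is active.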
You can also go to my Python for reproducible research GitHub repository and start by running the pythonBasic.ipynb notebook. I will go over the basics of Python and Jupyter notebooks.
- try python notebook online without installing anything
- Runs and visualizes your python code
- The Python Tutorial
- more to come
No late homework will be accepted.
HW no | Description and link | Due date |
---|---|---|
1 | Complete the questions in this notebook | Friday 18th Jan, 11:59 p.m. |
Date | Reading/Coding Assignments | Class activity |
---|---|---|
7 Jan | Install Jupyter environment<br>Python Virtual Environments | Mitchell covered the Jupyter introduction notebook and also helped with installation.<br>Covered Jupyter introduction and data science notebook. |
9 Jan | Resources to learn git<br>We’ll also go over data science | It may be time consuming to wait for a notebook to start via Binder every time.<br>Go to the folder for this course on your computer and run git clone https://github.com/psnegi/data_science_tools1.git.<br>Run the command ls. You should see the data_science_tools1 folder. Activate your virtual environment.<br>Navigate to the course directory using cd data_science_tools1. Change to the notebook directory using the command cd notebooks.<br>Now run jupyter notebook. You should see all the notebooks in a browser window. Click on the notebook you want to run.<br>To run a cell in the notebook, press alt+enter or ctrl+enter.<br>Note that whenever new content is posted, you must run git pull origin master from the data_science_tools1 directory to make sure you have the latest content. Don’t worry about the above git commands. We’ll start git in the next class. Please start with the git notebook.<br>I don’t like notebooks (Joel Grus video), provided by Laura Atkinson. |