1965Eric/HarvardX-PH125.6x-Data-Science-Wrangling

Data Science Wrangling

HarvardX: PH125.6x | Data Science: Wrangling

Abstract

This is the sixth course in the HarvardX Professional Certificate in Data Science, a series of courses that prepare you to do data analysis in R, from simple computations to machine learning. We assume that you have equivalent knowledge of R programming or have completed the first five courses in the series (Data Science: R Basics, Data Science: Visualization, Data Science: Probability, Data Science: Inference and Modeling, and Data Science: Productivity Tools), and we recommend finishing them before taking this course.

Using a combination of guided introduction through short video lectures and more independent in-depth exploration, you will get to practice your new R skills on real-life applications.

In this course, we cover several standard steps of the data wrangling process, such as importing data into R, tidying data, string processing, HTML parsing, working with dates and times, and text mining. A single analysis rarely needs all of these steps, but a data scientist will likely face each of them at some point.

In a data science project, data are often not readily available in a form suitable for analysis. More often the data sit in a file or a database, or must be extracted from documents such as web pages, tweets, or PDFs. In these cases, the first step is to import the data into R and tidy them using the tidyverse packages. The steps that convert data from their raw form to the tidy form are called data wrangling.
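As a rough illustration of what converting raw data to tidy form can look like, the sketch below reshapes a small wide table into one row per observation. The table, its column names (`country`, `rate`), and its values are hypothetical and made up for this example; they are not taken from the course materials.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-format data: one row per country, one column per year.
wide <- tibble(
  country = c("A", "B"),
  `2010` = c(1.5, 2.0),
  `2011` = c(1.6, 2.1)
)

# Tidy form: one row per country-year observation.
tidy <- wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "rate") %>%
  mutate(year = as.integer(year))  # year arrives as a character column name

print(tidy)
```

In the tidy version each variable (country, year, rate) has its own column, which is the layout most tidyverse functions expect.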

The class notes for this course series can be found in Professor Irizarry's freely available Introduction to Data Science book. The textbook is also freely available in PDF format on Leanpub. This course corresponds to textbook Chapter 20 through Chapter 26.

The bookdown version of this course is available on this GitHub Page.
