Merge O'Reilly copyedits (#171)
* merge md files

* merge ipynb files
SamLau95 committed Jul 15, 2023
1 parent a60b81a commit e7bb752
Showing 146 changed files with 4,136 additions and 2,709 deletions.
55 changes: 35 additions & 20 deletions content/ch/01/lifecycle_cycle.ipynb
@@ -18,28 +18,30 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Stages of the Lifecycle"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"{numref}`Figure %s <ds-lifecycle>` shows the data science lifecycle.\n",
"It's split into four stages: ask a question, obtain data, \n",
"understand the data, and understand the world.\n",
"The lifecycle is divided into four stages: Ask a Question, Obtain Data, \n",
"Understand the Data, and Understand the World.\n",
"We've purposefully made these stages broad.\n",
"In our experience, the mechanics of the lifecycle change frequently.\n",
"Computer scientists and statisticians continue to build new software packages and programming languages\n",
"for working with data, and they develop new methodologies that are more specialized. \n",
"Despite these changes, we've found that almost every data project follows the four steps in this lifecycle.\n",
"The first step is to ask a question."
"Despite these changes, we've found that almost every data project consists of these four stages:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -48,17 +50,18 @@
"name: ds-lifecycle\n",
"---\n",
"\n",
"This diagram of the data science lifecycle shows four high-level stages.\n",
"The four high-level stages of the data science lifecycle.\n",
"The arrows indicate how the stages can lead into one another.\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Ask a Question.\n",
": Asking good questions lies at the heart of data science, and recognizing\n",
"Ask a Question\n",
": Asking good questions is at the heart of data science, and recognizing\n",
"different kinds of questions guides us in our analyses.\n",
"We cover four categories of questions:\n",
"descriptive, exploratory, inferential, and predictive.\n",
@@ -70,62 +73,74 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Obtain Data.\n",
": When data are expensive and hard to gather and when our aim is to generalize from the data to the world, we aim to define precise protocols for collecting the data. Other times, data are cheap and easily accessed.\n",
"Obtain Data\n",
": When data are expensive and hard to gather and when our goal is to generalize from the data to the world, we aim to define precise protocols for collecting the data. Other times, data are cheap and easily accessed.\n",
"This is especially true for online data sources.\n",
"For example, [Twitter](https://developer.twitter.com/en/docs/twitter-api) lets people quickly download millions of data points.\n",
"When data are plentiful, we can start an analysis by obtaining data, exploring it, and then honing a research question.\n",
"When data are plentiful, we can start an analysis by obtaining and exploring the data, and then honing a research question.\n",
"In both situations, most data have missing or unusual values and other anomalies that we need to account for. No matter the source, we need to check the data quality. Considering the scope of the data is equally important; for example, we identify how representative the data are and look for potential sources of bias in the collection process. These considerations help us determine how much faith we can place in our findings. And, typically, we must manipulate the data before we can analyze it more formally. We may need to modify structure, clean data values, and transform measurements to prepare for analysis."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Understand the Data.\n",
": After obtaining and preparing data, we want to carefully examine them, and *exploratory data analysis* is often key. In our explorations we make plots to uncover interesting patterns and summarize the data visually. We also continue to look for problems with the data.\n",
"Understand the Data\n",
": After obtaining and preparing data, we want to carefully examine them, and *exploratory data analysis* is often key. In our explorations, we make plots to uncover interesting patterns and summarize the data visually. We also continue to look for problems with the data.\n",
"As we search for patterns and trends, we use summary statistics and build statistical models, like linear and logistic regression.\n",
"In our experience, this stage of the lifecycle is highly iterative.\n",
"Understanding the data can also lead us back to earlier stages in the data science lifecycle. We may find that we need to modify or redo the data cleaning and manipulation, acquire more data to supplement our analysis, or refine our research question given the limitations of the data. The descriptive and exploratory analyses that we carry out in this stage may adequately answer our question, or, we may need to go on to the next stage in order to make generalizations beyond our data."
"Understanding the data can also lead us back to earlier stages in the data science lifecycle. We may find that we need to modify or redo the data cleaning and manipulation, acquire more data to supplement our analysis, or refine our research question given the limitations of the data. The descriptive and exploratory analyses that we carry out in this stage may adequately answer our question, or we may need to go on to the next stage in order to make generalizations beyond our data."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Understand the World.\n",
": When our goals are purely descriptive or exploratory, then the analysis ends at the Understand the Data stage of the lifecycle. \n",
"Understand the World\n",
": When our goals are purely descriptive or exploratory, the analysis ends at the Understand the Data stage of the lifecycle. \n",
"At other times, we aim to quantify how well the trends we find generalize beyond our data. \n",
"We may want to use a model that we have fitted to our data to make inferences about the world or give predictions for future observations. \n",
"We may want to use a model that we have fit to our data to make inferences about the world or give predictions for future observations. \n",
"To draw inferences from a sample to a population, we use\n",
"statistical techniques like A/B testing and confidence intervals.\n",
"And to make predictions for future observations, we create prediction intervals and use test/train splits of the data. "
"And to make predictions for future observations, we create prediction intervals and use train-test splits of the data. "
]
},
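The Understand the World cell above mentions train-test splits as a way to judge predictions for future observations. A minimal sketch of that idea follows, using only the Python standard library. The helper names (`train_test_split`, `fit_line`, `mean_squared_error`) and the simulated data are assumptions made for illustration; they are not taken from the book or from this commit.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared prediction error of the line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Simulated data for illustration only: y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(40)]

# Fit on the training rows only, then measure error on the held-out rows.
train, test = train_test_split(data, test_fraction=0.25, seed=0)
slope, intercept = fit_line(train)
test_mse = mean_squared_error(test, slope, intercept)
print(f"train={len(train)} test={len(test)} test MSE={test_mse:.3f}")
```

Because the model never sees the held-out rows during fitting, the test error gives a rough sense of how well the fitted line might predict new observations.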
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"Understanding the difference between exploration, inference, prediction, and causation can be a challenge. \n",
"Understanding the differences between exploration, inference, prediction, and causation can be a challenge. \n",
"We can easily slip into confusing a correlation found in data with a causal relationship. \n",
"For example, an exploratory or inferential analysis might look for correlations in response to the question, \"Do people who have a greater exposure to air pollution have a higher rate of lung disease?\" Whereas a causal question might ask \"Does giving an award to a Wikipedia contributor increase productivity?\" We typically cannot answer causal questions unless we have a randomized experiment (or approximate one). We point out these important distinctions throughout the book.\n",
"For example, an exploratory or inferential analysis might look for correlations in response to the question \"Do people who have a greater exposure to air pollution have a higher rate of lung disease?\" Whereas a causal question might ask \"Does giving an award to a Wikipedia contributor increase productivity?\" We typically cannot answer causal questions unless we have a randomized experiment (or approximate one). We point out these important distinctions throughout the book.\n",
"\n",
":::"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For each stage of the lifecycle, we explain theoretical concepts, introduce data technologies and statistical methodologies, and show how they work in practical examples.\n",
"Throughout, we rely on authentic data and analyses by other data scientists, not made-up data, so you can learn how to perform your own data acquisition, cleaning, exploration, and formal analyses, and draw sound conclusions. Each chapter in this book tends to focus on one stage of the data science life cycle, but we also include chapters with case studies that demonstrate the full lifecycle. "
"Throughout, we rely on authentic data and analyses by other data scientists, not made-up data, so you can learn how to perform your own data acquisition, cleaning, exploration, and formal analyses, and draw sound conclusions. Each chapter in this book tends to focus on one stage of the data science lifecycle, but we also include chapters with case studies that demonstrate the full lifecycle. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
7 changes: 5 additions & 2 deletions content/ch/01/lifecycle_intro.ipynb
@@ -18,22 +18,24 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(ch:lifecycle)=\n",
"# The Data Science Lifecycle\n",
"\n",
"Data science is a rapidly evolving field.\n",
"At the time of this writing people are still trying to pin down exactly\n",
"At the time of this writing, people are still trying to pin down exactly\n",
"what data science is, what data scientists do, and what skills data \n",
"scientists should have.\n",
"What we do know, though, is that data science uses a combination of \n",
"methods and principles from statistics and computer science to work with and draw insights from data.\n",
"And, learning computer science and statistics in combination makes us better data scientists. We also know that any insights we glean need to be interpreted in the context of the problem that we are working on."
"And learning computer science and statistics in combination makes us better data scientists. We also know that any insights we glean need to be interpreted in the context of the problem that we are working on."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -44,6 +46,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
11 changes: 7 additions & 4 deletions content/ch/01/lifecycle_map.ipynb
@@ -18,29 +18,32 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Examples of the Lifecycle"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Several case studies that address the entire data science lifecycle are placed throughout this book. \n",
"These cases serve double duty. They focus on one stage in the lifecycle to provide a specific example of the topics in the part of the book that they are located, and they also demonstrate the entire cycle. \n",
"These cases serve double duty. They focus on one stage in the lifecycle to provide a specific example of the topics in the part of the book where they are located, and they also demonstrate the entire cycle. \n",
" \n",
"The focus of {numref}`Chapter %s <ch:bus>` is on the interplay between a question of interest and how data can be used to answer the question. The simple question, \"why is my bus always late?\" provides a rich case study that is basic enough for the beginning data scientist to track the stages of the life cycle, and yet, nuanced enough to demonstrate how we apply both statistical and computational thinking to answer the question. In this case study, we build a simulation study to inform us about the distribution of wait times for riders. And we fit a simple model to summarize the wait times with a statistic. This case study also demonstrates how, as a data scientist, you can collect your own data to answer questions that interest you. \n",
"The focus of {numref}`Chapter %s <ch:bus>` is on the interplay between a question of interest and how data can be used to answer the question. The simple question \"Why is my bus always late?\" provides a rich case study that is basic enough for the beginning data scientist to track the stages of the lifecycle, and yet nuanced enough to demonstrate how we apply both statistical and computational thinking to answer the question. In this case study, we build a simulation study to inform us about the distribution of wait times for riders. And we fit a simple model to summarize the wait times with a statistic. This case study also demonstrates how, as a data scientist, you can collect your own data to answer questions that interest you. \n",
"\n",
"{numref}`Chapter %s <ch:pa>` studies the accuracy of mass-market air sensors that are used across the United States. We devise a way to leverage data from highly accurate sensors maintained by the Environmental Protection Agency to improve readings from less expensive sensors. This case study shows how crowd-sourced, open data can be improved with data from rigorously maintained, precise government-monitored equipment. In the process, we focus on cleaning and merging data from multiple sources, but we also fit models to adjust and improve air quality measurements.\n",
"{numref}`Chapter %s <ch:pa>` studies the accuracy of mass-market air sensors that are used across the United States. We devise a way to leverage data from highly accurate sensors maintained by the Environmental Protection Agency to improve readings from less expensive sensors. This case study shows how crowdsourced, open data can be improved with data from rigorously maintained, precise, government-monitored equipment. In the process, we focus on cleaning and merging data from multiple sources, but we also fit models to adjust and improve air quality measurements.\n",
"\n",
"In {numref}`Chapter %s <ch:donkey>` our focus is on model building and prediction. But, we cover the full lifecycle, and see how the question of interest impacts the model that we build. Our aim is to enable veterinarians in rural Kenya, who have no access to a scale to weigh a donkey, prescribe medication for a sick animal. As we learn about the design of the study, clean the data, and balance simplicity with accuracy, we assess the predictive capabilities of our model and show how scientists can partner with people facing practical problems and assist them with solutions.\n",
"In {numref}`Chapter %s <ch:donkey>` our focus is on model building and prediction. But we cover the full lifecycle and see how the question of interest impacts the model that we build. Our aim is to enable veterinarians in rural Kenya, who have no access to a scale to weigh a donkey, to prescribe medication for a sick animal. As we learn about the design of the study, clean the data, and balance simplicity with accuracy, we assess the predictive capabilities of our model and show how scientists can partner with people facing practical problems and assist them with solutions.\n",
"\n",
"Finally, in {numref}`Chapter %s <ch:fake_news>` we examine hand-classified news stories in an effort to algorithmically distinguish fake news from real news. In this case study, we again see how readily accessible information creates amazing opportunities for data scientists to develop new technologies and investigate today's important problems. These data have been scraped from news stories on the web and classified as fake or real news by people reading the stories. We also see how data scientists thinking creatively can take general information, such as the content of a news article, and transform it into analyzable data to address topical questions."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
3 changes: 2 additions & 1 deletion content/ch/01/lifecycle_summary.ipynb
@@ -18,6 +18,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -26,7 +27,7 @@
"\n",
"The data science lifecycle provides an organizing structure for this book. We keep the lifecycle in mind as we work with many datasets from a wide range of sources, including science, medicine, politics, social media, and government. The first time we use a dataset, we provide the context in which the data were collected, the question of interest in examining the data, and descriptions needed to understand the data. In this way, we aim to practice good data science throughout the book. \n",
"\n",
"The first stage of the lifecycle--asking a question--is often seen in books as a question that requires an application of a technique to get a number, such as, \"What's the p-value for this A/B test?\" Or, a vague question that is often seen in practice, like \"Can we restore the American Dream?\" Answering the first sort of question gives little practice in developing a research question. Answering the second is hard to do without guidance on how to turn a general area of interest into a question that can be answered with data. The interplay between asking a question and understanding the limitations of data to answer it is the topic of the next chapter."
"The first stage of the lifecycle, asking a question, is often seen in books as a question that requires an application of a technique to get a number, such as \"What's the p-value for this A/B test?\" Or a vague question that is often seen in practice, like \"Can we restore the American Dream?\" Answering the first sort of question gives little practice in developing a research question. Answering the second is hard to do without guidance on how to turn a general area of interest into a question that can be answered with data. The interplay between asking a question and understanding the limitations of data to answer it is the topic of the next chapter."
]
},
{
