incorporating MP and TR comments in Chapter 9
debnolan committed Apr 28, 2023
1 parent 0638a75 commit f56e6cf
Showing 7 changed files with 317 additions and 258 deletions.
88 changes: 61 additions & 27 deletions content/ch/09/wrangling_checks.ipynb
@@ -50,20 +50,30 @@
"(ch:wrangling_checks)=\n",
"# Quality Checks\n",
"\n",
"Once your data are in a table and you understand the scope and granularity, it's time to inspect for quality. You may have come across errors in the source as you examined and wrangled the file into a data frame. In this section, we describe how to continue this inspection and carry out a more comprehensive assessment of the quality of the features and their values. We consider quality from four vantage points: \n",
"Once your data are in a table and you understand the scope and granularity, it's time to inspect for quality. You may have come across errors in the source as you examined and wrangled the file into a data frame. In this section, we describe how to continue this inspection and carry out a more comprehensive assessment of the quality of the features and their values. We consider data quality from four vantage points: \n",
"\n",
"+ Scope: Do the data match your understanding of the population? \n",
"+ Measurements and Values: Are the values reasonable?\n",
"+ Relationships: Are related features in agreement?\n",
"+ Analysis: Which features might be useful in a future analysis? "
"Scope\n",
": Do the data match your understanding of the population? \n",
"\n",
"Measurements and Values\n",
": Are the values reasonable?\n",
"\n",
"Relationships\n",
": Are related features in agreement?\n",
"\n",
"Analysis\n",
": Which features might be useful in a future analysis? \n",
"\n",
"We describe each of these points in turn, beginning with scope."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quality based on scope \n",
": In {numref}`Chapter %s <ch:data_scope>`, we addressed whether or not the data that have been collected can adequately address the problem at hand. There, we identified the target population, access frame, and sample in collecting the data.\n",
"## Quality based on scope \n",
"\n",
"In {numref}`Chapter %s <ch:data_scope>`, we addressed whether or not the data that have been collected can adequately address the problem at hand. There, we identified the target population, access frame, and sample in collecting the data.\n",
"That framework helps us consider possible limitations that might impact the generalizability of our findings.\n",
"\n",
"While these broader data scope considerations are important\n",
@@ -81,13 +91,13 @@
{
"data": {
"text/plain": [
"94621 1\n",
"92672 1\n",
"64110 1\n",
"94120 1\n",
"94066 1\n",
" ..\n",
"94621 1\n",
"941033148 1\n",
"941102019 1\n",
"Ca 1\n",
"941 1\n",
"Name: postal_code, Length: 10, dtype: int64"
]
},
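The scope check behind the output above can be sketched in pandas. This is a minimal sketch: the data frame `bus` and its values are hypothetical stand-ins that mimic the problems visible in the output, and we take a code in scope to be exactly five digits.

```python
import pandas as pd

# Hypothetical stand-in for the business data; the odd values mimic
# the problems visible in the value counts above.
bus = pd.DataFrame(
    {"postal_code": ["94621", "94110", "941033148", "Ca", "941", "94066"]}
)

# A postal code in scope should be exactly five digits
valid = bus["postal_code"].str.fullmatch(r"\d{5}")

# The records that fall outside our understanding of the population
suspect = bus.loc[~valid, "postal_code"]
print(suspect.tolist())  # → ['941033148', 'Ca', '941']
```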
@@ -104,22 +114,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This cross-check with scope helps us spot potential problems."
"This verification using scope helps us spot potential problems."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As another example, a bit of background reading on atmospheric CO2 reveals that typical measurements are about 400 ppm worldwide. So we can check whether the monthly averages of CO2 at Mauna Loa lie between 300 to 450 ppm."
    "As another example, a bit of background reading at [climate.gov](https://www.climate.gov/) and [NOAA](https://www.noaa.gov/news-release/carbon-dioxide-now-more-than-50-higher-than-pre-industrial-levels) on the topic of atmospheric CO2 reveals that typical measurements are about 400 ppm worldwide. So we can check whether the monthly averages of CO2 at Mauna Loa lie between 300 and 450 ppm.\n",
"\n",
"Next, we check data values against codebooks and the like. "
]
},
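A range check like this is essentially one line in pandas. A sketch, with invented values and a hypothetical column name `avg`:

```python
import pandas as pd

# Hypothetical monthly averages; a real check would use the Mauna Loa series
co2 = pd.DataFrame({"avg": [313.2, 355.6, 402.1, 417.9, 199.9]})

# Flag monthly averages outside the plausible 300-450 ppm range
out_of_range = co2[~co2["avg"].between(300, 450)]
print(out_of_range)  # the row with 199.9 would need investigation
```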
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quality of measurements and recorded values\n",
": We can use also check the quality of measurements by considering what might be a reasonable value for a feature. For example, imagine what might be a reasonable range for the number of violations in a restaurant inspection. Possibly, 0 to 5. Other checks can be based on common knowledge of ranges: a restaurant inspection score must be between 0 and 100; months run between 1 and 12. We can use documentation to tells us the expected values for a feature. For example, the type of emergency room visit in the DAWN survey, introduced in {numref}`Chapter %s <ch:files>`, has been coded as 1, 2, ..., 8 (see {numref}`Figure %s <DAWN_codebook>`). So, we can confirm that all values for the type of visit are indeed integers between 1 and 8."
"## Quality of measurements and recorded values\n",
"\n",
    "We can also check the quality of measurements by considering what might be a reasonable value for a feature. For example, imagine what might be a reasonable range for the number of violations in a restaurant inspection. Possibly, 0 to 5. Other checks can be based on common knowledge of ranges: a restaurant inspection score must be between 0 and 100; months run between 1 and 12. We can use documentation to tell us the expected values for a feature. For example, the type of emergency room visit in the DAWN survey, introduced in {numref}`Chapter %s <ch:files>`, has been coded as 1, 2, ..., 8 (see {numref}`Figure %s <DAWN_codebook>`). So, we can confirm that all values for the type of visit are indeed integers between 1 and 8."
]
},
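That confirmation might be sketched as follows (the data frame `dawn` and its values are hypothetical stand-ins for the survey records):

```python
import pandas as pd

# Hypothetical stand-in for the DAWN records; CASETYPE is coded 1 through 8
dawn = pd.DataFrame({"type": [1, 3, 8, 2, 5, 3, 7]})

# Every visit type should be an integer code from 1 to 8
codes_ok = dawn["type"].isin(range(1, 9)).all()
print(codes_ok)  # → True
```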
{
@@ -131,7 +144,7 @@
"name: DAWN_codebook\n",
"---\n",
"\n",
"Screenshot of the description of the CASETYPE variable in the DAWN survey. Notice that there are eight possible values for this feature. And to help in figuring out if we have properly read the data, we can check the counts for these eight values. \n",
"Screenshot of the description of the CASETYPE variable in the DAWN survey. Notice that there are eight possible values for this feature. And to help in figuring out if we have properly read the data, we can check the counts for these eight values. (The typo SUICICDE appears in the actual codebook.)\n",
"```"
]
},
@@ -148,8 +161,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Quality across related features\n",
": At times two features have builtin conditions on their values that we can cross-check against other features. For example, according to the documentation for the DAWN study, alcohol consumption is only considered a valid reason for a visit to the ER for patients under 21 so we can check that any record that records alcohol for the type of visit has an age under 21. A cross-tabulation of the features `type` and `age` can confirm this constraint is met."
"## Quality across related features\n",
"\n",
    "At times, two features have built-in conditions on their values that we can cross-check for internal consistency. For example, according to the documentation for the DAWN study, alcohol consumption is only considered a valid reason for a visit to the ER for patients under 21, so we can check that any record with \"alcohol\" for the type of visit has an age under 21. A cross-tabulation of the features `type` and `age` can confirm this constraint is met:"
]
},
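A sketch of such a cross-tabulation with `pd.crosstab` (the values here are invented; in the actual survey, `type` code 3 is alcohol and `age` codes 1 through 4 are under 21):

```python
import pandas as pd

# Invented records: type 3 = alcohol visit; age codes 1-4 mean under 21
dawn = pd.DataFrame(
    {"age": [2, 5, 1, 3, 6, 4], "type": [3, 4, 3, 3, 8, 3]}
)

# Counts of every (age, type) combination
counts = pd.crosstab(dawn["age"], dawn["type"])
print(counts)

# The constraint: alcohol cases appear only for age codes 1-4
alcohol_ages = dawn.loc[dawn["type"] == 3, "age"]
assert (alcohol_ages <= 4).all()
```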
{
@@ -367,17 +381,23 @@
"source": [
    "The cross-tabulation confirms that all of the alcohol cases (`type` is 3) have an age under 21 (these are coded as 1, 2, 3, and 4). The data values are as expected. \n",
"\n",
"One last type of quality check pertains to the amount of information found in a feature. \n",
"One last type of quality check pertains to the amount of information found in a feature. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quality for analysis\n",
"\n",
"Quality for analysis\n",
": Even when data pass the previous quality checks, problems can arise with its usefulness. For example, if all but a handful of values for a feature are identical, then that feature adds little to the understanding of underlying patterns and relationships. Or, if there are too many missing values, especially if there is a discernible pattern in the missing values, our findings may be limited. And, if a feature has many bad/corrupted values, then we might question the accuracy of even those values that fall in the appropriate range."
    "Even when data pass the previous quality checks, problems can arise with their usefulness. For example, if all but a handful of values for a feature are identical, then that feature adds little to the understanding of underlying patterns and relationships. Or, if there are too many missing values, especially if there is a discernible pattern in the missing values, our findings may be limited. Plus, if a feature has many bad/corrupted values, then we might question the accuracy of even those values that fall in the appropriate range."
]
},
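Both of the first two situations are easy to quantify. A sketch, with invented columns:

```python
import numpy as np
import pandas as pd

# Invented features: one nearly constant, one mostly missing
df = pd.DataFrame(
    {
        "nearly_constant": ["routine"] * 99 + ["complaint"],
        "mostly_missing": [1.0] + [np.nan] * 99,
    }
)

# Share of rows taken up by the most common value
top_share = df["nearly_constant"].value_counts(normalize=True).iloc[0]

# Fraction of values that are missing
missing_frac = df["mostly_missing"].isna().mean()

print(top_share, missing_frac)  # both 0.99: candidates for dropping
```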
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see below that the type of restaurant inspection in San Francisco can be either routine or from a complaint. Since only one of the 14,000+ inspections was from a complaint, we lose little if we drop this feature, and we might also want to drop that single inspection as it represents an anomaly."
"We see below that the type of restaurant inspection in San Francisco can be either routine or from a complaint. Since only one of the 14,000+ inspections was from a complaint, we lose little if we drop this feature, and we might also want to drop that single inspection since it represents an anomaly:"
]
},
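Dropping both the near-constant feature and the anomalous record might look like this sketch (the data frame and its values are invented):

```python
import pandas as pd

# Invented inspections: a single complaint among routine inspections
insp = pd.DataFrame(
    {
        "type": ["routine"] * 5 + ["complaint"],
        "score": [90, 84, 98, 76, 88, 61],
    }
)

# Keep only the routine inspections, then drop the now-constant feature
insp = insp[insp["type"] == "routine"].drop(columns="type")
print(insp.shape)  # → (5, 1)
```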
{
@@ -413,29 +433,43 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Essentially you have four options: leave the data as is; modify values; remove features; or drop records. Not every unusual aspect of the data needs to be fixed. You might have discovered a characteristic of your data that will inform you about how to do your analysis and otherwise does not need correcting. Or, you might find that the problem is relatively minor and most likely will not impact your analysis so you can leave the data as is. "
"## Fixing the Data or Not\n",
"\n",
"When you uncover problems with the data, essentially you have four options: leave the data as is; modify values; remove features; or drop records. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Leave it as is\n",
": Not every unusual aspect of the data needs to be fixed. You might have discovered a characteristic of your data that will inform you about how to do your analysis and otherwise does not need correcting. Or, you might find that the problem is relatively minor and most likely will not impact your analysis so you can leave the data as is. Or, you might want to replace corrupted values with `NaN`.\n",
"\n",
"Modify individual values\n",
": If you have figured out what went wrong and can correct the value, then you can opt to change it. In this case, it's good practice to create a new feature with the modified value and preserve the original feature, like in the CO2 example."
]
},
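A sketch of this practice follows; the sentinel value -99.99 and the column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Invented readings; assume -99.99 is a sentinel for a failed measurement
co2 = pd.DataFrame({"avg": [313.2, -99.99, 355.6]})

# Put the corrected values in a new column, preserving the original
co2["avg_fixed"] = co2["avg"].replace(-99.99, np.nan)

# Record which rows were modified, for later sensitivity checks
co2["avg_modified"] = co2["avg"] != co2["avg_fixed"]
print(co2["avg_modified"].tolist())  # → [False, True, False]
```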
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On the other hand, you might want to replace corrupted values with `NaN`, or you might have figured out what went wrong and correct the value. Other possibilities for modifying records are covered in the examples of {numref}`Chapter %s <ch:eda>`.\n",
"If you plan to change the values of a variable, then it's good practice to create a new feature with the modified value and preserve the original feature, or at a minimum, create a new feature that indicates which values in the original feature have been modified. These approaches give you some flexibility in checking the influence of the modified values on your analysis. "
"Remove a column\n",
": If many values in a feature have problems, then consider eliminating that feature entirely. Rather than exclude a feature, there may be a transformation that allows you to keep the feature while reducing the level of detail recorded."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you find yourself modifying many values in a feature, then you might consider eliminating that feature entirely. Either way, you will want to study the possible impact of excluding the feature from your analysis. In particular, you will want to determine whether the records with corrupted values are similar to each other, and different from the rest of the data. This would indicate that you may be unable to capture the impact of a potentially useful feature in your analysis. Rather than exclude the feature entirely, there may be a transformation that allows you to keep the feature while reducing the level of detail recorded."
"Drop records\n",
    ": In general, we do not want to drop a large number of observations from a dataset without good reason. Instead, try to scale back your investigation to a particular subgroup of the data that is clearly defined by some criteria and does not simply correspond to dropping records with corrupted values. When you discover that an unusual value is in fact correct, you still might decide to exclude the record from your analysis because it's so different from the rest of your data and you do not want it to overly influence your analysis. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At times, you may want to eliminate the problematic records. In general, we do not want to drop a large number of observations from a dataset without good reason. You may want to scale back your investigation to a particular subgroup of the data, but that's a different situation than dropping records because of a corrupted value in a field. When you discover that an unusual value is in fact correct, you still might decide to exclude the record from your analysis because it's so different from the rest of your data and you do not want it to overly influence your analysis. "
    "Whatever approach you take, you will want to study the possible impact on your analysis of the changes that you make. For example, try to determine whether the records with corrupted values are similar to each other and different from the rest of the data."
]
},
{
211 changes: 173 additions & 38 deletions content/ch/09/wrangling_co2.ipynb

Large diffs are not rendered by default.

19 changes: 13 additions & 6 deletions content/ch/09/wrangling_intro.ipynb
@@ -33,8 +33,8 @@
"analysis. The amount of preparation can vary widely, but there are a few basic steps\n",
"to move from raw data to data ready for analysis. {numref}`Chapter %s <ch:files>`\n",
"addressed the initial steps of creating a data frame from a plain text\n",
"source. In this chapter, we assess quality. We perform validity checks on individual data\n",
"values and entire columns. In addition to checking the quality of the data, we learn\n",
"source. In this chapter, we assess quality. To do this, we perform validity checks on individual data\n",
"values and entire columns. In addition to checking the quality of the data, we determine\n",
"whether or not the data need to be transformed and reshaped to get ready for\n",
"analysis. Quality checking (and fixing) and transformation are often cyclical:\n",
"the quality checks point us toward transformations we need to make, and when we\n",
@@ -48,7 +48,7 @@
"source": [
"Depending on the data source, we often have different expectations for quality.\n",
"Some datasets require extensive wrangling to get them into an analyzable form,\n",
"and other datasets arrive clean and we can quickly launch into modeling. Below\n",
"and others arrive clean and we can quickly launch into modeling. Below\n",
"are some examples of data sources and how much wrangling we might expect to do. "
]
},
@@ -64,7 +64,7 @@
" data describing how the data are collected and formatted, and these datasets\n",
" are also typically ready for exploration and analysis right out of the \"box\".\n",
"- Administrative data can be clean, but without inside knowledge of the source\n",
" we often need to extensively check their quality. Also, since we often\n",
" we may need to extensively check their quality. Also, since we often\n",
" use these data for a purpose other than why they were collected in the first place, we\n",
" may need to transform features or combine data tables.\n",
"- Informally collected data, such as data scraped from the Web, can be quite\n",
@@ -77,7 +77,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this chapter, we break down data wrangling into the following stages: assess data quality; transform features; and reshape the data by modifying its structure and granularity. An important step in assessing the quality of the data is to consider its scope. Data scope was covered in {numref}`Chapter %s <ch:data_scope>`, and we refer you there for a fuller treatment of the topic. "
"In this chapter, we break down data wrangling into the following stages: assess data quality; handle missing values; transform features; and reshape the data by modifying its structure and granularity. An important step in assessing the quality of the data is to consider its scope. Data scope was covered in {numref}`Chapter %s <ch:data_scope>`, and we refer you there for a fuller treatment of the topic. "
]
},
{
@@ -91,8 +91,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We begin by introducing these data wrangling concepts through an example."
    "We use the datasets introduced in {numref}`Chapter %s <ch:files>`: the DAWN government survey of emergency room visits related to drug abuse; and the San Francisco administrative data on food safety inspections of restaurants. But we begin by introducing the various data wrangling concepts through another example that is simple and clean enough to let us focus on each wrangling step in turn."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
