TR changes and MP edits to chapters 13, 14, and 18
debnolan committed May 8, 2023
1 parent b710b19 commit 7c8a383
Showing 18 changed files with 535 additions and 449 deletions.
4 changes: 2 additions & 2 deletions content/ch/12/pa_intro.ipynb
@@ -72,7 +72,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In contrast, [PurpleAir](https://www2.purpleair.com/) sensors sell for about \\$250 and can be easily installed at home.\n",
"In contrast, [PurpleAir](https://www2.purpleair.com/) sensors, which we first introduced in {numref}`Chapter %s <ch:theory_datadesign>`, sell for about \\$250 and can be easily installed at home.\n",
"With the lower price point, thousands of people across the US have purchased these sensors for personal use. The sensors can connect to a home WiFi network so the air quality can be easily monitored, and they can report data back to PurpleAir.\n",
"In 2020, thousands of owners of PurpleAir sensors made their sensors' measurements publicly available.\n",
"Compared to the AQS sensors, PurpleAir sensors are more timely. They report measurements every two minutes rather than every hour.\n",
@@ -86,7 +86,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this chapter we plan to use the AQS sensor measurements to improve the PurpleAir measurements. It's a big task, and we follow the analysis first developed by [Karoline Barkjohn, Brett Gannt, and Andrea Clements](https://amt.copernicus.org/articles/14/4617/2021/) from the US Environmental Protection Agency.\n",
"In this chapter, we plan to use the AQS sensor measurements to improve the PurpleAir measurements. It's a big task, and we follow the analysis first developed by [Karoline Barkjohn, Brett Gannt, and Andrea Clements](https://amt.copernicus.org/articles/14/4617/2021/) from the US Environmental Protection Agency.\n",
"The work of Barkjohn's group was so successful that, as of this writing, official US government maps, like the AirNow [Fire and Smoke](https://fire.airnow.gov/) map, include both AQS and PurpleAir sensors and apply Barkjohn's correction to the PurpleAir data."
]
},
15 changes: 11 additions & 4 deletions content/ch/13/text_examples.ipynb
@@ -27,7 +27,9 @@
"motivating example. These examples are based on real tasks that we have carried\n",
"out, but to focus on the concept, we've reduced the data to snippets.\n",
"\n",
"*Convert text into a standard format.* Let's say we want to study connections\n",
"## Convert text into a standard format \n",
"\n",
"Let's say we want to study connections\n",
"between population demographics and election results.\n",
"To do this, we've taken election data from Wikipedia and population data from the US Census.\n",
"The granularity of the data is the county level, and we need to use the county names to join the tables.\n",
@@ -170,7 +172,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"*Extract a piece of text to create a feature.*\n",
"## Extract a piece of text to create a feature\n",
"\n",
"Text data sometimes has a lot of structure, especially when it was generated\n",
"by a computer.\n",
"As an example, we've displayed a web server's log entry below.\n",
@@ -194,7 +197,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"*Transform text into features.* In {numref}`Chapter %s <ch:wrangling>`, we\n",
"## Transform text into features\n",
"\n",
"In {numref}`Chapter %s <ch:wrangling>`, we\n",
"created a categorical feature based on the content of the strings. There, we examined the\n",
"descriptions of restaurant violations and we created nominal variables for the\n",
"presence of particular words.\n",
@@ -231,7 +236,9 @@
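To make this concrete, here is a small sketch in `pandas` of building indicator variables for the presence of particular words; the violation descriptions below are invented for illustration and differ from the chapter's actual data.

```python
import pandas as pd

# Invented restaurant-violation descriptions, for illustration only.
desc = pd.Series([
    "unclean or degraded floors walls or ceilings",
    "improperly stored food",
])

# One nominal (0/1) feature per word of interest.
features = pd.DataFrame({
    word: desc.str.contains(word).astype(int) for word in ["food", "floor"]
})
print(features)
```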
"cell_type": "markdown",
"metadata": {},
"source": [
"*Text analysis.* Sometimes we want to compare entire documents.\n",
"## Text analysis\n",
"\n",
"Sometimes we want to compare entire documents.\n",
"For example, the US President gives a State of the Union speech every year. Here are the first few lines of the very first speech:"
]
},
80 changes: 47 additions & 33 deletions content/ch/13/text_regex.ipynb
@@ -169,6 +169,8 @@
"id": "JWMIE3Dz3AI4"
},
"source": [
"These richer patterns are made of character classes and meta characters like wildcards. We describe them here. \n",
"\n",
"Character Classes\n",
": We can make patterns more flexible by using a *character class* \n",
"(also known as a *character set*), which\n",
@@ -402,7 +404,7 @@
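As a toy illustration (ours, not the chapter's data), a character class matches any single character listed between the brackets:

```python
import re

# [0-9] matches exactly one digit, so three classes in a row
# match any three-digit run.
print(re.findall(r"[0-9][0-9][0-9]", "call 911 or 311"))
```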
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 6,
"metadata": {
"colab": {
"autoexec": {
@@ -449,7 +451,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will show how quantifiers can help create a more compact and clear\n",
"Next, we show how quantifiers can help create a more compact and clear\n",
"regular expression for SSNs. "
]
},
@@ -557,7 +559,7 @@
},
"source": [
"A quantifier always modifies the character or character class to its immediate\n",
"left. The following table shows the complete syntax for quantifiers."
"left. {numref}`Table %s <quantifier-ex>` shows the complete syntax for quantifiers."
]
},
{
@@ -567,12 +569,17 @@
"id": "4HXw_UbI3AJY"
},
"source": [
":::{table} Quantifier Examples\n",
":name: quantifier-ex\n",
"\n",
"Quantifier | Meaning\n",
"--- | ---\n",
"{m,n} | Match the preceding character m to n times.\n",
"{m} | Match the preceding character exactly m times.\n",
"{m,} | Match the preceding character at least m times.\n",
"{,n} | Match the preceding character at most n times."
"{,n} | Match the preceding character at most n times.\n",
"\n",
":::"
]
},
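Applied to the SSN example, braces give a compact pattern. This is a sketch of the idea and may differ slightly from the pattern the chapter builds:

```python
import re

# {m} repeats the preceding character class exactly m times.
ssn_pat = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
print(re.findall(ssn_pat, "My SSN is 382-34-3842."))
```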
{
@@ -582,14 +589,18 @@
"id": "h6_CUONr3AJa"
},
"source": [
"Shorthand Quantifiers\n",
": Some commonly used quantifiers have a shorthand:\n",
"Some commonly used quantifiers have a shorthand, as shown in {numref}`Table %s <short-quantifiers>`.\n",
"\n",
":::{table} Shorthand Quantifiers\n",
":name: short-quantifiers\n",
"\n",
"Symbol | Quantifier | Meaning\n",
"--- | --- | ---\n",
" `*` | {0,} | Match the preceding character 0 or more times\n",
" `+` | {1,} | Match the preceding character 1 or more times\n",
" `?` | {0,1} | Match the preceding charcter 0 or 1 times"
" `?` | {0,1} | Match the preceding character 0 or 1 times\n",
"\n",
":::"
]
},
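A quick sketch of the shorthands on a toy string (our own example):

```python
import re

s = "aa ab b"
print(re.findall(r"a+", s))   # '+' means one or more
print(re.findall(r"ab?", s))  # '?' makes the 'b' optional
```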
{
@@ -599,8 +610,7 @@
"id": "16ZfQsaqInak"
},
"source": [
"Quantifiers are greedy\n",
": Quantifiers will return the longest match possible. This sometimes results in\n",
"Quantifiers are greedy and will return the longest match possible. This sometimes results in\n",
"surprising behavior. Since an SSN starts and ends with a digit, we might think\n",
"the following shorter regex will be a simpler approach for finding SSNs. Can\n",
"you figure out what went wrong in the matching?"
@@ -640,27 +650,28 @@
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'ssn_re_bdy' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-13-cc6a6737eb7c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mre\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfindall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mssn_re_bdy\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'My SSN is 382-34-3842 and hers is 382-34-3333.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mNameError\u001b[0m: name 'ssn_re_bdy' is not defined"
]
"data": {
"text/plain": [
"['382-34-3842', '382-34-3333']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"re.findall(ssn_re_bdy, 'My SSN is 382-34-3842 and hers is 382-34-3333.')"
"re.findall(ssn_re, 'My SSN is 382-34-3842 and hers is 382-34-3333.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some platforms allow you to turn off greedy matching and use *lazy* matching, which returns the shortest possible match.\n",
"\n",
"Literal concatenation and quantifiers are two of the core concepts in regular\n",
"expressions. Next, we'll introduce two more core concepts: alternation and grouping."
"expressions. Next, we introduce two more core concepts: alternation and grouping."
]
},
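As a sketch of the difference (our own example; Python's `re` supports lazy quantifiers by appending `?`):

```python
import re

html = "<b>bold</b>"
print(re.findall(r"<.*>", html))   # greedy: spans both tags
print(re.findall(r"<.*?>", html))  # lazy: stops at the first '>'
```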
{
@@ -723,10 +734,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Grouping using parentheses\n",
": A set of parentheses specifies a *regex group*, which allows us to locate multiple parts of a pattern.\n",
"For example, we can use groups to extract the day, month, year, and time from\n",
"the web server log entry."
"With parentheses we can locate parts of a pattern, which are called *regex groups*. For example, we can use regex groups to extract the day, month, year, and time from the web server log entry."
]
},
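As a sketch with an assumed Apache-style timestamp (the chapter's actual log entry may be formatted differently), parenthesized groups pull out each piece:

```python
import re

entry = "[26/Jan/2014:10:47:58 -0800]"  # assumed format, for illustration
pattern = r"\[(\d+)/(\w+)/(\d+):([\d:]+) "
day, month, year, time = re.findall(pattern, entry)[0]
print(day, month, year, time)
```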
{
@@ -810,7 +818,7 @@
"source": [
"The four basic operations for regular expressions (concatenation, quantifying,\n",
"alternation, and grouping) have an order of precedence, which we make explicit\n",
"in the table below. \n",
"in {numref}`Table %s <regex-order>`. \n",
"\n",
":::{table} Order of Operations\n",
":name: regex-order\n",
@@ -832,7 +840,7 @@
"id": "TrPjIHHD8rOC"
},
"source": [
"The following table provides a list of the meta characters introduced in this\n",
"{numref}`Table %s <regex-meta>` provides a list of the meta characters introduced in this\n",
"section, plus a few more. The column labeled \"Doesn't Match\" gives examples of\n",
"strings that the example regexes don't match.\n",
"\n",
@@ -865,7 +873,7 @@
"id": "TrPjIHHD8rOC"
},
"source": [
"Additionally, we provide a table of shorthands for some commonly used character\n",
"Additionally, in {numref}`Table %s <regex-shorthand>`, we provide shorthands for some commonly used character\n",
"sets. These shorthands don't need `[ ]`.\n",
"\n",
":::{table} Character Class Shorthands \n",
@@ -894,7 +902,7 @@
"of a pattern with a substring, and *split* a string into pieces at the pattern.\n",
"Each requires a pattern and a string to be specified, and some have\n",
"extra arguments.\n",
"The table below provides the format of the method usage and a\n",
"{numref}`Table %s <regex-methods>` provides the format of the method usage and a\n",
"description of the return value. "
]
},
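A minimal sketch of a few of these methods on a toy string (our own example):

```python
import re

s = "baby aby babyish"
print(re.search(r"aby", s) is not None)  # does the pattern occur anywhere?
print(re.sub(r"aby", "ABY", s))          # substitute every match
print(re.split(r"\s", s))                # split the string at each match
```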
@@ -920,11 +928,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Regex and pandas\n",
": As seen in the previous section, `pandas` Series objects have a `.str` property\n",
"As we saw in the previous section, `pandas` Series objects have a `.str` property\n",
"that supports string manipulation using Python string methods. Conveniently,\n",
"the `.str` property also supports some functions from the `re` module. The\n",
"table below shows the analogous functionality from the above table of the `re`\n",
"the `.str` property also supports some functions from the `re` module. {numref}`Table %s <regex-pandas>` shows the analogous functionality from {numref}`Table %s <regex-methods>` of the `re`\n",
"methods. Each requires a pattern.\n",
"See [the `pandas` docs][pd_str] for a complete list of \n",
"string methods. \n",
@@ -957,6 +963,7 @@
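As a small sketch (our own toy Series, not the chapter's data), the `.str` methods accept regular expressions, mirroring the `re` module:

```python
import pandas as pd

strings = pd.Series(["call 911", "dial 311", "no number"])
# .str.findall and .str.contains both take a regex pattern.
print(strings.str.findall(r"\d+"))
print(strings.str.contains(r"\d+"))
```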
"\n",
"+ Develop your regular expression on simple test strings to see what the pattern matches.\n",
"+ If a pattern matches nothing, try weakening it by dropping part of the pattern. Then tighten it incrementally to see how the matching evolves. (Online regex checking tools can be very helpful here).\n",
"+ Make the pattern only as specific as it needs to be for the data at hand.\n",
"+ Use raw strings whenever possible for cleaner patterns, especially when a pattern includes a backslash. \n",
"+ When you have lots of long strings, consider using compiled patterns because they can be faster to match (see `compile()` in the `re` library)."
]
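A sketch of the compiled-pattern tip (our own toy example):

```python
import re

# Compiling once avoids re-parsing the pattern for every string.
digits = re.compile(r"\d+")
print([digits.findall(s) for s in ["a1", "b22", "c"]])
```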
@@ -966,10 +973,17 @@
"metadata": {},
"source": [
"In the next section, we carry out an example text analysis.\n",
"We'll clean the data\n",
"We clean the data\n",
"using regular expressions and string manipulation, convert the text into\n",
"quantitative data, and analyze the text via these derived quantities."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
25 changes: 7 additions & 18 deletions content/ch/13/text_sotu.ipynb

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion content/ch/13/text_strings.ipynb
@@ -72,7 +72,7 @@
"\n",
"We show how we can combine these basic operations to clean up the county names data.\n",
"Remember that we have two tables that we want to join, but the county names are\n",
"written inconsistently."
"written inconsistently.\n",
"\n",
"Let's start by converting the county names to a standard format."
]
},
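One plausible standardization is sketched below; this is an assumption for illustration, and the chapter's actual cleaning recipe may differ.

```python
# Assumed recipe: lowercase, drop the word "county", and
# strip punctuation and spaces so variant spellings match.
def clean_county(name):
    return (name.lower()
                .replace("county", "")
                .replace(".", "")
                .replace("&", "and")
                .replace(" ", ""))

print([clean_county(n) for n in ["De Witt County", "DeWitt"]])
```

With this cleaning, both variants map to the same key, so the tables can be joined on the result.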
{
11 changes: 10 additions & 1 deletion content/ch/13/text_summary.ipynb
@@ -31,7 +31,16 @@
"metadata": {},
"source": [
"While powerful, regular expressions are terrible at these types of tasks.\n",
"However, all in all, in our experience, even the basics of text analysis can enable all sorts of interesting analyses---a little bit goes a long way."
"However, all in all, in our experience, even the basics of text manipulation can enable all sorts of interesting analyses---a little bit goes a long way.\n",
"\n",
"We have one final caution about regular expressions: they can be computationally expensive. You will want to weigh the trade-offs between these concise, clear expressions and the overhead they create if they're put into production code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next chapter considers other sorts of data, such as binary formats and the highly structured text of JSON and HTML. Our focus will be on loading these data into data frames and other Python data structures. "
]
},
{
