TR changes and MP edits to chapters 13, 14, and 18
debnolan committed May 8, 2023
1 parent b710b19 commit 7c8a383
Showing 18 changed files with 535 additions and 449 deletions.
4 changes: 2 additions & 2 deletions content/ch/12/pa_intro.ipynb
@@ -72,7 +72,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In contrast, [PurpleAir](https://www2.purpleair.com/) sensors sell for about \\$250 and can be easily installed at home.\n",
"In contrast, [PurpleAir](https://www2.purpleair.com/) sensors, which we first introduced in {numref}`Chapter %s <ch:theory_datadesign>`, sell for about \\$250 and can be easily installed at home.\n",
"With the lower price point, thousands of people across the US have purchased these sensors for personal use. The sensors can connect to a home WiFi network so the air quality can be easily monitored, and they can report data back to PurpleAir.\n",
"In 2020, thousands of owners of PurpleAir sensors made their sensors' measurements publicly available.\n",
"Compared to the AQS sensors, PurpleAir sensors are more timely. They report measurements every two minutes rather than every hour.\n",
@@ -86,7 +86,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this chapter we plan to use the AQS sensor measurements to improve the PurpleAir measurements. It's a big task, and we follow the analysis first developed by [Karoline Barkjohn, Brett Gannt, and Andrea Clements](https://amt.copernicus.org/articles/14/4617/2021/) from the US Environmental Protection Agency.\n",
"In this chapter, we plan to use the AQS sensor measurements to improve the PurpleAir measurements. It's a big task, and we follow the analysis first developed by [Karoline Barkjohn, Brett Gannt, and Andrea Clements](https://amt.copernicus.org/articles/14/4617/2021/) from the US Environmental Protection Agency.\n",
"The work of Barkjohn's group was so successful that, as of this writing, official US government maps, like the AirNow [Fire and Smoke](https://fire.airnow.gov/) map, include both AQS and PurpleAir sensors and apply Barkjohn's correction to the PurpleAir data."
]
},
15 changes: 11 additions & 4 deletions content/ch/13/text_examples.ipynb
@@ -27,7 +27,9 @@
"motivating example. These examples are based on real tasks that we have carried\n",
"out, but to focus on the concept, we've reduced the data to snippets.\n",
"\n",
"*Convert text into a standard format.* Let's say we want to study connections\n",
"## Convert text into a standard format \n",
"\n",
"Let's say we want to study connections\n",
"between population demographics and election results.\n",
"To do this, we've taken election data from Wikipedia and population data from the US Census.\n",
"The granularity of the data is the county level, and we need to use the county names to join the tables.\n",
@@ -170,7 +172,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"*Extract a piece of text to create a feature.*\n",
"## Extract a piece of text to create a feature\n",
"\n",
"Text data sometimes has a lot of structure, especially when it was generated\n",
"by a computer.\n",
"As an example, we've displayed a web server's log entry below.\n",
@@ -194,7 +197,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"*Transform text into features.* In {numref}`Chapter %s <ch:wrangling>`, we\n",
"## Transform text into features\n",
"\n",
"In {numref}`Chapter %s <ch:wrangling>`, we\n",
"created a categorical feature based on the content of the strings. There, we examined the\n",
"descriptions of restaurant violations and we created nominal variables for the\n",
"presence of particular words.\n",
@@ -231,7 +236,9 @@
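To make this concrete, here is a small sketch in `pandas` of building indicator variables for the presence of particular words; the violation descriptions below are invented for illustration and differ from the chapter's actual data.

```python
import pandas as pd

# Invented restaurant-violation descriptions, for illustration only.
desc = pd.Series([
    "unclean or degraded floors walls or ceilings",
    "improperly stored food",
])

# One nominal (0/1) feature per word of interest.
features = pd.DataFrame({
    word: desc.str.contains(word).astype(int) for word in ["food", "floor"]
})
print(features)
```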
"cell_type": "markdown",
"metadata": {},
"source": [
"*Text analysis.* Sometimes we want to compare entire documents.\n",
"## Text analysis\n",
"\n",
"Sometimes we want to compare entire documents.\n",
"For example, the US President gives a State of the Union speech every year. Here are the first few lines of the very first speech:"
]
},
80 changes: 47 additions & 33 deletions content/ch/13/text_regex.ipynb
@@ -169,6 +169,8 @@
"id": "JWMIE3Dz3AI4"
},
"source": [
"These richer patterns are made of character classes and meta characters like wildcards. We describe them here. \n",
"\n",
"Character Classes\n",
": We can make patterns more flexible by using a *character class* \n",
"(also known as a *character set*), which\n",
@@ -402,7 +404,7 @@
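As a toy illustration (ours, not the chapter's data), a character class matches any single character listed between the brackets:

```python
import re

# [0-9] matches exactly one digit, so three classes in a row
# match any three-digit run.
print(re.findall(r"[0-9][0-9][0-9]", "call 911 or 311"))
```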
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 6,
"metadata": {
"colab": {
"autoexec": {
@@ -449,7 +451,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will show how quantifiers can help create a more compact and clear\n",
"Next, we show how quantifiers can help create a more compact and clear\n",
"regular expression for SSNs. "
]
},
@@ -557,7 +559,7 @@
},
"source": [
"A quantifier always modifies the character or character class to its immediate\n",
"left. The following table shows the complete syntax for quantifiers."
"left. {numref}`Table %s <quantifier-ex>` shows the complete syntax for quantifiers."
]
},
{
@@ -567,12 +569,17 @@
"id": "4HXw_UbI3AJY"
},
"source": [
":::{table} Quantifier Examples\n",
":name: quantifier-ex\n",
"\n",
"Quantifier | Meaning\n",
"--- | ---\n",
"{m,n} | Match the preceding character m to n times.\n",
"{m} | Match the preceding character exactly m times.\n",
"{m,} | Match the preceding character at least m times.\n",
"{,n} | Match the preceding character at most n times."
"{,n} | Match the preceding character at most n times.\n",
"\n",
":::"
]
},
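Applied to the SSN example, braces give a compact pattern. This is a sketch of the idea and may differ slightly from the pattern the chapter builds:

```python
import re

# {m} repeats the preceding character class exactly m times.
ssn_pat = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
print(re.findall(ssn_pat, "My SSN is 382-34-3842."))
```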
{
@@ -582,14 +589,18 @@
"id": "h6_CUONr3AJa"
},
"source": [
"Shorthand Quantifiers\n",
": Some commonly used quantifiers have a shorthand:\n",
"Some commonly used quantifiers have a shorthand, as shown in {numref}`Table %s <short-quantifiers>`.\n",
"\n",
":::{table} Shorthand Quantifiers\n",
":name: short-quantifiers\n",
"\n",
"Symbol | Quantifier | Meaning\n",
"--- | --- | ---\n",
" `*` | {0,} | Match the preceding character 0 or more times\n",
" `+` | {1,} | Match the preceding character 1 or more times\n",
" `?` | {0,1} | Match the preceding charcter 0 or 1 times"
" `?` | {0,1} | Match the preceding character 0 or 1 times\n",
"\n",
":::"
]
},
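A quick sketch of the shorthands on a toy string (our own example):

```python
import re

s = "aa ab b"
print(re.findall(r"a+", s))   # '+' means one or more
print(re.findall(r"ab?", s))  # '?' makes the 'b' optional
```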
{
@@ -599,8 +610,7 @@
"id": "16ZfQsaqInak"
},
"source": [
"Quantifiers are greedy\n",
": Quantifiers will return the longest match possible. This sometimes results in\n",
"Quantifiers are greedy and will return the longest match possible. This sometimes results in\n",
"surprising behavior. Since an SSN starts and ends with a digit, we might think\n",
"the following shorter regex will be a simpler approach for finding SSNs. Can\n",
"you figure out what went wrong in the matching?"
@@ -640,27 +650,28 @@
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'ssn_re_bdy' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-13-cc6a6737eb7c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mre\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfindall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mssn_re_bdy\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'My SSN is 382-34-3842 and hers is 382-34-3333.'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mNameError\u001b[0m: name 'ssn_re_bdy' is not defined"
]
"data": {
"text/plain": [
"['382-34-3842', '382-34-3333']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"re.findall(ssn_re_bdy, 'My SSN is 382-34-3842 and hers is 382-34-3333.')"
"re.findall(ssn_re, 'My SSN is 382-34-3842 and hers is 382-34-3333.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some platforms allow you to turn off greedy matching and use *lazy* matching, which returns the shortest possible match.\n",
"\n",
"Literal concatenation and quantifiers are two of the core concepts in regular\n",
"expressions. Next, we'll introduce two more core concepts: alternation and grouping."
"expressions. Next, we introduce two more core concepts: alternation and grouping."
]
},
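As a sketch of the difference (our own example; Python's `re` supports lazy quantifiers by appending `?`):

```python
import re

html = "<b>bold</b>"
print(re.findall(r"<.*>", html))   # greedy: spans both tags
print(re.findall(r"<.*?>", html))  # lazy: stops at the first '>'
```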
{
@@ -723,10 +734,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Grouping using parentheses\n",
": A set of parentheses specifies a *regex group*, which allows us to locate multiple parts of a pattern.\n",
"For example, we can use groups to extract the day, month, year, and time from\n",
"the web server log entry."
"With parentheses we can locate parts of a pattern, which are called *regex groups*. For example, we can use regex groups to extract the day, month, year, and time from the web server log entry."
]
},
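As a sketch with an assumed Apache-style timestamp (the chapter's actual log entry may be formatted differently), parenthesized groups pull out each piece:

```python
import re

entry = "[26/Jan/2014:10:47:58 -0800]"  # assumed format, for illustration
pattern = r"\[(\d+)/(\w+)/(\d+):([\d:]+) "
day, month, year, time = re.findall(pattern, entry)[0]
print(day, month, year, time)
```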
{
@@ -810,7 +818,7 @@
"source": [
"The four basic operations for regular expressions (concatenation, quantifying,\n",
"alternation, and grouping) have an order of precedence, which we make explicit\n",
"in the table below. \n",
"in {numref}`Table %s <regex-order>`. \n",
"\n",
":::{table} Order of Operations\n",
":name: regex-order\n",
@@ -832,7 +840,7 @@
"id": "TrPjIHHD8rOC"
},
"source": [
"The following table provides a list of the meta characters introduced in this\n",
"{numref}`Table %s <regex-meta>` provides a list of the meta characters introduced in this\n",
"section, plus a few more. The column labeled \"Doesn't Match\" gives examples of\n",
"strings that the example regexes don't match.\n",
"\n",
@@ -865,7 +873,7 @@
"id": "TrPjIHHD8rOC"
},
"source": [
"Additionally, we provide a table of shorthands for some commonly used character\n",
"Additionally, in {numref}`Table %s <regex-shorthand>`, we provide shorthands for some commonly used character\n",
"sets. These shorthands don't need `[ ]`.\n",
"\n",
":::{table} Character Class Shorthands \n",
@@ -894,7 +902,7 @@
"of a pattern with a substring, and *split* a string into pieces at the pattern.\n",
"Each requires a pattern and a string to be specified, and some have\n",
"extra arguments.\n",
"The table below provides the format of the method usage and a\n",
"{numref}`Table %s <regex-methods>` provides the format of the method usage and a\n",
"description of the return value. "
]
},
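A minimal sketch of a few of these methods on a toy string (our own example):

```python
import re

s = "baby aby babyish"
print(re.search(r"aby", s) is not None)  # does the pattern occur anywhere?
print(re.sub(r"aby", "ABY", s))          # substitute every match
print(re.split(r"\s", s))                # split the string at each match
```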
@@ -920,11 +928,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Regex and pandas\n",
": As seen in the previous section, `pandas` Series objects have a `.str` property\n",
"As we saw in the previous section, `pandas` Series objects have a `.str` property\n",
"that supports string manipulation using Python string methods. Conveniently,\n",
"the `.str` property also supports some functions from the `re` module. The\n",
"table below shows the analogous functionality from the above table of the `re`\n",
"the `.str` property also supports some functions from the `re` module. {numref}`Table %s <regex-pandas>` shows the analogous functionality from {numref}`Table %s <regex-methods>` of the `re`\n",
"methods. Each requires a pattern.\n",
"See [the `pandas` docs][pd_str] for a complete list of \n",
"string methods. \n",
@@ -957,6 +963,7 @@
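As a small sketch (our own toy Series, not the chapter's data), the `.str` methods accept regular expressions, mirroring the `re` module:

```python
import pandas as pd

strings = pd.Series(["call 911", "dial 311", "no number"])
# .str.findall and .str.contains both take a regex pattern.
print(strings.str.findall(r"\d+"))
print(strings.str.contains(r"\d+"))
```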
"\n",
"+ Develop your regular expression on simple test strings to see what the pattern matches.\n",
"+ If a pattern matches nothing, try weakening it by dropping part of the pattern. Then tighten it incrementally to see how the matching evolves. (Online regex checking tools can be very helpful here).\n",
"+ Make the pattern only as specific as it needs to be for the data at hand.\n",
"+ Use raw strings whenever possible for cleaner patterns, especially when a pattern includes a backslash. \n",
"+ When you have lots of long strings, consider using compiled patterns because they can be faster to match (see `compile()` in the `re` library)."
]
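A sketch of the compiled-pattern tip (our own toy example):

```python
import re

# Compiling once avoids re-parsing the pattern for every string.
digits = re.compile(r"\d+")
print([digits.findall(s) for s in ["a1", "b22", "c"]])
```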
@@ -966,10 +973,17 @@
"metadata": {},
"source": [
"In the next section, we carry out an example text analysis.\n",
"We'll clean the data\n",
"We clean the data\n",
"using regular expressions and string manipulation, convert the text into\n",
"quantitative data, and analyze the text via these derived quantities."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
25 changes: 7 additions & 18 deletions content/ch/13/text_sotu.ipynb

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion content/ch/13/text_strings.ipynb
@@ -72,7 +72,7 @@
"\n",
"We show how we can combine these basic operations to clean up the county names data.\n",
"Remember that we have two tables that we want to join, but the county names are\n",
"written inconsistently."
"written inconsistently.\n",
"\n",
"Let's start by converting the county names to a standard format."
]
},
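One plausible standardization is sketched below; this is an assumption for illustration, and the chapter's actual cleaning recipe may differ.

```python
# Assumed recipe: lowercase, drop the word "county", and
# strip punctuation and spaces so variant spellings match.
def clean_county(name):
    return (name.lower()
                .replace("county", "")
                .replace(".", "")
                .replace("&", "and")
                .replace(" ", ""))

print([clean_county(n) for n in ["De Witt County", "DeWitt"]])
```

With this cleaning, both variants map to the same key, so the tables can be joined on the result.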
{
11 changes: 10 additions & 1 deletion content/ch/13/text_summary.ipynb
@@ -31,7 +31,16 @@
"metadata": {},
"source": [
"While powerful, regular expressions are terrible at these types of tasks.\n",
"However, all in all, in our experience, even the basics of text analysis can enable all sorts of interesting analyses---a little bit goes a long way."
"However, all in all, in our experience, even the basics of text manipulation can enable all sorts of interesting analyses---a little bit goes a long way.\n",
"\n",
"We have one final caution about regular expressions: they can be computationally expensive. You will want to weigh the trade-offs between these concise, clear expressions and the overhead they create if they're put into production code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next chapter considers other sorts of data, such as binary formats and the highly structured text of JSON and HTML. Our focus will be on loading these data into data frames and other Python data structures. "
]
},
{
