Commit

adding TR changes to Chapter 8
debnolan committed Apr 28, 2023
1 parent d3f5ad8 commit 0638a75
Showing 8 changed files with 149 additions and 118 deletions.
26 changes: 13 additions & 13 deletions content/ch/08/files_command_line.ipynb
@@ -29,7 +29,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Nearly all computers provide access to a *shell interpreter*, such as `sh` or `bash` or `zsh`. These interpreters typically perform operations on the files on a computer, with their own language, syntax, and built-in commands.\n",
"Nearly all computers provide access to a *shell interpreter*, such as `sh` or `bash` or `zsh`. These interpreters typically perform operations on the files on a computer with their own language, syntax, and built-in commands.\n",
"\n",
"We use the term *command-line interface (CLI) tools* to refer to the commands available in a shell interpreter. Although we only cover a few CLI tools here, there are many useful CLI tools that enable all sorts of operations on files. For instance, the following command in the `bash` shell\n",
"produces a list of all the files in the `figures/` folder for this chapter along with their file sizes:\n",
@@ -62,7 +62,7 @@
"CLI tools often take one or more *arguments*, similar to how Python functions\n",
"take arguments.\n",
"In the shell, we separate arguments with spaces, not with\n",
"parentheses and commas.\n",
"parentheses or commas.\n",
"The arguments appear at the end of the command line, and they are\n",
"usually the name of a file or some text. In the `ls` example above, the\n",
"argument to `ls` is `figures/`. Additionally, CLI tools support *flags* that\n",
@@ -97,11 +97,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We begin with an exploration of the file system containing the content for this chapter, using the `ls` tool."
"We begin with an exploration of the file system containing the content for this chapter, using the `ls` tool:"
]
},
{
"cell_type": "markdown",
"cell_type": "raw",
"metadata": {},
"source": [
"```\n",
@@ -120,7 +120,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To dive deeper and list the files in the `data/` directory, we provide the directory name as an argument to `ls`."
"To dive deeper and list the files in the `data/` directory, we provide the directory name as an argument to `ls`:"
]
},
{
@@ -158,7 +158,7 @@
"source": [
":::{note}\n",
"\n",
"When working with datasets in this book, our code will often use an additional `-L` flag for `ls` and other CLI tools, such as `du`. We do this because we set up the datasets in our book using shortcuts (called symlinks). Usually, your code won't need the `-L` flag unless you're working with symlinks too. \n",
"When working with data sets in this book, our code will often use an additional `-L` flag for `ls` and other CLI tools, such as `du`. We do this because we set up the data sets in our book using shortcuts (called symlinks). Usually, your code won't need the `-L` flag unless you're working with symlinks too. \n",
"\n",
":::"
]
@@ -167,7 +167,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Other CLI tools for checking the size of files, are `wc` and `du`. The command `wc` (short for word count) provides helpful information about a file's size in terms of the number of lines, words, and characters in the file."
"Other CLI tools for checking the size of files are `wc` and `du`. The command `wc` (short for word count) provides helpful information about a file's size in terms of the number of lines, words, and characters in the file:"
]
},
{
@@ -192,7 +192,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ls` tool does not calculate the cumulative size of the contents of a folder. To properly calculate the total size of a folder, including the files in the folder, we use `du` (short for disk usage). By default, the `du` tool shows the size in units called blocks."
"The `ls` tool does not calculate the cumulative size of the contents of a folder. To properly calculate the total size of a folder, including the files in the folder, we use `du` (short for disk usage). By default, the `du` tool shows the size in units called blocks:"
]
},
{
@@ -211,7 +211,7 @@
"metadata": {},
"source": [
"We commonly add the `-s` flag to `du` to show the file sizes for both files and folders and the `-h` flag to display quantities in the standard\n",
"KiB, MiB, GiB format. The asterisk in `data/*` below tells `du` to show the size of every item in the `data` folder."
"KiB, MiB, GiB format. The asterisk in `data/*` below tells `du` to show the size of every item in the `data` folder:"
]
},
{
@@ -236,7 +236,7 @@
"source": [
"To check the formatting of a file, we can examine the first few lines with the `head` command, or the last few lines with `tail`. These CLIs are very useful for peeking at a\n",
"file's contents to determine whether it's formatted as a CSV, TSV, etc. As an example, let's\n",
"look at the `inspections.csv` file."
"look at the `inspections.csv` file:"
]
},
{
@@ -263,7 +263,7 @@
"\n",
"We can print the entire file’s contents using the `cat` command. However, you\n",
"should take care when using this command, as printing a large file can cause a crash.\n",
"The `legend.csv` file is small, and we can use `cat` to concatenate and print its contents."
"The `legend.csv` file is small, and we can use `cat` to concatenate and print its contents:"
]
},
{
@@ -292,7 +292,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, the `file` command can help use determine a file's encoding."
"Finally, the `file` command can help us determine a file's encoding:"
]
},
{
@@ -351,8 +351,8 @@
"\n",
"Error reduction\n",
": if you want to reduce typographical errors and other simple but potentially harmful mistakes\n",
"Reproducibility\n",
"\n",
"Reproducibility\n",
": if you need to repeat the same process in the future or you\n",
" plan to share your process with others, you have a record of your actions\n",
" \n",
10 changes: 5 additions & 5 deletions content/ch/08/files_datasets.ipynb
@@ -29,7 +29,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We have selected two examples to demonstrate file wrangling concepts: a government survey about drug abuse and administrative data from the City of San Francisco about restaurant inspections. Before we start wrangling, we give an overview of the data scope for these examples ({numref}`Chapter %s <ch:data_scope>`)."
"We have selected two examples to demonstrate file wrangling concepts: a government survey about drug abuse and administrative data from the City of San Francisco about restaurant inspections. Before we start wrangling, we give an overview of the data scope for these examples (see {numref}`Chapter %s <ch:data_scope>`)."
]
},
{
@@ -62,7 +62,7 @@
"abuse, accidental ingestion, suicide attempts, malicious poisonings, and\n",
"adverse reactions. For each visit, the record may contain up to 16 different drugs, including illegal drugs, prescription drugs, and over-the-counter medications. \n",
"\n",
"The source file for this dataset is an example of fixed-width formatting that rquires a codebook to decipher. Also, it is a reasonably large file and so motivates the topic of how to find a file's size. And the granularity is unusual because an ER visit, not a person, is the subject of investigation. \n",
"The source file for this dataset is an example of fixed-width formatting that requires external documentation, like a codebook, to decipher. Also, it is a reasonably large file and so motivates the topic of how to find a file's size. And the granularity is unusual because an ER visit, not a person, is the subject of investigation. \n",
"\n",
"The San Francisco restaurant files have other characteristics that make them a good example for this chapter."
]
@@ -76,7 +76,7 @@
"The [San Francisco Department of Public Health](https://www.sfdph.org/dph/default2.asp) routinely makes unannounced\n",
"visits to restaurants and inspects them for food safety. The inspector\n",
"calculates a score based on the violations found and provides descriptions\n",
"of them. The target population here is all\n",
"of the violations. The target population here is all\n",
"restaurants in the City of San Francisco. These restaurants are accessed\n",
"through a frame of restaurant inspections that were conducted between 2013 and\n",
"2016. Some restaurants have multiple inspections in a year, and not all of the\n",
@@ -89,7 +89,7 @@
"quality of life and work for residents, employers, employees and visitors.\n",
"\n",
"The City of San Francisco requires restaurants to publicly display their scores\n",
"(see {numref}`Figure %s <scoreCard>` below for an example placard)[^CARDS]. These data offer an example of multiple files with different structures, fields, and granularity. One dataset contains summary results of inspections, another\n",
"(see {numref}`Figure %s <scoreCard>` for an example placard)[^CARDS]. These data offer an example of multiple files with different structures, fields, and granularity. One dataset contains summary results of inspections, another\n",
"provides details about the violations found, and a third\n",
"contains general information about the restaurants. The violations include both serious\n",
"problems related to the transmission of foodborne illnesses and minor issues such as not properly displaying the\n",
@@ -117,7 +117,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Both the DAWN survey data and the San Francisco restaurant inspection data are available online as plain text files. However, their formats are different, and in the next section, we demonstrate how to figure out a file format so that we can read the data into a data frame."
"Both the DAWN survey data and the San Francisco restaurant inspection data are available online as plain text files. However, their formats are quite different, and in the next section, we demonstrate how to figure out a file format so that we can read the data into a data frame."
]
},
{
8 changes: 4 additions & 4 deletions content/ch/08/files_encoding.ipynb
@@ -65,10 +65,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When we don't know what the encoding is, we have to make a guess. The `chardet`\n",
"When we don't know the encoding, we have to make a guess. The `chardet`\n",
"package has a function called `detect()` that infers a file's encoding.\n",
"Since these guesses are imperfect, the function also returns a confidence\n",
"between 0 and 1. We use this function to look at the files for our examples."
"between 0 and 1. We use this function to look at the files from our examples:"
]
},
{
@@ -110,7 +110,7 @@
"The detection function is quite certain that all but one of the files are\n",
"ASCII encoded. The exception is `businesses.csv`, which appears to have an ISO-8859-1\n",
"encoding. We run into trouble if we ignore this encoding and try to read the\n",
"business file into Pandas without specifying the special encoding.\n",
"business file into Pandas without specifying the special encoding:\n",
"\n",
"```python\n",
"# naively reads file without considering encoding\n",
@@ -125,7 +125,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To successfully read the data, we must specify the ISO-8859-1 encoding."
"To successfully read the data, we must specify the ISO-8859-1 encoding:"
]
},
{
