Clean up ch08 and ch09
SamLau95 committed May 23, 2023
1 parent a6fcb3e commit c6e88d7
Showing 9 changed files with 2,574 additions and 2,476 deletions.
6 changes: 6 additions & 0 deletions content/_static/custom.css
@@ -178,6 +178,12 @@ pre {
white-space: pre-wrap;
}

table,
.table {
display: block;
overflow: auto;
}

.table p {
margin-bottom: 0;
}
118 changes: 45 additions & 73 deletions content/ch/03/theory_measurement_error.ipynb

Large diffs are not rendered by default.

57 changes: 45 additions & 12 deletions content/ch/08/files_command_line.ipynb
@@ -18,6 +18,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -26,6 +27,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -35,18 +37,22 @@
"produces a list of all the files in the `figures/` folder for this chapter along with their file sizes:\n",
"\n",
"```bash\n",
"ls -l -h figures/\n",
"# The dollar sign is the shell prompt, showing the user where to type. It's\n",
"# not part of the command itself.\n",
"$ ls -l -h figures/\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The basic syntax for a shell command is:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -56,6 +62,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -76,6 +83,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -94,18 +102,20 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We begin with an exploration of the file system containing the content for this chapter, using the `ls` tool:"
]
},
{
"cell_type": "raw",
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"ls\n",
"$ ls\n",
"\n",
"data wrangling_granularity.ipynb\n",
"figures wrangling_intro.ipynb \n",
@@ -117,18 +127,20 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To dive deeper and list the files in the `data/` directory, we provide the directory name as an argument to `ls`:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"ls -l -L -h data/\n",
"$ ls -l -L -h data/\n",
"\n",
"total 556664\n",
"-rw-r--r-- 1 nolan staff 267M Dec 10 14:03 DAWN-Data.txt\n",
@@ -141,6 +153,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -153,6 +166,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -164,49 +178,55 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Other CLI tools for checking the size of files are `wc` and `du`. The command `wc` (short for word count) provides helpful information about a file's size in terms of the number of lines, words, and characters in the file:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"wc data/DAWN-Data.txt\n",
"$ wc data/DAWN-Data.txt\n",
"\n",
" 229211 22695570 280095842 data/DAWN-Data.txt\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see from the output that DAWN-Data.txt has 229211 lines and 280095842 characters. (The middle value is the file's word count, which is useful for files that contain sentences and paragraphs but not very useful for files containing data, such as FWF-formatted values.)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ls` tool does not calculate the cumulative size of the contents of a folder. To properly calculate the total size of a folder, including the files in the folder, we use `du` (short for disk usage). By default, the `du` tool shows the size in units called blocks:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"du -L data/\n",
"$ du -L data/\n",
"\n",
"556664\tdata/\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -215,11 +235,12 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"du -Lsh data/*\n",
"$ du -Lsh data/*\n",
"\n",
"267M\tdata/DAWN-Data.txt\n",
"648K\tdata/businesses.csv\n",
@@ -231,6 +252,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -240,11 +262,12 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"head -4 data/inspections.csv\n",
"$ head -4 data/inspections.csv\n",
"\n",
"\"business_id\",\"score\",\"date\",\"type\"\n",
"19,\"94\",\"20160513\",\"routine\"\n",
@@ -254,6 +277,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -267,11 +291,12 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"cat data/legend.csv\n",
"$ cat data/legend.csv\n",
"\n",
"\"Minimum_Score\",\"Maximum_Score\",\"Description\"\n",
"0,70,\"Poor\"\n",
@@ -282,25 +307,28 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In many cases, using `head` or `tail` alone gives us a good enough sense of the file structure to proceed with loading it into a data frame."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, the `file` command can help us determine a file's encoding:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"file -I data/*\n",
"$ file -I data/*\n",
"\n",
"data/DAWN-Data.txt: text/plain; charset=us-ascii\n",
"data/businesses.csv: application/csv; charset=iso-8859-1\n",
@@ -312,6 +340,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -320,6 +349,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -335,6 +365,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -343,6 +374,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -361,6 +393,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -380,7 +413,7 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -394,7 +427,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4"
"version": "3.10.11"
}
},
"nbformat": 4,
