{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Berkeley Data Science Modules: Introduction to Data Science in Python\n",
"<img src=\"https://data.berkeley.edu/sites/default/files/styles/openberkeley_brand_widgets_rectangle/public/john_and_ani_community_photo_dsc_7214.jpg?itok=6Hjv1irR\" style=\"width: 500px; height: 350px;\"/>\n",
"\n",
"\n",
"### Table of Contents\n",
"<a href='#section 0'>Welcome to Jupyter Notebooks!</a>\n",
"\n",
"1. <a href='#section 1'>The Python Programming Language</a>\n",
"\n",
" a. <a href='#subsection 1a'>Expressions</a> and <a href='#subsection error'>Errors</a>\n",
"\n",
" b. <a href='#subsection 1b'>Names</a>\n",
"\n",
" c. <a href='#subsection 1c'>Functions</a>\n",
"\n",
" d. <a href='#subsection 1d'>Sequences</a>\n",
"<br><br>\n",
"2. <a href='#section 2'>Tables</a>\n",
"\n",
" a. <a href='#subsection 2a'>Attributes</a>\n",
"\n",
" b. <a href='#subsection 2b'>Transformations</a><br><br>\n",
"\n",
"3. <a href='#section 3'>Coming Soon...</a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Jupyter Notebook <a id='section 0'></a>\n",
"\n",
"Welcome to the Jupyter Notebook! **Notebooks** are documents that can contain text, code, visualizations, and more. \n",
"\n",
"A notebook is composed of rectangular sections called **cells**. There are two kinds of cells: Markdown and code. A **Markdown cell**, such as this one, contains text. A **code cell** contains code in Python, the programming language that we will be using for the remainder of this module. You can select any cell by clicking it once. After a cell is selected, you can navigate the notebook using the up and down arrow keys.\n",
"\n",
"To run a code cell once it's been selected, \n",
"- press Shift-Enter, or\n",
"- click the Run button in the toolbar at the top of the screen. \n",
"\n",
"If a code cell is running, you will see an asterisk (\\*) appear in the square brackets to the left of the cell. Once the cell has finished running, a number will replace the asterisk and any output from the code will appear under the cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# run this cell\n",
"print(\"Hello World!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll notice that many code cells contain lines of blue text that start with a `#`. These are *comments*. Comments often contain helpful information about what the code does or what you are supposed to do in the cell. The leading `#` tells the computer to ignore them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Editing\n",
"\n",
"You can edit a Markdown cell by clicking it twice. Text in Markdown cells is written in [**Markdown**](https://daringfireball.net/projects/markdown/), a formatting syntax for plain text, so you may see some funky symbols when you edit a text cell. \n",
"\n",
"Once you've made your changes, you can exit text editing mode by running the cell. Edit the next cell to fix the misspelling."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Go Baers!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Code cells can be edited any time after they are highlighted. Try editing the next code cell to print your name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# edit the code to print your name\n",
"print(\"Hello: my name is NAME\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Saving and Loading\n",
"\n",
"Your notebook can record all of your text and code edits, as well as any graphs you generate or calculations you make. You can save the notebook in its current state by pressing Control-S, by clicking the floppy disk icon in the toolbar at the top of the page, or by going to the File menu and selecting \"Save and Checkpoint\".\n",
"\n",
"The next time you open the notebook, it will look the same as when you last saved it.\n",
"\n",
"**Note:** after loading a notebook you will see all the outputs (graphs, computations, etc.) from your last session, but you won't be able to use any variables you assigned or functions you defined. You can get the functions and variables back by re-running the cells where they were defined. The easiest way is to highlight the cell where you left off work, then go to the Cell menu at the top of the screen and click \"Run all above\". You can also use this menu to run all cells in the notebook by clicking \"Run all\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Completing the Notebooks\n",
"\n",
"As you navigate the notebooks, you'll see cells with bold, all-capitalized headings that need to be filled in to complete the notebook. There are two types:\n",
"<div class=\"alert alert-warning\">\n",
"<b>EXERCISE</b> cells require you to write code to solve a problem, or to write short responses analyzing a graph or the result of a computation.\n",
"</div>\n",
"\n",
"<div class=\"alert alert-info\">\n",
"<b>PRACTICE</b> cells provide spaces to try out new coding skills at your own pace, unrelated to the case study. Since each coding skill taught in these notebooks is necessary for analyzing the cases, practice cells are a good way to get comfortable before applying those skills to real data.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we begin, we'll need a few extra tools to conduct our analysis. Run the next cell to load some code packages that we'll use later. \n",
"\n",
"Note: this cell MUST be run in order for most of the rest of the notebook to work."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# dependencies: THIS CELL MUST BE RUN\n",
"from datascience import *\n",
"import numpy as np\n",
"import math\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('fivethirtyeight')\n",
"import ipywidgets as widgets\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Python <a id='section 1'></a>\n",
"\n",
"**Python** is a programming language: a way for us to communicate with the computer and give it instructions. \n",
"\n",
"Just like any language, Python has a *vocabulary* made up of words it can understand, and a *syntax* giving the rules for how to structure communication.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Errors <a id=\"subsection error\"></a>\n",
"\n",
"Python is a language, and like natural human languages, it has rules. It differs from natural language in two important ways:\n",
"1. The rules are *simple*. You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.\n",
"2. The rules are *rigid*. If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running Python code is not smart enough to do that.\n",
"\n",
"When you write code, you will often accidentally break some of these rules. When you run a code cell that doesn't follow every rule exactly, Python will produce an **error message**.\n",
"\n",
"Errors are *normal*; experienced programmers make many errors every day. Errors are also *not dangerous*; you will not break your computer by making an error (in fact, errors are a big part of how you learn a coding language). An error is nothing more than a message from the computer saying it doesn't understand you and asking you to rewrite your command.\n",
"\n",
"We have made an error in the next cell. Run it and see what happens."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"This line is missing something.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see something like this (minus our annotations):\n",
"\n",
"<img src=\"images/error.jpg\"/>\n",
"\n",
"The last line of the error output attempts to tell you what went wrong. The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure. \"`EOF`\" means \"end of file,\" so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.\n",
"\n",
"There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it. (Of course, if you're frustrated, you can usually find out what it means by searching for the error message online or posting on Piazza.)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### 1a. Data <a id='subsection 1a'></a>\n",
"**Data** is information: the \"stuff\" we manipulate to make and test hypotheses. \n",
"\n",
"Almost all data you will work with broadly falls into two types: numbers and text. *Numerical data* shows up green in code cells and can be positive, negative, or include a decimal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Numerical data\n",
"\n",
"4\n",
"\n",
"87623000983\n",
"\n",
"-667\n",
"\n",
"3.14159"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Text data (also called *strings*) shows up red in code cells. Strings are enclosed in double or single quotes. Note that numbers can appear in strings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Strings\n",
"\"a\"\n",
"\n",
"\"Hi there!\"\n",
"\n",
"\"We hold these truths to be self-evident that all men are created equal.\"\n",
"\n",
"# this is a string, NOT numerical data\n",
"\"3.14159\""
]
},
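{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between a number and a string that merely looks like a number shows up as soon as you operate on them. The short sketch below is only an illustration (the names `as_number` and `as_text` are made up for this example): adding numbers does arithmetic, while adding strings joins them end to end.\n",
"\n",
"```python\n",
"# the same characters, once as numerical data and once as strings\n",
"as_number = 3.5 + 1.5\n",
"as_text = '3.5' + '1.5'\n",
"\n",
"print(as_number)  # arithmetic on numbers\n",
"print(as_text)    # strings are joined, not added\n",
"```\n",
"\n",
"If you copy this into a code cell, the first line of output should be `5.0` and the second `3.51.5`."
]
},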
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1a. Expressions <a id='subsection 1a'></a>\n",
"\n",
"A bit of communication in Python is called an **expression**. It tells the computer what to do with the data we give it.\n",
"\n",
"Here's an example of an expression."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# an expression\n",
"14 + 20"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"When you run the cell, the computer **evaluates** the expression and displays the result. Note that only the value of the last line in a code cell is displayed, unless you explicitly tell the computer to print a result using `print`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# more expressions. what gets printed and what doesn't?\n",
"100 / 10\n",
"\n",
"print(4.3 + 10.98)\n",
"\n",
"33 - 9 * (40000 + 1)\n",
"\n",
"884"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many basic arithmetic operations are built into Python, like `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division). There are many others, which you can find information about [here](http://www.inferentialthinking.com/chapters/03/1/expressions.html). \n",
"\n",
"The computer evaluates arithmetic according to the PEMDAS order of operations (just like you probably learned in middle school): anything in parentheses is done first, followed by exponents, then multiplication and division, and finally addition and subtraction."
]
},
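{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked illustration of the order of operations, the sketch below breaks one expression into the steps Python would take. The step names are made up for this example, and it uses `**`, Python's exponent operator.\n",
"\n",
"```python\n",
"# evaluate 2 + 3 * (8 - 2) ** 2 / 4 one step at a time\n",
"parentheses = 8 - 2              # parentheses first\n",
"exponent = parentheses ** 2      # then exponents\n",
"multiplication = 3 * exponent    # then multiplication and division, left to right\n",
"division = multiplication / 4\n",
"addition = 2 + division          # addition and subtraction last\n",
"\n",
"# Python reaches the same answer when given the whole expression at once\n",
"whole = 2 + 3 * (8 - 2) ** 2 / 4\n",
"print(addition, whole)\n",
"```\n",
"\n",
"Both printed values should be `29.0`. The result has a decimal point because `/` always produces a decimal number in Python 3."
]
},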
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# before you run this cell, can you say what it should print?\n",
"4 - 2 * (1 + 6 / 3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"<b>PRACTICE:</b> If you're new to Python and coding, one of the best ways to get comfortable is to practice. Try writing and running different expressions in the cell below using numbers and the arithmetic operators `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division). See if you can generate different error messages and figure out what they mean.\n",
" </div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: try out different arithmetic operations\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1b. Names <a id='subsection 1b'></a>\n",
"Sometimes, the values you work with can get cumbersome: maybe the expression that gives the value is very complicated, or maybe the value itself is long. In these cases it's useful to give the value a **name**.\n",
"\n",
"We can name values using what's called an *assignment* statement."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# assigns 442 to x\n",
"x = 442"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The assignment statement has three parts. On the left is the *name* (`x`). On the right is the *value* (442). The *equals sign* in the middle tells the computer to assign the value to the name.\n",
"\n",
"You'll notice that when you run the cell with the assignment, it doesn't print anything. But, if we try to access `x` again in the future, it will have the value we assigned it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# print the value of x\n",
"x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also assign names to expressions. The computer will compute the expression and assign the name to the result of the computation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y = 50 * 2 + 1\n",
"y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then use these names as if they were numbers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x - 42"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x + y"
]
},
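{
"cell_type": "markdown",
"metadata": {},
"source": [
"One detail of assignment is worth seeing in action: a name is attached to the *result* of an expression, not to the expression itself. In the sketch below (the names `width` and `doubled` are made up for illustration), giving `width` a new value afterwards leaves `doubled` untouched.\n",
"\n",
"```python\n",
"width = 10\n",
"doubled = width * 2   # the expression is computed now, so doubled is 20\n",
"width = 50            # giving width a new value...\n",
"print(doubled)        # ...does not change doubled\n",
"```\n",
"\n",
"This should print `20`. To get an updated value, you would need to run the assignment to `doubled` again."
]
},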
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"<b>PRACTICE:</b>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: experiment with assigning names and doing arithmetic operations with named variables\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1c. Functions <a id='subsection 1c'></a>\n",
"We've seen that values can have names (often called **variables**), but operations may also have names. A named operation is called a **function**. Python has some functions built into it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a built-in function \n",
"round"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Functions get used in *call expressions*, where a function is named and given values to operate on inside a set of parentheses. The `round` function returns the number it was given, rounded to the nearest whole number."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a call expression using round\n",
"round(1988.74699)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The values a function is called on are its *arguments*, and a function may take more than one. For instance, the `min` function takes two or more arguments (as many as you'd like) and returns the smallest. Multiple arguments are separated by commas."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"min(9, -34, 0, 99)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"<b>PRACTICE</b>\n",
"<ul>\n",
" <li>The `abs` function takes one argument (just like `round`)</li>\n",
" <li>The `max` function takes two or more arguments (just like `min`)</li>\n",
"</ul>\n",
"\n",
"\n",
"Try calling `abs` and `max` in the cell below. What does each function do?\n",
"\n",
"Also try calling each function *incorrectly*, such as with the wrong number of arguments. What kinds of error messages do you see?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# replace the ... with calls to abs and max\n",
"..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Dot Notation\n",
"Python has a lot of [built-in functions](https://docs.python.org/3/library/functions.html) (that is, functions that are already named and defined in Python), but even more functions are stored in collections called *modules*. Earlier, we imported the `math` module so we could use it later. Once a module is imported, you can use its functions by typing the name of the module, then the name of the function you want from it, separated with a `.`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a call expression with the factorial function from the math module\n",
"math.factorial(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-warning\">\n",
"<b>EXERCISE:</b> `math` also has a function called `sqrt` that takes one argument and returns the square root. Call `sqrt` on 16 in the next cell.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# use math.sqrt to get the square root of 16\n",
"..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Tables <a id='section 2'></a>\n",
"\n",
"The last section covered four basic concepts of Python: data, expressions, names, and functions. In this next section, we'll see just how much we can do to examine and manipulate our data with only these minimal Python skills."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Tables** are fundamental ways of organizing and displaying data. Run the next cell to load the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ratings = Table.read_table(\"data/imdb_ratings.csv\")\n",
"ratings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This table is organized into **columns**, one for each *category* of information collected.\n",
"\n",
"You can also think about the table in terms of its **rows**. Each row represents all the information collected about a particular instance, which can be a person, location, action, or other unit. \n",
"\n",
"What do the rows in this table represent?\n",
"\n",
"By default only the first ten rows are shown. Can you see how many rows there are in total?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2a. Table Attributes <a id='subsection 2a'></a>\n",
"\n",
"Every table has **attributes** that give information about the table, like the number of rows and the number of columns. Table attributes are accessed using dot notation. But, since an attribute doesn't perform an operation on the table, there are no parentheses (like there would be in a call expression).\n",
"\n",
"Attributes you'll use frequently include `num_rows` and `num_columns`, which give the number of rows and columns in the table, respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get the number of columns\n",
"ratings.num_columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"<b>PRACTICE:</b> Use `num_rows` to get the number of rows in our table.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get the number of rows in the table\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2b. Table Transformation <a id='subsection 2b'></a>\n",
"\n",
"Not all of our columns are relevant to every question we want to ask. We can save computational resources and avoid confusion by *transforming* our table before we start work.\n",
"\n",
"#### Subsetting columns with `select` and `drop`\n",
"The `select` function is used to get a table containing only particular columns. `select` is called on a table using dot notation and takes one or more arguments: the name or names of the column or columns you want."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# make a new table with only selected columns\n",
"ratings.select(\"Votes\", \"Title\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If instead you need all columns except a few, the `drop` function can get rid of specified columns. `drop` works very similarly to `select`: call it on the table using dot notation, then give it the name or names of what you want to drop."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# drop a column\n",
"ratings.drop(\"Decade\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-warning\">\n",
"<b>EXERCISE:</b> Pick two columns from our table. Create a new table containing only those two columns two different ways: once using `select` and once using `drop`. \n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# use select\n",
"..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# use drop\n",
"..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Filtering rows with `where`\n",
"Some analysis questions only deal with a subset of rows.\n",
"\n",
"The **`where`** function allows us to choose certain rows based on two arguments:\n",
"- A column label\n",
"- A condition that each row should match, called the _predicate_ \n",
"\n",
"In other words, we call the `where` function like so: `table_name.where(column_name, predicate)`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get a subset of rows\n",
"ratings.where(\"Decade\", are.equal_to(1950))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many types of predicates, but some of the more common ones are:\n",
"\n",
"|Predicate|Example|Result|\n",
"|-|-|-|\n",
"|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|\n",
"|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|\n",
"|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|\n",
"|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|\n",
"|`are.below`|`are.below(50)`|Find rows with values below 50|\n",
"|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|\n"
]
},
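{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for what a predicate does, here is a plain-Python sketch of the same idea. It is *not* how the `datascience` library is implemented: the rows are made-up data, and `equal_to`, `above`, and this `where` are hypothetical stand-ins written only for illustration. A predicate is a small function that answers True or False for one value, and filtering keeps the rows whose value passes.\n",
"\n",
"```python\n",
"# a tiny stand-in for a table: each dictionary is one row (made-up data)\n",
"rows = [\n",
"    {'Title': 'Film A', 'Decade': 1950, 'Score': 8.1},\n",
"    {'Title': 'Film B', 'Decade': 1990, 'Score': 8.9},\n",
"    {'Title': 'Film C', 'Decade': 1950, 'Score': 8.8},\n",
"]\n",
"\n",
"def equal_to(target):\n",
"    # build a predicate that is True when a value equals the target\n",
"    return lambda value: value == target\n",
"\n",
"def above(target):\n",
"    # build a predicate that is True when a value is strictly greater\n",
"    return lambda value: value > target\n",
"\n",
"def where(rows, column, predicate):\n",
"    # keep only the rows whose value in the given column passes the predicate\n",
"    return [row for row in rows if predicate(row[column])]\n",
"\n",
"fifties = where(rows, 'Decade', equal_to(1950))\n",
"top_rated = where(rows, 'Score', above(8.7))\n",
"print([row['Title'] for row in fifties])\n",
"print([row['Title'] for row in top_rated])\n",
"```\n",
"\n",
"The first list should contain Film A and Film C, and the second Film B and Film C. The table's own `where` follows the same pattern: a column label plus a predicate built with `are`."
]
},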
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# example 2: get a subset of rows\n",
"ratings.where(\"Rank\", are.above(8.7))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-warning\">\n",
"<b> EXERCISE:</b> Describe what happened in each of the two examples above. Which rows were filtered out? Give an example where we would want to use those filters for analysis.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**YOUR RESPONSE HERE**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Coming up... <a id='section 3'></a>\n",
"\n",
"Knowing these few basic concepts about Python and Tables will help you interact with data in upcoming parts of the module. Here's a preview of the kinds of visualizations and operations coming up:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# make a bar plot of the max rank per decade\n",
"ratings.select(\"Rank\", \"Decade\").group(\"Decade\", max).barh(\"Decade\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# make many bar plots showing feature averages per decade\n",
"def avg_by_decade(feature):\n",
"    return ratings.select(feature, \"Decade\").group(\"Decade\", np.average).barh(\"Decade\")\n",
"\n",
"# create the toggle buttons for the widget\n",
"buttons = widgets.ToggleButtons(options=[\"Rank\", \"Votes\"])\n",
"\n",
"# create the widget to view plots for different parameter values\n",
"display(widgets.interactive(avg_by_decade, feature=buttons))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# load the Capital bike-sharing data set\n",
"bikes = Table.read_table(\"data/day_renamed.csv\")\n",
"bikes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# look at the correlation between temperature and casual rider numbers\n",
"bikes.select(\"casual\", \"temp\").scatter(\"temp\", fit_line=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# compare scatter plots for casual and registered riders for different predictor variables\n",
"def scatter_bikes(predictor, response, fit_line):\n",
"    if response == \"both\":\n",
"        b = bikes.select(\"registered\", \"casual\", predictor)\n",
"    else:\n",
"        b = bikes.select(response, predictor)\n",
"    return b.scatter(predictor, fit_line=fit_line)\n",
"\n",
"# create the dropdown menus for the widget\n",
"predict_widget = widgets.Dropdown(options=[\"humidity\", \"windspeed\", \"temp\"],\n",
"                                  value=\"humidity\")\n",
"response_widget = widgets.Dropdown(options=[\"casual\", \"registered\", \"both\"],\n",
"                                   value=\"casual\")\n",
"fitline_widget = widgets.Dropdown(options=[True, False],\n",
"                                  value=False)\n",
"\n",
"# create the widget to view plots for different parameter values\n",
"display(widgets.interactive(scatter_bikes, predictor=predict_widget, response=response_widget, fit_line=fitline_widget))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### References\n",
"\n",
"- Sections of \"Intro to Jupyter\", \"Table Transformation\" adapted from materials by Kelly Chen and Ashley Chien in [UC Berkeley Data Science Modules core resources](http://github.com/ds-modules/core-resources)\n",
"- \"A Note on Errors\" subsection and \"error\" image adapted from materials by Chris Hench and Mariah Rogers for the Medieval Studies 250: Text Analysis for Graduate Medievalists [data science module](https://github.com/ds-modules/MEDST-250).\n",
"- Rocket Fuel data and discussion questions adapted from materials by Zsolt Katona and Brian Bell, BerkeleyHaas Case Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Author: Keeley Takimoto"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}