Showing 1 changed file with 42 additions and 43 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,97 +1,96 @@ | ||
(extra_reading)= | ||
# Additional Material | ||
|
||
We cover a lot of topics in this book, and if you want to learn more about any of them, we have collected a list of additional resources, including textbooks and online tutorials. | ||
We cover a lot of topics in this book, and if you want to learn more about them, we have collected a list of additional resources that include textbooks and online tutorials. | ||
|
||
For an overview of the larger themes in this book see: | ||
More in-depth treatments of the larger themes in this book can be found in the following resources. | ||
|
||
+ [*Sampling: Design and Analysis*](https://doi.org/10.1201/9780429298899) by Lohr for topics in scientific sampling; | ||
+ The sampling topics introduced in this book, and several more, can be found in [*Sampling: Design and Analysis*](https://doi.org/10.1201/9780429298899) by Lohr, which also covers the population, access frame, sampling methods, and sources of bias. | |
|
||
+ [*Statistics*](https://wwnorton.com/books/Statistics/) by Freedman, Pisani, and Purves is useful for introductory statistics related to the urn model; | ||
+ For an introductory treatment of the urn model, confidence intervals, and hypothesis tests, we recommend [*Statistics*](https://wwnorton.com/books/Statistics/) by Freedman, Pisani, and Purves. | ||
|
||
+ [*Probability*](https://doi.org/10.1007/978-1-4612-4374-8) by Pitman and [*Introduction to Probaqbility*](https://doi.org/10.1201/b17221) by Hwang and Blitzstein for a more mathematical treatment of probability; | ||
+ For a more mathematical treatment of probability that is still introductory, we suggest [*Probability*](https://doi.org/10.1007/978-1-4612-4374-8) by Pitman and [*Introduction to Probability*](https://doi.org/10.1201/b17221) by Hwang and Blitzstein. | |
|
||
+ [*Principles of Data Wrangling: Practical Techniques for Data Preparation*](https://www.oreilly.com/library/view/principles-of-data/9781491938911/) by Rattenbury, Hellerstein, Heer, Kandel, and Carreras for more on data wrangling; | ||
+ A resource for data wrangling is [*Principles of Data Wrangling: Practical Techniques for Data Preparation*](https://www.oreilly.com/library/view/principles-of-data/9781491938911/) by Rattenbury, Hellerstein, Heer, Kandel, and Carreras. Many of the organizational topics of wrangling stem from this resource. | ||
|
||
+ [* *]() by for Pandas | ||
|
||
+ [* *]() by for SQL | ||
+ For SQL, see [*The Essence of Databases*](https://dl.acm.org/doi/book/10.5555/274800) by Roland and the W3 Schools tutorial [Introduction to SQL](https://www.w3schools.com/sql/sql_intro.asp). | |
|
||
+ [*Exploratory Data Analysis*](https://archive.org/details/exploratorydataa00tuke_0) by Tukey for EDA; | ||
+ The original text by Tukey, [*Exploratory Data Analysis*](https://archive.org/details/exploratorydataa00tuke_0), offers an introduction to the topic. A more modern treatment can be found in XXXX. | |
|
||
+ [*Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures*](https://clauswilke.com/dataviz/) by Wilke for more on visualization; | ||
+ See [*Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures*](https://clauswilke.com/dataviz/) by Wilke for more on visualization. Our guidelines do not entirely match Wilke's, but they come close, and it's helpful to see a variety of opinions on the topic. | |
|
||
+ [*Linear Models with Python*](https://julianfaraway.github.io/LMP/) by Faraway, [*Applied Regression Analysis and Generalized Linear Models*](https://us.sagepub.com/en-us/nam/applied-regression-analysis-and-generalized-linear-models/book237254) by Fox, [*An Introduction to Statistical Learning: With Applications in Python*](https://www.statlearning.com/) by James, Witten, Hastie, Tibshirani, and Taylor, and [*Applied Linear Regression*](https://doi.org/10.1002/0471704091) by Weisberg for more on modeling, transformations, bootstrap, and regularization. | ||
+ The many topics on modeling, including transformations, one-hot encoding, model selection, cross-validation, and regularization, can be found in several sources. We recommend [*Linear Models with Python*](https://julianfaraway.github.io/LMP/) by Faraway, [*Applied Regression Analysis and Generalized Linear Models*](https://us.sagepub.com/en-us/nam/applied-regression-analysis-and-generalized-linear-models/book237254) by Fox, [*An Introduction to Statistical Learning: With Applications in Python*](https://www.statlearning.com/) by James, Witten, Hastie, Tibshirani, and Taylor, and [*Applied Linear Regression*](https://doi.org/10.1002/0471704091) by Weisberg. | |
|
||
+ [*Mathematical Statistics and Data Analysis*](https://www.cengage.com/c/mathematical-statistics-and-data-analysis-3e-rice/9780534399429/) by Rice for more on confidence intervals and testing. | ||
+ A more formal treatment of confidence intervals, prediction intervals, testing, and the bootstrap can be found in [*Mathematical Statistics and Data Analysis*](https://www.cengage.com/c/mathematical-statistics-and-data-analysis-3e-rice/9780534399429/) by Rice. | |
|
||
+ [*Monte Carlo theory, methods and examples*](https://artowen.su.domains/mc/) by Owen to learn more about simulation; | ||
+ Owen's online text, [*Monte Carlo theory, methods and examples*](https://artowen.su.domains/mc/), provides a solid introduction to simulation. | |
|
||
+ [*Programming Collective Intelligence*](https://www.oreilly.com/library/view/programming-collective-intelligence/9780596529321/) by Segaran for more on optimization. | ||
+ [*Programming Collective Intelligence*](https://www.oreilly.com/library/view/programming-collective-intelligence/9780596529321/) by Segaran covers the topic of optimization. | ||
|
||
+ [*Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning*](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/) by Bengfort, Bilbro, and Ojeda for more on text analysis. | ||
+ See [*Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning*](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/) by Bengfort, Bilbro, and Ojeda for more on text analysis. | ||
|
||
In addition, we provide a list of references for several smaller topics and topics that were lightly touched upon. | ||
In addition, we provide a list of resources for many smaller topics and for topics that were lightly touched upon. | ||
|
||
+ To learn more about the interplay between questions and data, we recommend [Questions, Answers, and Statistics](https://iase-web.org/documents/papers/icots2/Speed.pdf) by Speed. | ||
+ To learn more about the interplay between questions and data, we recommend [Questions, Answers, and Statistics](https://iase-web.org/documents/papers/icots2/Speed.pdf) by Speed. In addition, Leek and Peng connect questions with the type of analysis in [What is the question? Mistaking the type of question being considered is the most common error in data analysis](https://doi.org/10.1126/science.aaa6146). | |
|
||
+ To learn more about how to analyze data with a time domain, we refer you to [*Time Series Analysis and Its Applications*](https://doi.org/10.1007/978-3-319-52452-8) by Shumway and Stoffer. | ||
+ For the broad topic of how to analyze time-series data, we refer you to [*Time Series Analysis and Its Applications*](https://doi.org/10.1007/978-3-319-52452-8) by Shumway and Stoffer. | |
|
||
+ Ethics | ||
+ To learn more about the human contexts and ethics of data, see the [HCE Toolkit](https://data.berkeley.edu/hce-toolkit) and Tuskegee University's [National Center for Bioethics in Research and Health Care](https://www.tuskegee.edu/about-us/centers-of-excellence/bioethics-center). | ||
|
||
+ Data privacy - SAM? | ||
|
||
+ A proof that the median minimizes absolute error can be found in [*Mathematical Statistics: Basic Ideas and Selected Topics Volume I*](https://www.routledge.com/Mathematical-Statistics-Basic-Ideas-and-Selected-Topics-Volume-I-Second/Bickel-Doksum/p/book/9781498723800) by Bickel and Doksum. | ||
|
||
+ For more information about how to handle missing data, see [*Statistical Analysis with Missing Data*](https://www.wiley.com/en-us/Statistical+Analysis+with+Missing+Data,+3rd+Edition-p-9780470526798) by Little and Rubin. | ||
|
||
+ The smooth density curve is covered in greater detail in [*Density Estimation for Statistics and Data Analysis*](https://www.routledge.com/Density-Estimation-for-Statistics-and-Data-Analysis/Silverman/p/book/9780412246203) by Silverman. | ||
+ The smooth density curve is covered in detail in [*Density Estimation for Statistics and Data Analysis*](https://www.routledge.com/Density-Estimation-for-Statistics-and-Data-Analysis/Silverman/p/book/9780412246203) by Silverman. | ||
|
||
+ For more information on color palettes see Brewer's [ColorBrewer2.0](https://colorbrewer2.org/). | ||
+ For more information on color palettes see Brewer's online [ColorBrewer2.0](https://colorbrewer2.org/). | ||
|
||
+ An in-depth treatment of loss functions can be found in Chapter 12 of [*All of Statistics: A Concise Course in Statistical Inference*](https://doi.org/10.1007/978-0-387-21736-9) by Wasserman. | ||
+ An in-depth treatment of loss functions and risk can be found in Chapter 12 of [*All of Statistics: A Concise Course in Statistical Inference*](https://doi.org/10.1007/978-0-387-21736-9) by Wasserman. | ||
|
||
+ See [Statistical Calibration: A Review](https://doi.org/10.2307/1403690) by Osborne for more on calibration. | ||
|
||
+ Chapter 10 in Fox gives an informative treatment of vector geometry of least squares. | ||
|
||
+ Chapter 13 in Fox and Chapter 10 in James et al cover Principal Components. | ||
|
||
+ Chapter 14 in Fox covers the maximum likelihood approach to logistic regression. | ||
|
||
+ Chapter 4 in James et al covers sensitivity and specificity in more detail. | ||
|
||
+ For practice with regular expressions, there are many online resources, such as the W3 Schools tutorial [Python RegEx](https://www.w3schools.com/python/python_regex.asp), regular expression checkers like [Regular Expressions 101](https://regex101.com/), introductions to the topic such as [An introduction to regular expressions](https://www.oreilly.com/content/an-introduction-to-regular-expressions/) by Nield, and texts like [*Mastering Regular Expressions*](https://dl.acm.org/doi/10.5555/1209014) by Friedl. | |
|
||
Regular expressions | ||
|
||
PCAA | ||
|
||
netCDF | ||
|
||
Parquet | ||
|
||
http REST | ||
+ For an online tutorial on how to work with netCDF climate data see [The Beauty of NetCDF](https://www.youtube.com/watch?v=UvNBnjiTXa0) | ||
by Tompkins. | ||
|
||
Risk | ||
+ There are many resources on web services, such as HTTP and REST. Some accessible introductory material can be found at [*RESTful Web Services*](https://dl.acm.org/doi/10.5555/1406352) | ||
by Richardson and Ruby. | ||
|
||
CV | ||
+ For more on broken-stick regression see [Bent-Cable Regression Theory and Applications](https://doi.org/10.1198/016214505000001177) by Chiu, Lockhart and Routledge. | ||
|
||
Broken stick regression | ||
+ For an interesting read, see Andrew Ng's [interview](https://spectrum.ieee.org/andrew-ng-xrays-the-ai-hype) on the gap between test sets and real world use. | ||
|
||
Polynomial regression | ||
+ Chapter 7 of James et al introduces polynomial regression using orthogonal polynomials. | ||
|
||
Bias-variance decomposition | ||
+ Information about rank tests and other nonparametric statistics can be found in [*Nonparametric Rank Tests*](https://doi.org/10.1007/978-3-642-04898-2_417_) by Hettmansperger. | ||
|
||
Rank tests | ||
+ [The ASA Statement on p-Values: Context, Process, and Purpose](https://doi.org/10.1080/00031305.2016.1154108) by Wasserstein and Lazar provides valuable insights into how to interpret $p$-values. Additionally, the topic of $p$-hacking is addressed in [The Statistical Crisis in Science](https://doi.org/10.1511/2014.111.460) by Gelman and Loken. | |
|
||
Faraway cautions | ||
+ For a fun explanation of confounding variables see the [xkcd cartoon](https://www.explainxkcd.com/wiki/index.php/2560:_Confounding_Variables) and its explanation. | ||
|
||
P-value ASA | ||
+ For more on XML, we recommend [*XML and Web Technologies for Data Sciences with R*](https://doi.org/10.1007/978-1-4614-7900-0) by Nolan and Temple Lang. | ||
|
||
P-hacking | ||
+ For more on the technique for simple models to use in the field, see [The lost art of nomography](https://deadreckonings.files.wordpress.com/2008/01/nomography.pdf) by Doerfler. | ||
|
||
Prediction intervals | ||
+ Simpson's paradox | ||
|
||
AB testing | ||
+ Weighted Regression | ||
|
||
Donkey field | ||
+ Reproducible research | ||
|
||
Data privacy | ||
+ For an informative talk by Ramdas on bias, Simpson's paradox, $p$-hacking, and other topics, see the [screencast](https://www.youtube.com/watch?v=wGcjGH-zIL4) and [slides](https://drive.google.com/file/d/0B7gkaDYGT5X5c245RV93MVRRSjQ/view?resourcekey=0-8nQDM50Tta2SuLkFqAXEqQ). | |
|
||
|