Update additional resources
SamLau95 committed May 29, 2023
1 parent 0a4fabe commit 9811b9f
Showing 1 changed file with 42 additions and 46 deletions.
88 changes: 42 additions & 46 deletions content/additional_resources.md
@@ -1,88 +1,84 @@
(extra_reading)=
# Additional Material

Collected here are a variety of resources for a more in-depth treatment of the larger themes in this book. In addition to recommendations for these broad topics, we provide a list of resources for several smaller topics and big topics that we only lightly touched on. These resources are organized in the order in which the topics appear in the book.

+ For how to analyze time-series data, like the Google Flu trends, we refer you to [*Time Series Analysis and Its Applications*](https://doi.org/10.1007/978-3-319-52452-8) by Shumway and Stoffer.

+ To learn more about the interplay between questions and data, we recommend [Questions, Answers, and Statistics](https://iase-web.org/documents/papers/icots2/Speed.pdf) by Speed. In addition, Leek and Peng connect questions with the type of analysis needed in [What is the question? Mistaking the type of question being considered is the most common error in data analysis](https://doi.org/10.1126/science.aaa6146).
# Additional Material

+ More on sampling topics can be found in [*Sampling: Design and Analysis*](https://doi.org/10.1201/9780429298899) by Lohr, which also covers the target population, access frame, sampling methods, and sources of bias.

+ To learn more about the human contexts and ethics of data, see the [HCE Toolkit](https://data.berkeley.edu/hce-toolkit) and Tuskegee University's [National Center for Bioethics in Research and Health Care](https://www.tuskegee.edu/about-us/centers-of-excellence/bioethics-center).
Collected here are a variety of resources for a more in-depth treatment of the larger themes in this book. In addition to recommendations for these broad topics, we provide a list of resources for several smaller topics and big topics that we only lightly touched on. These resources are organized in the order in which the topics appear in the book.

+ Data privacy - SAM
- For how to analyze time-series data, like the Google Flu trends, we refer you to [_Time Series Analysis and Its Applications_](https://doi.org/10.1007/978-3-319-52452-8) by Shumway and Stoffer.

+ Ramdas gave an informative talk on bias, Simpson's paradox, p-hacking, and related topics; see the [screencast](https://www.youtube.com/watch?v=wGcjGH-zIL4) and [slides](https://drive.google.com/file/d/0B7gkaDYGT5X5c245RV93MVRRSjQ/view?resourcekey=0-8nQDM50Tta2SuLkFqAXEqQ).
- To learn more about the interplay between questions and data, we recommend [Questions, Answers, and Statistics](https://iase-web.org/documents/papers/icots2/Speed.pdf) by Speed. In addition, Leek and Peng connect questions with the type of analysis needed in [What is the question? Mistaking the type of question being considered is the most common error in data analysis](https://doi.org/10.1126/science.aaa6146).

+ For an introductory treatment of the urn model, confidence intervals, and hypothesis tests, we recommend [*Statistics*](https://wwnorton.com/books/Statistics/) by Freedman, Pisani, and Purves.
- More on sampling topics can be found in [_Sampling: Design and Analysis_](https://doi.org/10.1201/9780429298899) by Lohr, which also covers the target population, access frame, sampling methods, and sources of bias.

+ Owen's online text, [*Monte Carlo theory, methods and examples*](https://artowen.su.domains/mc/), provides a solid introduction to simulation.
- To learn more about the human contexts and ethics of data, see the [HCE Toolkit](https://data.berkeley.edu/hce-toolkit) and Tuskegee University's [National Center for Bioethics in Research and Health Care](https://www.tuskegee.edu/about-us/centers-of-excellence/bioethics-center).

+ For a fuller treatment of probability, we suggest [*Probability*](https://doi.org/10.1007/978-1-4612-4374-8) by Pitman and [*Introduction to Probability*](https://doi.org/10.1201/b17221) by Blitzstein and Hwang.
- To learn more about data privacy, see XXX (TODO(sam): fill this in)

+ A proof that the median minimizes absolute error can be found in [*Mathematical Statistics: Basic Ideas and Selected Topics Volume I*](https://www.routledge.com/Mathematical-Statistics-Basic-Ideas-and-Selected-Topics-Volume-I-Second/Bickel-Doksum/p/book/9781498723800) by Bickel and Doksum.
- Ramdas gave an informative talk on bias, Simpson's paradox, p-hacking, and related topics; see the [screencast](https://www.youtube.com/watch?v=wGcjGH-zIL4) and [slides](https://drive.google.com/file/d/0B7gkaDYGT5X5c245RV93MVRRSjQ/view?resourcekey=0-8nQDM50Tta2SuLkFqAXEqQ).

+ [* *]() by for Pandas SAM
- For an introductory treatment of the urn model, confidence intervals, and hypothesis tests, we recommend [_Statistics_](https://wwnorton.com/books/Statistics/) by Freedman, Pisani, and Purves.

+ The classic [*The Essence of Databases*](https://dl.acm.org/doi/book/10.5555/274800) by Rolland offers a formal introduction to SQL, and the basics can be found in W3Schools' [Introduction to SQL](https://www.w3schools.com/sql/sql_intro.asp).
- Owen's online text, [_Monte Carlo theory, methods and examples_](https://artowen.su.domains/mc/), provides a solid introduction to simulation.

+ A good resource for data wrangling can be found in [*Principles of Data Wrangling: Practical Techniques for Data Preparation*](https://www.oreilly.com/library/view/principles-of-data/9781491938911/) by Rattenbury, Hellerstein, Heer, Kandel, and Carreras.
- For a fuller treatment of probability, we suggest [_Probability_](https://doi.org/10.1007/978-1-4612-4374-8) by Pitman and [_Introduction to Probability_](https://doi.org/10.1201/b17221) by Blitzstein and Hwang.

+ For how to handle missing data, see [*Statistical Analysis with Missing Data*](https://www.wiley.com/en-us/Statistical+Analysis+with+Missing+Data,+3rd+Edition-p-9780470526798) by Little and Rubin.
- A proof that the median minimizes absolute error can be found in [_Mathematical Statistics: Basic Ideas and Selected Topics Volume I_](https://www.routledge.com/Mathematical-Statistics-Basic-Ideas-and-Selected-Topics-Volume-I-Second/Bickel-Doksum/p/book/9781498723800) by Bickel and Doksum.

+ The original text by Tukey, [*Exploratory Data Analysis*](https://archive.org/details/exploratorydataa00tuke_0), offers an introduction to the topic.
- [_Python for Data Analysis_](https://wesmckinney.com/book/) by Wes McKinney provides in-depth coverage of `pandas`.

+ The smooth density curve is covered in detail in [*Density Estimation for Statistics and Data Analysis*](https://www.routledge.com/Density-Estimation-for-Statistics-and-Data-Analysis/Silverman/p/book/9780412246203) by Silverman.
- The classic [_The Essence of Databases_](https://dl.acm.org/doi/book/10.5555/274800) by Rolland offers a formal introduction to SQL, and the basics can be found in W3Schools' [Introduction to SQL](https://www.w3schools.com/sql/sql_intro.asp). [_Designing Data-Intensive Applications_](https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/) by Kleppmann surveys and compares different data storage systems, including SQL databases.

+ See [*Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures*](https://clauswilke.com/dataviz/) by Wilke for more on visualization. Our guidelines do not entirely match Wilke's but they come close and it's helpful to see a variety of opinions on the topic.
- A good resource for data wrangling can be found in [_Principles of Data Wrangling: Practical Techniques for Data Preparation_](https://www.oreilly.com/library/view/principles-of-data/9781491938911/) by Rattenbury, Hellerstein, Heer, Kandel, and Carreras.

+ To learn more about color palettes see Brewer's online [ColorBrewer2.0](https://colorbrewer2.org/).
- For how to handle missing data, see [_Statistical Analysis with Missing Data_](https://www.wiley.com/en-us/Statistical+Analysis+with+Missing+Data,+3rd+Edition-p-9780470526798) by Little and Rubin.

+ See [Statistical Calibration: A Review](https://doi.org/10.2307/1403690) by Osborne for more on calibration.
- The original text by Tukey, [_Exploratory Data Analysis_](https://archive.org/details/exploratorydataa00tuke_0), offers an introduction to the topic.

+ For practice with regular expressions, there are many online resources, such as the W3Schools tutorial [Python RegEx](https://www.w3schools.com/python/python_regex.asp), regular expression checkers like [Regular Expressions 101](https://regex101.com/), and introductions to the topic like [An introduction to regular expressions](https://www.oreilly.com/content/an-introduction-to-regular-expressions/) by Nield. For a text, see [*Mastering Regular Expressions*](https://dl.acm.org/doi/10.5555/1209014) by Friedl.
- The smooth density curve is covered in detail in [_Density Estimation for Statistics and Data Analysis_](https://www.routledge.com/Density-Estimation-for-Statistics-and-Data-Analysis/Silverman/p/book/9780412246203) by Silverman.

+ Chapter 13 in [Fox](https://us.sagepub.com/en-us/nam/applied-regression-analysis-and-generalized-linear-models/book237254) and Chapter 10 in [James, et al.](https://www.statlearning.com/) discuss Principal Components. (See below for the titles of these references.)
- See [_Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures_](https://clauswilke.com/dataviz/) by Wilke for more on visualization. Our guidelines do not entirely match Wilke's but they come close and it's helpful to see a variety of opinions on the topic.

+ Tompkins has an online tutorial on how to work with netCDF climate data; see [The Beauty of NetCDF](https://www.youtube.com/watch?v=UvNBnjiTXa0).
- To learn more about color palettes see Brewer's online [ColorBrewer2.0](https://colorbrewer2.org/).

+ There are many resources on web services. Some accessible introductory material can be found in [*RESTful Web Services*](https://dl.acm.org/doi/10.5555/1406352) by Richardson and Ruby.
- See [Statistical Calibration: A Review](https://doi.org/10.2307/1403690) by Osborne for more on calibration.

+ For more on XML, we recommend [*XML and Web Technologies for Data Sciences with R*](https://doi.org/10.1007/978-1-4614-7900-0) by Nolan and Temple Lang.
- For practice with regular expressions, there are many online resources, such as the W3Schools tutorial [Python RegEx](https://www.w3schools.com/python/python_regex.asp), regular expression checkers like [Regular Expressions 101](https://regex101.com/), and introductions to the topic like [An introduction to regular expressions](https://www.oreilly.com/content/an-introduction-to-regular-expressions/) by Nield. For a text, see [_Mastering Regular Expressions_](https://dl.acm.org/doi/10.5555/1209014) by Friedl.

+ The many topics related to modeling, including transformations, one-hot encoding, model-selection, cross-validation, and regularization can be found in several sources. We recommend: [*Linear Models with Python*](https://julianfaraway.github.io/LMP/) by Faraway, [*Applied Regression Analysis and Generalized Linear Models*](https://us.sagepub.com/en-us/nam/applied-regression-analysis-and-generalized-linear-models/book237254) by Fox, [*An Introduction to Statistical Learning: With Applications in Python*](https://www.statlearning.com/) by James, Witten, Hastie, Tibshirani, and Taylor, and [*Applied Linear Regression*](https://doi.org/10.1002/0471704091) by Weisberg.
- Chapter 13 in [Fox](https://us.sagepub.com/en-us/nam/applied-regression-analysis-and-generalized-linear-models/book237254) and Chapter 10 in [James, et al.](https://www.statlearning.com/) discuss Principal Components. (See below for the titles of these references.)

+ Chapter 10 in Fox gives an informative treatment of vector geometry of least squares.
- Tompkins has an online tutorial on how to work with netCDF climate data; see [The Beauty of NetCDF](https://www.youtube.com/watch?v=UvNBnjiTXa0).

+ xkcd has a fun cartoon explaining [confounding variables](https://www.explainxkcd.com/wiki/index.php/2560:_Confounding_Variables).
- There are many resources on web services. Some accessible introductory material can be found in [_RESTful Web Services_](https://dl.acm.org/doi/10.5555/1406352) by Richardson and Ruby.

+ Chapter 12 in Fox and Chapter 5 in Faraway cover the topic of weighted regression.
- For more on XML, we recommend [_XML and Web Technologies for Data Sciences with R_](https://doi.org/10.1007/978-1-4614-7900-0) by Nolan and Temple Lang.

+ Andrew Ng's [interview](https://spectrum.ieee.org/andrew-ng-xrays-the-ai-hype) is an interesting read on the gap between the test set and the real world.
- The many topics related to modeling, including transformations, one-hot encoding, model-selection, cross-validation, and regularization can be found in several sources. We recommend: [_Linear Models with Python_](https://julianfaraway.github.io/LMP/) by Faraway, [_Applied Regression Analysis and Generalized Linear Models_](https://us.sagepub.com/en-us/nam/applied-regression-analysis-and-generalized-linear-models/book237254) by Fox, [_An Introduction to Statistical Learning: With Applications in Python_](https://www.statlearning.com/) by James, Witten, Hastie, Tibshirani, and Taylor, and [_Applied Linear Regression_](https://doi.org/10.1002/0471704091) by Weisberg.

+ Chapter 7 of James, et al. introduces polynomial regression using orthogonal polynomials.
- Chapter 10 in Fox gives an informative treatment of vector geometry of least squares.

+ For more on broken-stick regression see [Bent-Cable Regression Theory and Applications](https://doi.org/10.1198/016214505000001177) by Chiu, Lockhart and Routledge.
- xkcd has a fun cartoon explaining [confounding variables](https://www.explainxkcd.com/wiki/index.php/2560:_Confounding_Variables).

+ A more formal treatment of confidence intervals, prediction intervals, testing, and the bootstrap can be found in [*Mathematical Statistics and Data Analysis*](https://www.cengage.com/c/mathematical-statistics-and-data-analysis-3e-rice/9780534399429/) by Rice.
- Chapter 12 in Fox and Chapter 5 in Faraway cover the topic of weighted regression.

+ [The ASA Statement on p-Values: Context, Process, and Purpose](https://doi.org/10.1080/00031305.2016.1154108) by Wasserstein and Lazar provides valuable insights into the $p$-value. Additionally, the topic of p-hacking is addressed in [The Statistical Crisis in Science](https://doi.org/10.1511/2014.111.460) by Gelman and Loken.
- Andrew Ng's [interview](https://spectrum.ieee.org/andrew-ng-xrays-the-ai-hype) is an interesting read on the gap between the test set and the real world.

+ Information about rank tests and other nonparametric statistics can be found in [*Nonparametric Rank Tests*](https://doi.org/10.1007/978-3-642-04898-2_417_) by Hettmansperger.
- Chapter 7 of James, et al. introduces polynomial regression using orthogonal polynomials.

+ The technique for developing linear models to use in the field is addressed in [The lost art of nomography](https://deadreckonings.files.wordpress.com/2008/01/nomography.pdf) by Doerfler.
- For more on broken-stick regression see [Bent-Cable Regression Theory and Applications](https://doi.org/10.1198/016214505000001177) by Chiu, Lockhart and Routledge.

+ Chapter 14 in Fox covers the maximum likelihood approach to logistic regression, and Chapter 4 in James, et al. covers sensitivity and specificity in more detail.
- A more formal treatment of confidence intervals, prediction intervals, testing, and the bootstrap can be found in [_Mathematical Statistics and Data Analysis_](https://www.cengage.com/c/mathematical-statistics-and-data-analysis-3e-rice/9780534399429/) by Rice.

+ An in-depth treatment of loss functions and risk can be found in Chapter 12 of [*All of Statistics: A Concise Course in Statistical Inference*](https://doi.org/10.1007/978-0-387-21736-9) by Wasserman.

+ [*Programming Collective Intelligence*](https://www.oreilly.com/library/view/programming-collective-intelligence/9780596529321/) by Segaran covers the topic of optimization.
- [The ASA Statement on p-Values: Context, Process, and Purpose](https://doi.org/10.1080/00031305.2016.1154108) by Wasserstein and Lazar provides valuable insights into the $p$-value. Additionally, the topic of p-hacking is addressed in [The Statistical Crisis in Science](https://doi.org/10.1511/2014.111.460) by Gelman and Loken.

+ See [*Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning*](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/) by Bengfort, Bilbro, and Ojeda for more on text analysis.
- Information about rank tests and other nonparametric statistics can be found in [_Nonparametric Rank Tests_](https://doi.org/10.1007/978-3-642-04898-2_417_) by Hettmansperger.

- The technique for developing linear models to use in the field is addressed in [The lost art of nomography](https://deadreckonings.files.wordpress.com/2008/01/nomography.pdf) by Doerfler.

- Chapter 14 in Fox covers the maximum likelihood approach to logistic regression, and Chapter 4 in James, et al. covers sensitivity and specificity in more detail.

- An in-depth treatment of loss functions and risk can be found in Chapter 12 of [_All of Statistics: A Concise Course in Statistical Inference_](https://doi.org/10.1007/978-0-387-21736-9) by Wasserman.

- [_Programming Collective Intelligence_](https://www.oreilly.com/library/view/programming-collective-intelligence/9780596529321/) by Segaran covers the topic of optimization.

- See [_Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning_](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/) by Bengfort, Bilbro, and Ojeda for more on text analysis.
