Clean up data sources
SamLau95 committed May 30, 2023
1 parent 70c8607 commit a60b81a
Showing 1 changed file with 39 additions and 40 deletions.
79 changes: 39 additions & 40 deletions content/data_sources.md
@@ -1,75 +1,74 @@
(ax:data_source)=
# Data Sources

All of the data analyzed in this book are available on the book's website, [LearningDS.org](https://learningds.org/) and on the [GitHub repository](https://github.com/DS-100/textbook/) for the book. These datasets are from open repositories and from individuals. We acknowledge them all here, and include, as appropriate, the file name for the data stored in our repository, a link to the original source, a related publication, and the author(s)/owner(s).
All of the data analyzed in this book are available on the book's website, [LearningDS.org](https://learningds.org/) and on the [GitHub repository](https://github.com/DS-100/textbook/) for the book. These datasets are from open repositories and from individuals. We acknowledge them all here, and include, as appropriate, the file name for the data stored in our repository, a link to the original source, a related publication, and the author(s)/owner(s).

To begin, we provide the sources for the four case studies in the book. Our analysis of the data in these case studies is based on research articles or, in one case, a blog post. We generally follow the line of inquiry in these sources, but we have usually simplified the analyses to match the level of the book.
To begin, we provide the sources for the four case studies in the book. Our analysis of the data in these case studies is based on research articles or, in one case, a blog post. We generally follow the line of inquiry in these sources, but we have usually simplified the analyses to match the level of the book.

+ `seattle_bus_times.csv` The Seattle Transit data were provided by Hallenbeck of the [Washington State Transportation Center](https://depts.washington.edu/trac/). Our analysis is
based on [The Waiting Time Paradox, or, Why Is My Bus Always Late?](https://jakevdp.github.io/blog/2018/09/13/waiting-time-paradox/#:~:text=It%20turns%20out%20that%20under,as%20the%20waiting%20time%20paradox) by VanderPlas.
- `seattle_bus_times.csv`: The Seattle Transit data were provided by Hallenbeck of the [Washington State Transportation Center](https://depts.washington.edu/trac/). Our analysis is
based on [The Waiting Time Paradox, or, Why Is My Bus Always Late?](https://jakevdp.github.io/blog/2018/09/13/waiting-time-paradox/#:~:text=It%20turns%20out%20that%20under,as%20the%20waiting%20time%20paradox) by VanderPlas.

+ `aqs_06-067-0010.csv`, `list_of_aqs_sites.csv`, `matched_pa_aqs.csv`, `list_of_purpleair_sensors.json`, `purpleair_AMTS` The datasets used in the study of air quality monitors were made available to us by Barkjohn of the Environmental Protection Agency. These were originally acquired by Barkjohn and collaborators from the [US Air Quality System](https://forum.airnowtech.org/t/the-aqi-equation/169) and from [PurpleAir](https://www2.purpleair.com/).
Our analysis is based on [Development and Application of a United States-Wide Correction for PM 2.5 Data Collected with the PurpleAir Sensor](https://amt.copernicus.org/articles/14/4617/2021/) by Barkjohn, Gantt, and Clements.
- `aqs_06-067-0010.csv`, `list_of_aqs_sites.csv`, `matched_pa_aqs.csv`, `list_of_purpleair_sensors.json`, `purpleair_AMTS`: The datasets used in the study of air quality monitors were made available to us by Barkjohn of the Environmental Protection Agency. These were originally acquired by Barkjohn and collaborators from the [US Air Quality System](https://forum.airnowtech.org/t/the-aqi-equation/169) and from [PurpleAir](https://www2.purpleair.com/).
Our analysis is based on [Development and Application of a United States-Wide Correction for PM 2.5 Data Collected with the PurpleAir Sensor](https://amt.copernicus.org/articles/14/4617/2021/) by Barkjohn, Gantt, and Clements.

+ `donkeys.csv` The data for the Kenyan donkey study were collected by Kate Milner on behalf of the UK Donkey Sanctuary and made available by Rougier in the [paranomo package](https://people.maths.bris.ac.uk/~mazjcr/paranomo_1.1.tar.gz).
Our analysis is based on [How to Weigh a Donkey in the Kenyan Countryside](https://doi.org/10.1111/j.1740-9713.2014.00768.x) by Milner and Rougier.
- `donkeys.csv`: The data for the Kenyan donkey study were collected by Kate Milner on behalf of the UK Donkey Sanctuary and made available by Rougier in the [paranomo package](https://people.maths.bris.ac.uk/~mazjcr/paranomo_1.1.tar.gz).
Our analysis is based on [How to Weigh a Donkey in the Kenyan Countryside](https://doi.org/10.1111/j.1740-9713.2014.00768.x) by Milner and Rougier.

+ `fake_news.csv` The hand-classified fake news data are from
[Fakenewsnet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media]() by Shu, Mahudeswaran, Wang, Lee, and Liu.
- `fake_news.csv`: The hand-classified fake news data are from
[Fakenewsnet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media]() by Shu, Mahudeswaran, Wang, Lee, and Liu.

In addition to these case studies, another 20-plus datasets were used as examples throughout the book. We acknowledge the people and organizations that made these datasets available in the order in which they appeared in the book.
In addition to these case studies, another 20-plus datasets were used as examples throughout the book. We acknowledge the people and organizations that made these datasets available in the order in which they appeared in the book.

+ `gft.csv` The data on the Google Flu Trends is available from [Gary King Dataverse](https://doi.org/10.7910/DVN/24823) and the plot made from these data is based on
[The Parable of Google Flu: Traps in Big Data Analysis](https://doi.org/10.1126/science.1248506) by Lazer, Kennedy, King, and Vespignani.
- `gft.csv`: The data on the Google Flu Trends is available from [Gary King Dataverse](https://doi.org/10.7910/DVN/24823) and the plot made from these data is based on
[The Parable of Google Flu: Traps in Big Data Analysis](https://doi.org/10.1126/science.1248506) by Lazer, Kennedy, King, and Vespignani.

+ `WikipediaExp.csv` The data for the Wikipedia experiment were made available by van de Rijt. These data were analyzed in [Experimental Study of Informal Rewards in Peer Production](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034358) by Restivo and van de Rijt.
- `WikipediaExp.csv`: The data for the Wikipedia experiment were made available by van de Rijt. These data were analyzed in [Experimental Study of Informal Rewards in Peer Production](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0034358) by Restivo and van de Rijt.

+ `co2_mm_mlo.txt` The CO2 concentrations measured at Mauna Loa by the [National Oceanic and Atmospheric Administration (NOAA)](https://www.noaa.gov/) are available from the [Global Monitoring Laboratory](https://gml.noaa.gov/obop/mlo/).
- `co2_mm_mlo.txt`: The CO2 concentrations measured at Mauna Loa by the [National Oceanic and Atmospheric Administration (NOAA)](https://www.noaa.gov/) are available from the [Global Monitoring Laboratory](https://gml.noaa.gov/obop/mlo/).

+ `pm30.csv` These air quality measurements were downloaded for one day and one sensor from the [PurpleAir Map](https://www2.purpleair.com/).
- `pm30.csv`: These air quality measurements were downloaded for one day and one sensor from the [PurpleAir Map](https://www2.purpleair.com/).

+ `babynames.csv` The [US Social Security Department](https://www.ssa.gov/oact/babynames/index.html) provides the names from all Social Security card applications.
- `babynames.csv`: The [US Social Security Department](https://www.ssa.gov/oact/babynames/index.html) provides the names from all Social Security card applications.

+ `DAWN-Data.txt` The [2011 DAWN](https://www.datafiles.samhsa.gov/dataset/drug-abuse-warning-network-2011-dawn-2011-ds0001) survey of drug-related emergency room visits is administered by the [U.S. Substance Abuse and Medical Health Services Administration](https://www.samhsa.gov/).
- `DAWN-Data.txt`: The [2011 DAWN](https://www.datafiles.samhsa.gov/dataset/drug-abuse-warning-network-2011-dawn-2011-ds0001) survey of drug-related emergency room visits is administered by the [U.S. Substance Abuse and Medical Health Services Administration](https://www.samhsa.gov/).

+ `businesses.csv`, `inspections.csv`, `violations.csv` The data on restaurant inspection scores in San Francisco is from [DataSF](https://datasf.org/).
- `businesses.csv`, `inspections.csv`, `violations.csv`: The data on restaurant inspection scores in San Francisco is from [DataSF](https://datasf.org/).

+ `akc.csv` The data on dog breeds come from the Information is Beautiful [Best in Show](https://www.informationisbeautiful.net/visualizations/best-in-show-whats-the-top-data-dog/) visualization and was originally acquired from the [American Kennel Club](https://www.akc.org/).
- `akc.csv`: The data on dog breeds come from the Information is Beautiful [Best in Show](https://www.informationisbeautiful.net/visualizations/best-in-show-whats-the-top-data-dog/) visualization and was originally acquired from the [American Kennel Club](https://www.akc.org/).

+ `sfhousing.csv` The housing sale prices for the San Francisco Bay Area were scraped from the [San Francisco Chronicle](https://www.sfchronicle.com/realestate/) real estate pages.
- `sfhousing.csv`: The housing sale prices for the San Francisco Bay Area were scraped from the [San Francisco Chronicle](https://www.sfchronicle.com/realestate/) real estate pages.

+ `cherryBlossomMen.csv` The run times in the annual [Cherry Blossom 10 mile Run](https://www.cherryblossom.org/) were scraped from the race results pages.
- `cherryBlossomMen.csv`: The run times in the annual [Cherry Blossom 10 mile Run](https://www.cherryblossom.org/) were scraped from the race results pages.

+ `earnings2020.csv` The weekly earnings data are made available by the [U.S. Bureau of Labor Statistics](https://www.bls.gov/opub/reports/womens-earnings/2020/home.htm).
- `earnings2020.csv`: The weekly earnings data are made available by the [U.S. Bureau of Labor Statistics](https://www.bls.gov/opub/reports/womens-earnings/2020/home.htm).

+ `co2_by_country.csv` The annual country CO2 emissions is available from [Our World in Data](https://ourworldindata.org/).
- `co2_by_country.csv`: The annual country CO2 emissions is available from [Our World in Data](https://ourworldindata.org/).

+ `100m_sprint.csv` The times for the 100 meter sprint are from [FiveThirtyEight](https://fivethirtyeight.com/) and the figure is based on
[The fastest men in the world are still chasing Usain bolt](https://fivethirtyeight.com/features/the-fastest-men-in-the-world-are-still-chasing-usain-bolt/) by Planos.
- `100m_sprint.csv`: The times for the 100 meter sprint are from [FiveThirtyEight](https://fivethirtyeight.com/) and the figure is based on
[The fastest men in the world are still chasing Usain bolt](https://fivethirtyeight.com/features/the-fastest-men-in-the-world-are-still-chasing-usain-bolt/) by Planos.

+ `stateoftheunion1790-2022.txt` The State of the Union Addresses are compiled from the [American Presidency Project](https://www.presidency.ucsb.edu/documents/app-categories/spoken-addresses-and-remarks/presidential/state-the-union-addresses).
- `stateoftheunion1790-2022.txt`: The State of the Union Addresses are compiled from the [American Presidency Project](https://www.presidency.ucsb.edu/documents/app-categories/spoken-addresses-and-remarks/presidential/state-the-union-addresses).

+ `'CDS_ERA5_22-12.nc'` These data were collected from the [Climate Data Store](https://cds.climate.copernicus.eu/), which is supported by the [European Centre for Medium-Range Weather Forecasts](https://www.ecmwf.int/).
- `CDS_ERA5_22-12.nc`: These data were collected from the [Climate Data Store](https://cds.climate.copernicus.eu/), which is supported by the [European Centre for Medium-Range Weather Forecasts](https://www.ecmwf.int/).

+ 'world_record_1500m.csv' Wikipedia 1500 meter world records were scraped from the Wikipedia page [1500 metres world record progression](https://en.wikipedia.org/wiki/1500_metres_world_record_progression).
- `world_record_1500m.csv`: Wikipedia 1500 meter world records were scraped from the Wikipedia page [1500 metres world record progression](https://en.wikipedia.org/wiki/1500_metres_world_record_progression).

+ 'the_clash.csv' The Clash songs are obtained using the [Spotify Web API](https://developer.spotify.com/documentation/web-api).
The retrieval of the data follows [Exploring the Spotify API in Python](https://stmorse.github.io/journal/spotify-api.html) by Morse.
- `the_clash.csv`: The Clash songs are obtained using the [Spotify Web API](https://developer.spotify.com/documentation/web-api).
The retrieval of the data follows [Exploring the Spotify API in Python](https://stmorse.github.io/journal/spotify-api.html) by Morse.

+ `catalog.xml` The XML plant catalog document is from the [W3 School Plant catalog](https://www.w3schools.com/xml/plant_catalog.xml).
- `catalog.xml`: The XML plant catalog document is from the [W3 School Plant catalog](https://www.w3schools.com/xml/plant_catalog.xml).

+ 'ECB_EU_exchange.csv' The exchange rates are available from the [European Central Bank](https://www.ecb.europa.eu/stats/eurofxref/).
- `ECB_EU_exchange.csv`: The exchange rates are available from the [European Central Bank](https://www.ecb.europa.eu/stats/eurofxref/).

+ `mobility.csv` These data were made available at [Opportunity Insights](https://opportunityinsights.org/paper/land-of-opportunity/) and our example follows
[Where Is the Land of Opportunity? The Geography of Intergenerational Mobility in the United States](https://doi.org/10.1093/qje/qju022) by Chetty, Hendren, Kline, and Saez.
- `mobility.csv`: These data were made available at [Opportunity Insights](https://opportunityinsights.org/paper/land-of-opportunity/) and our example follows
[Where Is the Land of Opportunity? The Geography of Intergenerational Mobility in the United States](https://doi.org/10.1093/qje/qju022) by Chetty, Hendren, Kline, and Saez.

+ `utilities.csv` The home energy consumption data is available from [Kaplan](https://www.key2stats.com/Utility_bills_1294_92.csv) and appeared in his first edition of [*Statistical Modeling: A fresh approach*](https://dtkaplan.github.io/SM2-bookdown/preface-to-this-electronic-version.html).
- `utilities.csv`: The home energy consumption data is available from [Kaplan](https://www.key2stats.com/Utility_bills_1294_92.csv) and appeared in his first edition of [_Statistical Modeling: A fresh approach_](https://dtkaplan.github.io/SM2-bookdown/preface-to-this-electronic-version.html).

+ `market-analysis.csv` These data were provided by Lipovetsky, and they correspond to the data used in his paper [Regressions Regularized by Correlations](https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=2530&context=jmasm).
- `market-analysis.csv`: These data were provided by Lipovetsky, and they correspond to the data used in his paper [Regressions Regularized by Correlations](https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=2530&context=jmasm).

+ `crabs.data` The crab measurements are from the [California Department of Fish and Wildlife](https://wildlife.ca.gov/) and made available online at the [Stat Labs Data](https://www.stat.berkeley.edu/users/statlabs/data/crabs.data) repository.
- `crabs.data`: The crab measurements are from the [California Department of Fish and Wildlife](https://wildlife.ca.gov/) and made available online at the [Stat Labs Data](https://www.stat.berkeley.edu/users/statlabs/data/crabs.data) repository.

+ `black_spruce.csv` The wind-damaged tree data were collected by Rich for his thesis [Large wind disturbance in the Boundary Waters Canoe Area Wilderness. Forest dynamics and development changes associated with the July 4th 1999 blowdown](https://www.proquest.com/docview/305463532?pq-origsite=gscholar&fromopenview=true) and made available online in the [alr4 package](https://cran.r-project.org/web/packages/alr4/alr4.pdf).
The analysis is based on Chapter 12 of [Applied Linear Regression](https://doi.org/10.1002/0471704091) by Weisberg.
- `black_spruce.csv`: The wind-damaged tree data were collected by Rich for his thesis [Large wind disturbance in the Boundary Waters Canoe Area Wilderness. Forest dynamics and development changes associated with the July 4th 1999 blowdown](https://www.proquest.com/docview/305463532?pq-origsite=gscholar&fromopenview=true) and made available online in the [alr4 package](https://cran.r-project.org/web/packages/alr4/alr4.pdf). The analysis is based on Chapter 12 of [Applied Linear Regression](https://doi.org/10.1002/0471704091) by Weisberg.

[github]: https://github.com/DS-100/textbook/
