Clean up code for ch2 and ch3
SamLau95 committed May 23, 2023
1 parent 32e2f86 commit e8b15c4
Showing 7 changed files with 85,855 additions and 85,768 deletions.
6 changes: 3 additions & 3 deletions content/_static/custom.css
Original file line number Diff line number Diff line change
@@ -210,8 +210,8 @@ img {
}

/* don't top-align footnotes */
.footnote-reference {
vertical-align: baseline;
a.footnote-reference {
vertical-align: initial;
}

/* Make bibliography font smaller */
@@ -223,7 +223,7 @@
* MathJax
*****************************************************************************/

.MathJax_Display {
div.math {
font-size: 1.21rem;
}

37 changes: 24 additions & 13 deletions content/ch/02/data_scope_big_data_hubris.ipynb
@@ -18,50 +18,56 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(sec:scope_bigdata)=\n",
"# Big Data and New Opportunities \n",
"\n",
"The tremendous increase in openly available data has created new roles and opportunities in data science. For example, data journalists look for interesting stories in data much like how traditional beat reporters hunt for news stories. The data lifecycle for the data journalist begins with the search for existing data that might have an interesting story, rather than beginning with a research question and looking for how to collect new or use existing data to address the question. "
"# Big Data and New Opportunities\n",
"\n",
"The tremendous increase in openly available data has created new roles and opportunities in data science. For example, data journalists look for interesting stories in data much like how traditional beat reporters hunt for news stories. The data lifecycle for the data journalist begins with the search for existing data that might have an interesting story, rather than beginning with a research question and looking for how to collect new or use existing data to address the question.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Citizen science projects are another example. They engage many people (and instruments) in data collection. Collectively, these data are made available to researchers who organize the project and often they are made available in repositories for the general public to further investigate. "
"Citizen science projects are another example. They engage many people (and instruments) in data collection. Collectively, these data are made available to researchers who organize the project and often they are made available in repositories for the general public to further investigate.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The availability of administrative and organizational data creates other opportunities. Researchers can link data collected from scientific studies with, say, medical data that have been collected for healthcare purposes; these administrative data have been collected for reasons that don't directly stem from the question of interest, but they can be useful in other settings. Such linkages can help data scientists expand the possibilities of their analyses and cross-check the quality of their data. In addition, found data can include digital traces, such as your web-browsing activity, posts on social media, and your online network of friends and acquaintances, and they can be quite complex. "
"The availability of administrative and organizational data creates other opportunities. Researchers can link data collected from scientific studies with, say, medical data that have been collected for healthcare purposes; these administrative data have been collected for reasons that don't directly stem from the question of interest, but they can be useful in other settings. Such linkages can help data scientists expand the possibilities of their analyses and cross-check the quality of their data. In addition, found data can include digital traces, such as your web-browsing activity, posts on social media, and your online network of friends and acquaintances, and they can be quite complex.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"When we have large amounts of administrative data or expansive digital traces, it can be tempting to treat them as more definitive than data collected from traditional smaller research studies. We might even consider these large datasets as a replacement for scientific studies and essentially a census. This over-reach is referred to as the [\"big data hubris\"](https://doi.org/10.1126/science.1248506). Data with a large scope does not mean that we can ignore foundational issues of how representative the data are, nor can we ignore issues with measurement, dependency, and reliability. (And it can be easy to discover meaningless or nonsensical relationships just by coincidence.) One well-known example is the Google Flu Trends tracking system. "
"When we have large amounts of administrative data or expansive digital traces, it can be tempting to treat them as more definitive than data collected from traditional smaller research studies. We might even consider these large datasets as a replacement for scientific studies and essentially a census. This over-reach is referred to as the [\"big data hubris\"](https://doi.org/10.1126/science.1248506). Data with a large scope does not mean that we can ignore foundational issues of how representative the data are, nor can we ignore issues with measurement, dependency, and reliability. (And it can be easy to discover meaningless or nonsensical relationships just by coincidence.) One well-known example is the Google Flu Trends tracking system.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: Google Flu Trends\n",
"\n",
"[Digital epidemiology](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5754279/), a new subfield of epidemiology, leverages data generated outside the public health system to study patterns of disease and health dynamics in populations[^nih].\n",
"[Digital epidemiology](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5754279/), a new subfield of epidemiology, leverages data generated outside the public health system to study patterns of disease and health dynamics in populations.\n",
"The Google Flu Trends (GFT) tracking system was one of the earliest examples of digital epidemiology.\n",
"In 2007, researchers found that counting the searches people made for flu-related\n",
"terms could accurately estimate the number of flu cases.\n",
"This apparent success made headlines, and many researchers became excited about the possibilities of big data.\n",
"However, GFT did not live up to expectations and was abandoned in 2015.\n",
"\n",
"What went wrong? After all, GFT used millions of digital traces from online queries for terms related to influenza to predict flu activity. Despite initial success, in the 2011–2012 flu season, Google's data scientists found that GFT was not a substitute for the more traditional surveillance reports of three-week old counts collected by the Centers for Disease Control (CDC) from laboratories across the United States. In comparison, GFT overestimated the CDC numbers for 100 out of 108 weeks. Week after week, GFT came in too high for the cases of influenza, even though it was based on big data: "
"What went wrong? After all, GFT used millions of digital traces from online queries for terms related to influenza to predict flu activity. Despite initial success, in the 2011–2012 flu season, Google's data scientists found that GFT was not a substitute for the more traditional surveillance reports of three-week-old counts collected by the Centers for Disease Control (CDC) from laboratories across the United States. In comparison, GFT overestimated the CDC numbers for 100 out of 108 weeks. Week after week, GFT came in too high for the cases of influenza, even though it was based on big data:\n"
]
},
{
@@ -2830,42 +2836,47 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"From weeks 412 to 519 in this plot, GFT (solid line) over estimated the actual CDC reports (dashed line) 100 times. Also plotted here are predictions from a model based on 3-week old CDC data and seasonal trends (dotted line), which follows the actuals more closely than GFT. "
"From weeks 412 to 519 in this plot, GFT (solid line) overestimated the actual CDC reports (dashed line) 100 times. Also plotted here are predictions from a model based on 3-week-old CDC data and seasonal trends (dotted line), which follows the actuals more closely than GFT.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Data scientists found that a simple model built from past CDC reports that used 3-week-old CDC data and seasonal trends did a better job of predicting flu prevalence than GFT. The GFT overlooked considerable information that can be extracted by basic statistical methods. This does not mean that big data captured from online activity is useless. In fact, [researchers](https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/#:~:text=The%20paper%20demonstrated%20that%20search,into%20potentially%20life%2Dsaving%20insights.) have shown that the combination of GFT data with CDC data can substantially improve on both GFT predictions and the CDC-based model. It is often the case that combining different approaches leads to improvements over individual methods."
"Data scientists found that a simple model built from past CDC reports that used 3-week-old CDC data and seasonal trends did a better job of predicting flu prevalence than GFT. The GFT overlooked considerable information that can be extracted by basic statistical methods. This does not mean that big data captured from online activity is useless. In fact, [researchers have shown](https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/#:~:text=The%20paper%20demonstrated%20that%20search,into%20potentially%20life%2Dsaving%20insights.) that the combination of GFT data with CDC data can substantially improve on both GFT predictions and the CDC-based model. It is often the case that combining different approaches leads to improvements over individual methods.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The GFT example shows us that even when we have tremendous amounts of information, the connections between the data and the question being asked are paramount. Understanding this framework can help us avoid answering the wrong question, applying inappropriate methods to the data, and overstating our findings. "
"The GFT example shows us that even when we have tremendous amounts of information, the connections between the data and the question being asked are paramount. Understanding this framework can help us avoid answering the wrong question, applying inappropriate methods to the data, and overstating our findings.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"In the age of big data, we are tempted to collect more and more data to answer a question precisely. After all, a census gives us perfect information, so shouldn't big data be nearly perfect? Unfortunately, this is often not the case, especially with administrative data and digital traces. The inaccessibility of a small fraction of the people you want to study (see the 2016 election upset in {numref}`Chapter %s <ch:theory_datadesign>`) or the measurement process itself (as in this GFT example) can lead to poor predictions. It is important to consider the scope of the data as it relates to the question under investigation. \n",
"In the age of big data, we are tempted to collect more and more data to answer a question precisely. After all, a census gives us perfect information, so shouldn't big data be nearly perfect? Unfortunately, this is often not the case, especially with administrative data and digital traces. The inaccessibility of a small fraction of the people you want to study (see the 2016 election upset in {numref}`Chapter %s <ch:theory_datadesign>`) or the measurement process itself (as in this GFT example) can lead to poor predictions. It is important to consider the scope of the data as it relates to the question under investigation.\n",
"\n",
":::"
":::\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A key factor to keep in mind is the scope of the data. Scope includes considering the population we want to study, how to access information about that population, and what we are actually measuring. Thinking through these points can help us see potential gaps in our approach. This is the topic of the next section."
"A key factor to keep in mind is the scope of the data. Scope includes considering the population we want to study, how to access information about that population, and what we are actually measuring. Thinking through these points can help us see potential gaps in our approach. We investigate this in the next section.\n"
]
}
],
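The last cells in the diff above note that combining GFT predictions with a CDC-based model improves on either approach alone. A minimal sketch of that idea with a weighted average of two prediction series (all numbers here are made up for illustration; the published work fit the combination by regression rather than using a fixed weight):

```python
import numpy as np

# Hypothetical weekly flu-prevalence predictions from two models
gft_pred = np.array([2.9, 3.4, 4.1, 4.8])      # search-based model, tends to run high
cdc_lag_pred = np.array([2.2, 2.6, 3.3, 3.9])  # 3-week-old CDC data + seasonal trend
actual = np.array([2.4, 2.9, 3.5, 4.2])        # true values, for scoring

def combine(a, b, w=0.5):
    """Weighted average of two prediction series."""
    return w * a + (1 - w) * b

def mae(pred, truth):
    """Mean absolute error of a prediction series."""
    return np.mean(np.abs(pred - truth))

combined = combine(gft_pred, cdc_lag_pred, w=0.4)

# With these illustrative numbers, the blend beats both of its inputs
print(mae(gft_pred, actual), mae(cdc_lag_pred, actual), mae(combined, actual))
```

The weight `w=0.4` is arbitrary here; the point is only that a biased-high series and a lagged series can offset each other's errors, which is the general reason combining different approaches often improves on individual methods.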
171,202 changes: 85,607 additions & 85,595 deletions content/ch/03/theory_election.ipynb

Large diffs are not rendered by default.

79 changes: 33 additions & 46 deletions content/ch/03/theory_measurement_error.ipynb
@@ -18,6 +18,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -28,13 +29,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The sensors measure the amount of particulate matter in the air that has a diameter smaller than 2.5 micrometers (the unit of measurement is micrograms per cubic meter: μg/m3). The measurements recorded are the average concentrations over 2 minutes. While the level of particulate matter changes over the course of a day as, for example, people commute to and from work, there are certain times of the day, like at midnight, when we expect the 2-minute averages to change little in a half-hour. If we examine the measurements taken during these times of the day, we can get a sense of the combined variability in the instrument recordings and the mixing of particles in the air. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -164,6 +167,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -1291,31 +1295,26 @@
"source": [
"fig = px.line(pm, x=\"time\", y=\"aq2.5\", color=\"hour\")\n",
"\n",
"fig.add_annotation(x=12, y=5,\n",
" text=\"midnight\", showarrow=False)\n",
"fig.add_annotation(x=24, y=6,\n",
" text=\"11 am\", showarrow=False)\n",
"fig.add_annotation(x=36, y=8,\n",
" text=\"7 pm\", showarrow=False)\n",
"\n",
"fig.update_xaxes(showticklabels = False)\n",
"fig.add_annotation(x=12, y=5, text=\"midnight\", showarrow=False)\n",
"fig.add_annotation(x=24, y=6, text=\"11 am\", showarrow=False)\n",
"fig.add_annotation(x=36, y=8, text=\"7 pm\", showarrow=False)\n",
"\n",
"fig.update_xaxes(showticklabels=False)\n",
"fig.update_layout(width=500, height=250, showlegend=False)\n",
"\n",
"\n",
"fig.update_xaxes(showticklabels = False)\n",
"\n",
"fig.show()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot shows us how the air quality worsens throughout the day, but in each of these half-hour intervals, the air quality is roughly constant at 5.4, 6.6, and 8.6 μg/m3 at midnight, eleven in the morning, and seven in the evening, respectively. We can think of the data scope as follows: at this particular location in a specific half-hour time interval, there is an average particle concentration in the air surrounding the sensor. This concentration is our target, and our instrument, the sensor, takes many measurements that form a sample from the access frame. (See {numref}`Chapter %s <ch:data_scope>` for the dart board analogy of this process). If the instrument is working properly, the measurements are centered around the target--the 30-minute average. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -2328,13 +2327,14 @@
"source": [
"fig = px.histogram(pm, x='diff30', nbins=20,\n",
" labels={'diff30':'Deviation from 30-Minute Median'},\n",
" color_discrete_sequence=[\"lightgrey\"])\n",
" color_discrete_sequence=[\"lightgrey\"])\n",
"\n",
"fig.update_xaxes(range=[-1.7, 1.8])\n",
"fig.update_layout(width=350, height=250) "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -2362,13 +2362,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Given the hourly measurements range from 5 to 9 μg/m3, the relative error is 8% to 12%, which is reasonably accurate."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -2396,6 +2398,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -3567,35 +3570,31 @@
"source": [
"times = np.arange(1, 16)\n",
"\n",
"fig = px.line(pm, x=\"time\", y=\"aq2.5\", color=\"hour\",\n",
" labels={\n",
" \"time\": \"2-minute intervals in a half hour\",\n",
" \"aq2.5\": \"Particulate Matter (2-min avg)\"\n",
" },)\n",
"\n",
"fig.add_trace(go.Scatter(x=times, y=aq_imitate,\n",
" mode='lines'))\n",
"\n",
"fig.add_annotation(x=12, y=5,\n",
" text=\"midnight\", showarrow=False)\n",
"fig = px.line(\n",
" pm,\n",
" x=\"time\",\n",
" y=\"aq2.5\",\n",
" color=\"hour\",\n",
" labels={\n",
" \"time\": \"2-minute intervals in a half hour\",\n",
" \"aq2.5\": \"Particulate Matter (2-min avg)\",\n",
" },\n",
")\n",
"\n",
"fig.add_annotation(x=24, y=6,\n",
" text=\"11 am\", showarrow=False)\n",
"\n",
"fig.add_annotation(x=36, y=8,\n",
" text=\"7 pm\", showarrow=False)\n",
"\n",
"fig.add_annotation(x=52, y=9.7,\n",
" text=\"simulated\", showarrow=False)\n",
"\n",
"fig.update_xaxes(showticklabels = False)\n",
"fig.add_trace(go.Scatter(x=times, y=aq_imitate, mode=\"lines\"))\n",
"\n",
"fig.add_annotation(x=12, y=5, text=\"midnight\", showarrow=False)\n",
"fig.add_annotation(x=24, y=6, text=\"11 am\", showarrow=False)\n",
"fig.add_annotation(x=36, y=8, text=\"7 pm\", showarrow=False)\n",
"fig.add_annotation(x=52, y=9.7, text=\"simulated\", showarrow=False)\n",
"fig.update_xaxes(showticklabels=False)\n",
"fig.update_layout(width=500, height=250, showlegend=False)\n",
"\n",
"fig.show()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -3613,21 +3612,9 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4"
}
},
"nbformat": 4,
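One cell in the diff above states that, given hourly measurements ranging from 5 to 9 μg/m3, the relative error is 8% to 12%. That range can be checked with a quick back-of-the-envelope calculation. A minimal sketch, assuming the SD of the 2-minute deviations is about 0.65 μg/m3 (a hypothetical value standing in for the SD computed in a notebook cell not shown in this diff):

```python
# Rough check of the relative-error range quoted in the notebook.
# sd is an assumed stand-in for the SD of deviations from the 30-minute median.
sd = 0.65  # μg/m3, hypothetical

# Approximate half-hour average concentrations at midnight, 11 am, and 7 pm,
# taken from the notebook text
averages = [5.4, 6.6, 8.6]  # μg/m3

for avg in averages:
    rel_err = sd / avg  # relative error = measurement SD / true level
    print(f"{avg:4.1f} μg/m3 -> relative error {rel_err:.0%}")

# With sd = 0.65, the relative errors fall roughly in the 8%-12% range,
# matching the figure quoted in the notebook
```

The takeaway is that a fixed measurement SD translates into a larger relative error at low concentrations (midnight) than at high ones (evening), which is why the notebook reports a range rather than a single number.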
