added intro to HT section of Ch 17
debnolan committed May 15, 2023
1 parent 644f72b commit d6eac45
Showing 6 changed files with 83 additions and 59 deletions.
32 changes: 16 additions & 16 deletions content/ch/17/inf_pred_gen_CI.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -115,24 +115,13 @@
"source": [
"We use the normal confidence interval when the sampling distribution is well-approximated by a normal curve. For a normal probability distribution, with center $\\mu$ and spread $\\sigma$, there is a 95% chance that a random value from this distribution is in the interval $\\mu ~\\pm ~ 1.96 \\sigma$. Since the center of the sampling distribution is typically $\\theta^*$, the chance is 95% that for a randomly generated $\\hat{\\theta}$: \n",
"\n",
"$$|\\hat{\\theta} -\\theta^*| \\leq 1.96 SD(\\hat{\\theta}),$$\n",
"$$|\\hat{\\theta} -\\theta^*| \\leq 1.96 SE(\\hat{\\theta}),$$\n",
"\n",
"where $SD(\\hat{\\theta})$ is the spread of the sampling distribution of $\\hat{\\theta}$. We use this inequality to make a 95% confidence interval for $\\theta^*$:\n",
"where $SE(\\hat{\\theta})$ is the spread of the sampling distribution of $\\hat{\\theta}$. We use this inequality to make a 95% confidence interval for $\\theta^*$:\n",
"\n",
"$$ [ \\hat{\\theta} ~-~ 1.96 SD(\\hat{\\theta}),~~~ \\hat{\\theta} ~ +~ 1.96 SD(\\hat{\\theta})]$$\n",
"$$ [ \\hat{\\theta} ~-~ 1.96 SE(\\hat{\\theta}),~~~ \\hat{\\theta} ~ +~ 1.96 SE(\\hat{\\theta})]$$\n",
"\n",
"Other size confidence intervals can be formed with different multiples of $SD(\\hat{\\theta})$, all based on the normal curve. For example, a 99% confidence interval is $\\pm 2.58 SE$, and a one-sided upper 95% confidence interval is $[ \\hat{\\theta} ~-~ 1.64 SE(\\hat{\\theta}),~~ \\infty]$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"Confidence intervals can be easily misinterpreted as the chance that the parameter $\\theta^*$ is in the interval. However, the confidence interval is created from one realization of the sampling distribution. The sampling distribution gives us a different probability statement, 95% of the time, an interval constructed in this way will contain $\\theta^*$. Unfortunately, we don't know whether this particular time is one of those that happens 95 times in 100, or not. That is why, the term \"confidence\" is used rather than \"probability\" or \"chance\", and we say that we are 95% confident that the parameter is in our interval. \n",
"\n",
":::"
"Other size confidence intervals can be formed with different multiples of $SE(\\hat{\\theta})$, all based on the normal curve. For example, a 99% confidence interval is $\\pm 2.58 SE$, and a one-sided upper 95% confidence interval is $[ \\hat{\\theta} ~-~ 1.64 SE(\\hat{\\theta}),~~ \\infty]$."
]
},
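A quick numeric sketch of these interval formulas may help; the estimate and standard error below are made-up illustrative values, not ones computed from the book's data:

```python
# Hypothetical point estimate and standard error -- illustrative values only
theta_hat = 0.21
se_theta = 0.02

# 95% two-sided normal interval: theta_hat +/- 1.96 * SE
ci_95 = (theta_hat - 1.96 * se_theta, theta_hat + 1.96 * se_theta)

# 99% two-sided interval uses the multiplier 2.58
ci_99 = (theta_hat - 2.58 * se_theta, theta_hat + 2.58 * se_theta)

# One-sided upper 95% interval: [theta_hat - 1.64 * SE, infinity)
lower_bound = theta_hat - 1.64 * se_theta

print(round(ci_95[0], 4), round(ci_95[1], 4))  # 0.1708 0.2492
```

Only the multiplier changes from one interval to another; the center and standard error stay the same.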
{
Expand All @@ -157,7 +146,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Confidence intervals for a coefficient\n",
"## Confidence Intervals for a Coefficient\n",
"\n",
"Earlier in this chapter we tested the hypothesis that the coefficient for humidity in a linear model for air quality is 0. The fitted coefficient for these data was $0.21$. Since the null model did not completely specify the data generation mechanism, we resorted to bootstrapping. That is, we used the data as the population, took a sample of 11,226 records with replacement from the bootstrap population, and fitted the model to find the bootstrap sample coefficient for humidity. Our simulation repeated this process 10,000 times, to get an approximate bootstrap sampling distribution."
]
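The bootstrap loop described above might be sketched as follows. The data here are synthetic stand-ins: the sample size, true coefficient, noise level, and variable names are assumptions for illustration, not the case study's actual records or model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in data: y depends weakly on humidity (hypothetical values)
n = 500
humidity = rng.uniform(20, 95, size=n)
y = 10 + 0.21 * humidity + rng.normal(0, 5, size=n)

def fit_slope(x, y):
    # least-squares slope of y on x (model with an intercept)
    x_bar, y_bar = x.mean(), y.mean()
    return ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()

boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    boot_slopes.append(fit_slope(humidity[idx], y[idx]))
boot_slopes = np.array(boot_slopes)

# percentile 95% bootstrap confidence interval for the coefficient
lower, upper = np.percentile(boot_slopes, [2.5, 97.5])
```

Each pass treats the observed data as the bootstrap population, refits the model, and records the coefficient; the spread of the recorded coefficients approximates the sampling distribution.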
Expand Down Expand Up @@ -258,6 +247,17 @@
"There are other versions of the normal-based confidence interval that reflect the variability in estimating the standard error of the sampling distribution using the SD of the data. And still other confidence intervals for statistics that are percentiles, rather than averages. (Also note that for permutation tests, the bootstrap tends not to be as accurate as normal approximations.) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"Confidence intervals can be easily misinterpreted as the chance that the parameter $\\theta^*$ is in the interval. However, the confidence interval is created from one realization of the sampling distribution. The sampling distribution gives us a different probability statement, 95% of the time, an interval constructed in this way will contain $\\theta^*$. Unfortunately, we don't know whether this particular time is one of those that happens 95 times in 100, or not. That is why, the term \"confidence\" is used rather than \"probability\" or \"chance\", and we say that we are 95% confident that the parameter is in our interval. \n",
"\n",
":::"
]
},
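The 95-in-100 interpretation can be checked with a small simulation: repeatedly draw a sample, form the normal interval for the population mean, and count how often the interval covers the true parameter. The population, noise level, and sample size below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

pop_mean = 50.0   # the true parameter theta*
n = 100           # sample size
trials = 2000

covered = 0
for _ in range(trials):
    sample = rng.normal(pop_mean, 10, size=n)
    center = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)  # estimated SE of the mean
    # does this realization's 95% interval contain the true mean?
    if center - 1.96 * se <= pop_mean <= center + 1.96 * se:
        covered += 1

coverage = covered / trials  # should be close to 0.95
```

Any single interval either contains the parameter or it doesn't; the 95% refers to the long-run fraction of intervals, across realizations, that do.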
{
"cell_type": "markdown",
"metadata": {},
Expand Down
46 changes: 31 additions & 15 deletions content/ch/17/inf_pred_gen_HT.ipynb

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions content/ch/17/inf_pred_gen_PI.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion content/ch/17/inf_pred_gen_boot.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -102,7 +102,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Boostrapping a test for a regression coefficient \n",
"## Boostrapping a Test for a Regression Coefficient \n",
"\n",
"The case study on calibrating air quality monitors (see {numref}`Chapter %s <ch:pa>`) fitted a model to adjust the measurements from an inexpensive monitor to more accurately reflected true air quality. This adjustment included a term in the model related to humidity. The fitted coefficient was about $0.2$, so that on days of high humidity the measurement is adjusted upward more than on days of low humidity. However, this coefficient is close to 0, and we might wonder whether including humidity in the model is really needed. In other words, we want to test the hypothesis that the coefficient for humidity in the linear model is 0. Unfortunately, we can't fully specify the model because it is based on measurements taken over a particular time period from a set of air monitors (both PurpleAir and those maintained by the EPA). This is where the bootstrap can help. "
]
Expand Down
45 changes: 23 additions & 22 deletions content/ch/17/inf_pred_gen_prob.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -41,20 +41,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"```{figure} TriptychRank.png\n",
"---\n",
"name: triptychRank\n",
"---\n",
"\n",
"This diagram shows the population, sampling, and sample distributions and their summaries from the Wikipedia example. In this example, the population is known to consist of the integers from 1 to 200, and the sample are the ranks of the observed post-productivity measurements for the treatment group. In the middle, the sampling distribution of the average rank is created from a simulation study. Notice it is normal in shape with a center that matches the population average. \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Formalizing the theory for average rank statistics\n",
"## Formalizing the Theory for Average rank statistics\n",
"\n",
"Recall in the Wikipedia experiment, we pooled the post-award productivity values from the treatment and control groups and converted them into ranks, $1, 2, 3, \\ldots, 200$ so the population is simply made up of the integers from 1 to 200. {numref}`Figure %s <triptychRank>` is a diagram that represents this specific situation. Notice that the population distribution is flat and ranges from 1 to 200 (leftside of {numref}`Figure %s <triptychRank>`). Also, the population summary (called *population parameter*) we use is the average rank:\n",
"\n",
Expand All @@ -73,6 +60,19 @@
"The SD(pop) represents the typical deviation of a rank from the population average. To calculate SD(pop) for this example takes some mathematical handiwork."
]
},
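That mathematical handiwork can be checked numerically. For the integers $1, \ldots, N$, the average is $(N+1)/2$ and the SD has the closed form $\sqrt{(N^2-1)/12}$ (a standard result, stated here as a check rather than taken from the book):

```python
import math

N = 200
population = range(1, N + 1)

# population parameter: the average rank, (N + 1) / 2
avg_rank = sum(population) / N

# population SD computed directly from the definition
var_pop = sum((x - avg_rank) ** 2 for x in population) / N
sd_pop = math.sqrt(var_pop)

# closed-form check: SD of 1..N is sqrt((N**2 - 1) / 12)
sd_formula = math.sqrt((N ** 2 - 1) / 12)

print(avg_rank, round(sd_pop, 2))  # 100.5 57.73
```

The direct computation and the closed form agree, so SD(pop) for the Wikipedia ranks is about 57.7.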
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{figure} TriptychRank.png\n",
"---\n",
"name: triptychRank\n",
"---\n",
"\n",
"This diagram shows the population, sampling, and sample distributions and their summaries from the Wikipedia example. In this example, the population is known to consist of the integers from 1 to 200, and the sample are the ranks of the observed post-productivity measurements for the treatment group. In the middle, the sampling distribution of the average rank is created from a simulation study. Notice it is normal in shape with a center that matches the population average. \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
Expand Down Expand Up @@ -221,7 +221,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## General properties of random variables\n",
"## General Properties of Random Variables\n",
"\n",
"In general, a *random variable* represents a numeric outcome of a chance event. In this book, we use capital letters like $X$ or $Y$ or $Z$ to denote a random variable. The probability distribution for $X$ is the specification, $\\mathbb{P}(X = x) = p_x$ for all values $x$ that the random variable takes on. "
]
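As a concrete instance of these definitions, consider a fair six-sided die (a standard textbook example, not one drawn from this chapter): $\mathbb{P}(X = x) = 1/6$ for $x = 1, \ldots, 6$, and the expectation and variance follow directly by summing over the distribution:

```python
from fractions import Fraction

# probability distribution of a fair die: P(X = x) = 1/6 for x in 1..6
outcomes = range(1, 7)
p = Fraction(1, 6)

# E[X] = sum of x * P(X = x)
expectation = sum(x * p for x in outcomes)

# Var(X) = sum of (x - E[X])^2 * P(X = x)
variance = sum((x - expectation) ** 2 * p for x in outcomes)

print(expectation, variance)  # 7/2 35/12
```

Using exact fractions makes it easy to confirm the textbook values $\mathbb{E}[X] = 7/2$ and $\mathbb{V}(X) = 35/12$.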
Expand Down Expand Up @@ -342,7 +342,7 @@
"\\end{aligned}\n",
"$$\n",
"\n",
"Notice that while the expected value is the same as when the draws are without replacement, the variance and SD are smaller. These quantities are adjusted by $(N-n/(N-1)$, which is called the *finite population correction factor*. We used this formula earlier to compute the $SD(\\hat{\\theta})$ in our Wikipedia example. \n",
"Notice that while the expected value is the same as when the draws are without replacement, the variance and SD are smaller. These quantities are adjusted by $(N-n)/(N-1)$, which is called the *finite population correction factor*. We used this formula earlier to compute the $SD(\\hat{\\theta})$ in our Wikipedia example. \n",
"\n",
"Returning to {numref}`Figure %s <triptychRank>`, we see that the sampling distribution for $\\bar{X}$ in the center of the diagram has an expectation that matches the population average; the SD decreases like $1/\\sqrt{n}$ but even faster because we are drawing without replacement; and the distribution is shaped like a normal curve. We saw these properties earlier in our simulation study.\n",
"\n",
Expand All @@ -353,7 +353,7 @@
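A quick simulation can confirm the finite population correction: the SD of the average of $n$ draws made without replacement from $1, \ldots, N$ should be close to $\sqrt{(N-n)/(N-1)}\,\cdot\,\mathrm{SD(pop)}/\sqrt{n}$. The choices $N = 200$ and $n = 100$ mirror the Wikipedia example; the number of repetitions is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)

N, n = 200, 100
population = np.arange(1, N + 1)
sd_pop = population.std()  # population SD (divides by N)

# simulate the sampling distribution of the average rank
means = [rng.choice(population, size=n, replace=False).mean()
         for _ in range(10_000)]

simulated_sd = np.std(means)

# theory: SD(pop)/sqrt(n), shrunk by the finite population correction
theory_sd = np.sqrt((N - n) / (N - 1)) * sd_pop / np.sqrt(n)
```

The simulated SD lands close to the theoretical value (about 4.09 here), noticeably smaller than the with-replacement value $\mathrm{SD(pop)}/\sqrt{n} \approx 5.77$.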
"cell_type": "markdown",
"metadata": {},
"source": [
"## Probability behind testing and intervals\n",
"## Probability Behind Testing and Intervals\n",
"\n",
"As mentioned at the beginning of this chapter, probability is the underpinning behind conducting a hypothesis test, providing a confidence interval for an estimator and a prediction interval for a future observation. \n",
"\n",
Expand Down Expand Up @@ -537,11 +537,12 @@
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\mathbb{E} [ y_0 - \\hat{f}(x_0)]^2 & = \\mathbb{E} [ g(x_0) + \\epsilon_0 - \\hat{f}(x_0)]^2 & \\textrm{definition}~\\textrm{of} ~y_0\\\\\n",
" & = \\mathbb{E} [ g(x_0) + \\epsilon_0 - \\mathbb{E}[\\hat{f}(x_0)] + \\mathbb{E}[\\hat{f}(x_0)] - \\hat{f}(x_0)]^2 & \\textrm{adding}~ \\pm \\mathbb{E}[\\hat{f}(x_0)] \\\\\n",
" & = \\mathbb{E} [ g(x_0) - \\mathbb{E}[\\hat{f}(x_0)] - (\\hat{f}(x_0) - \\mathbb{E}[\\hat{f}(x_0)]) + \\epsilon_0]^2 & \\text{rearranging terms}\\\\\n",
" & = [ g(x_0) - \\mathbb{E}[\\hat{f}(x_0)]]^2 + \\mathbb{E}[\\hat{f}(x_0) - \\mathbb{E}[\\hat{f}(x_0)]]^2 + \\sigma^2 & \\text{expanding the square} \\\\\n",
" & = ~~~\\text{model bias}^2 ~~~+~~~ \\text{model variance} ~~~+~~~ \\text{error}\n",
"\\mathbb{E} & [ y_0 - \\hat{f}(x_0)]^2 \\\\\n",
" & = \\mathbb{E} [ g(x_0) + \\epsilon_0 - \\hat{f}(x_0)]^2 &\\textrm{definition}~\\textrm{of}~y_0\\\\\n",
" & = \\mathbb{E} [ g(x_0) + \\epsilon_0 - \\mathbb{E}[\\hat{f}(x_0)] + \\mathbb{E}[\\hat{f}(x_0)] - \\hat{f}(x_0)]^2 &\\textrm{adding}~ \\pm \\mathbb{E}[\\hat{f}(x_0)] \\\\\n",
" & = \\mathbb{E} [ g(x_0) - \\mathbb{E}[\\hat{f}(x_0)] - (\\hat{f}(x_0) - \\mathbb{E}[\\hat{f}(x_0)]) + \\epsilon_0]^2 &\\text{rearranging terms}\\\\\n",
" & = [ g(x_0) - \\mathbb{E}[\\hat{f}(x_0)]]^2 + \\mathbb{E}[\\hat{f}(x_0) - \\mathbb{E}[\\hat{f}(x_0)]]^2 + \\sigma^2 &\\text{expanding the square} \\\\\n",
" & = ~~~\\text{model bias}^2 ~~~+~~~ \\text{model variance} ~~+~~ \\text{error}\n",
"\\end{aligned}\n",
"$$"
]
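The decomposition above can be illustrated with a simulation: fit a deliberately simple model (here just a constant, $\bar{y}$) to data from a curved $g$, and check that the average squared prediction error at $x_0$ splits into bias², model variance, and irreducible error. The functions, noise level, and sizes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):
    return x ** 2     # true mean function g(x)

sigma = 0.5           # SD of the noise epsilon
x0 = 0.9              # prediction point
n = 50

# repeatedly generate a training set and record the fitted prediction at x0
preds = []
for _ in range(20_000):
    x = rng.uniform(0, 1, size=n)
    y = g(x) + rng.normal(0, sigma, size=n)
    preds.append(y.mean())  # the fitted "model" is the constant ybar
preds = np.array(preds)

bias_sq = (g(x0) - preds.mean()) ** 2   # model bias squared
model_var = preds.var()                 # model variance
noise = sigma ** 2                      # irreducible error

# estimate E[(y0 - fhat(x0))^2] directly from fresh observations at x0
y0 = g(x0) + rng.normal(0, sigma, size=20_000)
total = np.mean((y0 - preds) ** 2)
```

For this badly underfitting model the bias term dominates, and the directly estimated total error matches the sum of the three pieces.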
Expand Down
9 changes: 8 additions & 1 deletion content/ch/17/inf_pred_gen_summary.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -69,8 +69,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Furthermore, at times the number of hypothesis tests or confidence intervals we are carrying out is quite large. For example, this can occur with multiple linear regression, when we have a large number of features in the model and we separately test whether each coefficient is 0. This situation can arise when we are trying to select a model from among many possibilities. This is the topic of the next section."
"Furthermore, at times the number of hypothesis tests or confidence intervals we are carrying out is quite large. For example, this can occur with multiple linear regression, when we have a large number of features in the model and we separately test whether each coefficient is 0. This situation can arise when we are trying to select a model from among many possibilities. This is the topic of the next part of this book, but first we look at a case study."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
Expand Down
