LabAssignment5.Rmd

---
title: Lab Assignment 5
author: STA 100 | A. Farris | Spring 2021
abstract: \emph{A pdf copy of your assignment is due at 5pm Monday, May 24. Submission of the pdf will be through Gradescope. Please put in the effort to make it look reasonably professional -- you're encouraged to use R Markdown. Note that specific tasks for you are \hl{highlighted}. You are free to work in groups, but you must submit your own writeup, and run your own code.}
header-includes:
   - \usepackage{enumerate,verbatim,graphicx,soul,xcolor,nicefrac}
   - \renewcommand{\abstractname}{}
geometry: margin=0.75in
output: pdf_document
---

\section*{Part I: persistence of olfactory/taste alterations from COVID-19 and chronic asthma}

```{r, echo=FALSE, results='asis'}
#### Part I
library(knitr)
counts <- matrix(c(8,87,3,23),2,2)
rownames(counts) <- c("Asthma","No asthma")
colnames(counts) <- c("Resolved","Persistent")
pVal <- fisher.test(counts)$p.val
kable(counts)
```

A study into the persistence of olfactory/taste alterations associated with COVID-19 patients\footnote{Lovato, Andrea et al. "Clinical characteristics associated with persistent olfactory and taste alterations in COVID-19: A preliminary report on 121 patients." American journal of otolaryngology vol. 41,5 (2020): 102548. doi:10.1016/j.amjoto.2020.102548} obtained the counts in the table above. Assuming these patients to have been randomly sampled, let's investigate the chance of persistence of these alterations for patients with asthma as compared to that for those without.

From these observations, we would estimate the proportion of asthmatic cases to have persistent alterations to be $\nicefrac{3}{11}\approx `r round(3/11,4)`$, while the estimate for non-asthmatic cases would be $\nicefrac{23}{110}\approx `r round(23/110,4)`$. Clearly, we estimate the former to be larger than the latter. But, can we show beyond reasonable doubt that these proportions are different?

These estimates are themselves not enough, because they do not factor in uncertainty due to sampling: after all, these only describe the sample, not the population. In particular, notice that of the 121 patients observed here, only 11 of them had asthma. We have a much larger sample size for non-asthmatic patients than asthmatic, suggesting that we can be more confident of having a good estimate for the proportion in the latter case than the former.

So, if we were to assume that the two proportions were actually the same, what would be the chance of observing something like this result? This is answered by the p-value obtained from Fisher's Exact test, which here yields `r round(pVal,4)`. If we use, for example, significance level $\alpha=0.1$, we fail to reject the null hypothesis of independence: we conclude that there is not sufficient evidence here to reject the possibility that the proportions are the same.

Notice that we do \emph{not} conclude that the two proportions are the same--we have merely found an absence of sufficient evidence to the contrary.


\section*{Part II: Stanford seroprevalence study}


As of mid-May 2020, a key scientific question with regard to the COVID-19 pandemic was the infection fatality rate (IFR) of the disease. Unlike the case fatality rate (CFR), which measures fatalities as a fraction of detected cases, the IFR measures the proportion of all infections, whether detected or not, that result in a fatality. 

The IFR measures the cost in lives of the disease as a proportion of infections, and is therefore crucial in measuring the cost of policies that are intended to slow or stop the disease. As policymakers come under political and financial pressure to relax social distancing measures, an understanding of the cost associated with such easing is of paramount importance.

An important quantity in an understanding of the IFR is what is essentially its denominator, the number of infections that have taken place. Because in part of the dearth of reliable and systematic testing in most countries, there is substantial uncertainty surrounding this number. Furthermore, this number is of independent interest to public health authorities and policymakers, as it contains valuable information with respect to understanding how far the pandemic is from attaining some form of herd immunity.

In early April, a team of Stanford researchers began to publicize the results of a seroprevalence study that they had performed, in which they estimated the true number of infections that had taken place in Santa Clara County on the basis of a sample of people administered antibody tests. Their results, they reported, suggested that the number of infections was much higher than previously suspected, suggesting the IFR of the disease to be lower than previously suspected. The team raised eyebrows among their scientific peers for publicizing their results[^4] before they had undergone peer review, and even before they had posted a preprint[^5] of the details of their analysis for comment by the scientific community. 

Their preprint attracted substantial interest, both because of the importance of the topic and the publicity efforts on the part of the authors. Quickly, the authors came under criticism from their peers for their methods. In this lab project, we will explore one aspect of this criticism briefly.


[^4]: Stanford's Jay Bhattacharya appears on a political talk show to publicize his claimed results before peer review, Fox News, April 14, 2020: https://www.youtube.com/watch?v=6NjCitwKJSQ

[^5]: Bendavid, Eran et al. *COVID-19 Antibody Seroprevalence in Santa Clara County, California.* MedRxiv Preprint (April 17, 2020) https://www.medrxiv.org/content/10.1101/2020.04.14.20062463v1

[^6]: Revised preprint, April 30, 2020:
https://www.medrxiv.org/content/10.1101/2020.04.14.20062463v2

\subsubsection*{Accounting for specificity}


In interpreting the results of the antibody tests, it is of course important to factor in the false positive rate of the test.

The authors report the results of a validation study by the antibody test's manufacturer, in which they applied the antibody tests to 371 blood samples predating the Covid-19 pandemic. They report that 369 of these tested negative, suggesting an estimated specificity of $\nicefrac{369}{371} \approx `r round(369/371,4)`$, and an estimated false positive rate of $\nicefrac{2}{371} \approx `r round(2/371,4)`$.

From the Stanford study, the authors report that out of 3,330 patients, they found 50 to test positive, for an estimated prevalence (i.e. population proportion of infected people) of 
$\nicefrac{50}{3330} \approx `r round(50/3330,4)`$. However, this estimate does not in itself account for the possibility of false positives. 

Let's put their estimate to the test. Do we have conclusive evidence that the Stanford study found *any* true positives? The authors are certainly claiming more than this, but this is at least a first hurdle in their burden of proof.

One way to think about this is that we want to carry out a hypothesis test of the Null hypothesis that the population proportion with false positives in the manufacturer study was actually the same as that in the Stanford study. 

We have available to us a good test for evaluating whether an unknown proportion is equal to a given value, the Binomial Test. Can we use the Binomial test to evaluate whether the proportions of positive test results were the same in the manufacturers' as in the Stanford study?

No, we can't, because we don't know the true proportion in either case! In both cases, we have a sample quantity, and do not know the population quantity.

Fortunately, we have another tool that will do the trick. Suppose that the true proportion of positives is the same for patients in the Stanford study as it is for those in either of the two studies. We can then say that, for a randomly selected patient from both studies put together, the event that they test positive is independent of the event that they participated in the Stanford study. Therefore, if we organize the results into a contingency table:


```{r, echo=FALSE}
### Part II
library(knitr)
SeroprevalenceTable <- matrix(c(369,2,3280,50),2,2)
rownames(SeroprevalenceTable) <- c("Negative","Positive")
colnames(SeroprevalenceTable) <- c("Manufacturer", "Stanford")
kable(SeroprevalenceTable)
```

then a test for association between these two factors suffices.

\hl{With the data in the contingency table above, carry out a test for independence between testing positive and participation in the Stanford study, using significance level 0.05. To do so, state the Null and alternative hypotheses and report the p-value. Do you find sufficient evidence to conclude that the Stanford study shows a higher chance of testing positive than the manufacturer's test did? Interpret this result briefly in your own words. Is your result consistent with or in contradiction to the Stanford study's stated results?  }

\vspace{1in}


\subsubsection*{Postscript}

The authors changed their work to accommodate some criticisms in a revision of their preprint on April 30[^6]. They revised downward their estimates of the number infected slightly, but criticism from scientific colleagues continued[^7][^8].


[^7]: https://stanford.zoom.us/rec/play/ucIsce-r_TM3E4WduQSDCvZ6W47oeP6s1CNKrqcIzxuyB3YFOwagNONHYOOmQfdem9liCP2Uxclk_hhI?continueMode=true&_x_zm_rtaid=APFnyzyqRNuVVHvZI9ccEA.1589747689765.7e22018ae11ba4104bca3921248ece82&_x_zm_rhtaid=41 , link via https://bmir.stanford.edu/education/colloquia.html 

[^8]: Expanding on Trevor Hastie's comments in the Zoom meeting above: https://www.stat.berkeley.edu/~wfithian/overdispersionSimple.html .


\hrulefill

\clearpage
\begin{center} Appendix: R Script \end{center}

```{r, ref.label=knitr::all_labels(),echo=TRUE,eval=FALSE}
```