infer_w_survey_design_usingR.rmarkdown

---
title: "Survey design-informed inference with British Social Attitudes Survey data using R"
author: "UK Data Service"
date: last-modified
date-format: "MMMM YYYY"
mainfont: "Arial"
title-block-banner: "white"
title-block-banner-color: "#742082"
format:
#  docx: default
  html:
   toc: true
   smooth-scroll: true
   toc-location: left
   css: ukds.css
execute:
  warning: false
---


This exercise is part of the ['Introduction to the British Social Attitudes Survey (BSA)'](https://trainingmodules.ukdataservice.ac.uk/attitudes/#/){target="_blank" rel="noopener"} online module.
 In this exercise, we will practice statistical inference with data from the [British Social Attitudes Survey (BSA) 2017](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8450){target="_blank" rel="noopener"} using weights and survey design variables.  

Please note that at the time of writing this document only some of the BSA editions include survey design variables. For more information about inference from social surveys, including cases where weights and/or survey design variables are not available, please consult [our guidelines](https://ukdataservice.ac.uk/learning-hub/survey-data/){target="_blank" rel="noopener"}.  

Answers to the questions asked throughout the exercise can be found at the end of the page.

### Getting started

Data can be downloaded from the [UK Data Service website](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8450){target="_blank" rel="noopener"} following [registration](https://ukdataservice.ac.uk/help/registration/registration-login-faqs/){target="_blank" rel="noopener"}. Download the compressed folder, unzip and save it somewhere accessible on your computer.

The examples below assume that the dataset has been saved in a  new folder named *UKDS* on your Desktop (Windows computers). The path would typically be `C:\Users\YOUR_USER_NAME\Desktop\UKDS`. Feel free to change it to the location that best suits your needs 

The code  below will need to be adjusted in order to match the location of the data on your computer. 

We begin by  loading  the R packages needed for the exercise and set the working directory.


```{r eval=F}
library(dplyr) ### Data manipulation functions
library(haven) ### Functions for importing data from commercial packages
library(Hmisc) ### Extra statistical functions
library(survey) ### Survey design functions

### Setting up the working directory
### Change the setwd() command  to match the location of the data on your computer 
### if required 

setwd("C:\Users\Your_Username_here\")

getwd()

# Opening the BSA dataset in SPSS format
bsa17<-read_spss("data/UKDA-8450-spss/spss/spss25/bsa2017_for_ukda.sav")
```


`
[1] C:\Users\Your_Username_here\
`


```{r files, echo=F,output=F}
library(dplyr) ### Data manipulation functions
library(haven) ### Importing stata/SPSS files
library(Hmisc) ### Extra statistical functions
library(survey) ### Survey design functions


### Setting up the working directory
### Change the setwd() command  to match the location of the data on your computer  if required 
### setwd(`C:\Users\Your_Username_here\`)
setwd("~/Documents/UKDS/data")

getwd()

# Opening the BSA dataset in SPSS format
bsa17<-read_spss("UKDA-8450-spss/spss/spss25/bsa2017_for_ukda.sav")
```


### 1. Identifying the survey design and variables 

We first need to find out about the survey design that was used in the BSA 2017, and the design variables available in the dataset. Such information can usually be found in the documentation that comes together with the data under the `mrdoc/pdf` folder or  in the data catalogue pages for the data on the [UK Data Service website](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8450#!/documentation){target="_blank" rel="noopener"}.

**Question 1**

What is the design that was used in this survey (i.e. how many sampling stages were there, and what were the units sampled). What were the primary sampling units; the strata (if relevant)? 

Now that we are a bit more familiar with the way the survey was designed, we need to try and identify the design variables we can include when producing estimates. The information can usually be found in the data documentation or the data dictionary available in the BSA documentation.  

**Question 2**

What survey design variables are available? Are there any that are missing – if so which ones? What is the name of the weights variables? 

### 2. Specifying the survey design 
We need to tell R about the survey design. In practice this often means specifying the units selected at the initial sampling stage ie the *Primary Sampling Units*, as well as the strata. This is achieved with the `svydesign()` command. In effect this command creates a copy of the dataset with the survey design information attached, that can then subsequently be used for further estimation.


```{r svy}
bsa17.s<-svydesign(ids=~Spoint, strata=~StratID, weights=~WtFactor,data=bsa17)
class(bsa17.s)
summary(bsa17.s) ### Warning: very long output
```

### 3. Mean age and its 95% confidence interval
We can now produce a first set of estimates using this information and compare them with those we would have got without accounting for  the survey design. We will compute the average (ie mean) age of respondents in the sample. We will need to use `svymean()`

```{r mean}
svymean(~RAgeE,bsa17.s)
```

 By default  `svymean()` computes the standard error of the mean. We need to  
 embed it within `confint()` in order to get a confidence interval. 

```{r ci}
confint(svymean(~RAgeE,bsa17.s)) ### Just the confidence interval...
round(
  c(
    svymean(~RAgeE,bsa17.s),
    confint(svymean(~RAgeE,bsa17.s))
    ),
  1) ### ... Or both, rounded
```

*What difference would it make to the estimates and 95% CI to compute respectively, an unweighted mean, as well as a weighted mean without accounting for the survey design?*

There are different ways of computing 'naive estimates' in R. Below we demonstrate how to do it ´by hand' for greater transparency.

Base R provides a function for computing the variance of a variable: `var()`. Since we know that:

- The standard deviation of the mean is the square root of its variance
- The standard  error of a sample mean is its standard deviation divided by the square root of the sample size
- A 95% confidence interval is the sample mean respectively minus and plus 1.96 times its standard error. 
It is then relatively straightforward to compute unweighted and 'casually weighted' confidences intervals for the mean.

```{r ci_uw}
### Unweighted means and CI
u.m<- mean(bsa17$RAgeE)
u.se<-sqrt(var(bsa17$RAgeE))/sqrt(length(bsa17$RAgeE))
u.ci<-c(u.m - 1.96*u.se,u.m + 1.96*u.se)
round(c(u.m,u.ci),1)

### Weighted means and CI without survey design
w.m<- wtd.mean(bsa17$RAgeE,bsa17$WtFactor)
w.se<-sqrt(wtd.var(bsa17$RAgeE,bsa17$WtFactor))/sqrt(length(bsa17$RAgeE))
w.ci<-c(w.m - 1.96*w.se,w.m + 1.96*w.se)
round(c(w.m,w.ci),1)
```


**Question 3**
What are the consequences of not accounting for the sample design; not using  weights and accounting for the sample design when:
- inferring the mean value of the population age?
- inferring the uncertainty  of our estimate of the population age? 

### 4. Computing a proportion and its 95% confidence interval
We can now similarly estimate the distribution of a categorical variable in the population by computing  proportions (or percentages), for instance, the proportion of people who declare themselves interested in politics. This is the `Politics` variable. It has five categories that we are going to recode into 'Significantly' (interested) and 'Not' (significantly), for simplicity. The BSA regards 'don't know' and 'refusal' responses as valid but in this case there is only one 'don't know' and no 'refusal' responses, so we can safely ignore these categories.   


```{r polrecode}
attr(bsa17$Politics,"label")     ### Phrasing of the question
table(as_factor(bsa17$Politics)) ### Sample distribution

bsa17$Politics.s<-ifelse(bsa17$Politics==1 | bsa17$Politics==2,
                         "Significantly",NA)
bsa17$Politics.s<-ifelse(bsa17$Politics>=3 & bsa17$Politics<=5,
                         "Not Interested",bsa17$Politics.s)
bsa17$Politics.s<-as.factor(bsa17$Politics.s)
rbind(table(bsa17$Politics.s),
      round(100*prop.table(table(bsa17$Politics.s)),1)
)
``` 

Changes in a data frame are not automatically transferred into `svydesign` objects  used for inferences. We therefore need to recreate it each time  we create or recode a variable.


```{r poltab}

rbind(round(wtd.table(bsa17$Politics.s,bsa17$WtFactor)$sum.of.weights,1),
      round(100*prop.table(wtd.table(bsa17$Politics.s,bsa17$WtFactor)$sum.of.weights),1)
)

bsa17.s<-svydesign(ids=~Spoint, strata=~StratID, weights=~WtFactor,data=bsa17)

rbind(round(svytable(~Politics.s,bsa17.s),1),
      round(100*prop.table(svytable(~Politics.s,bsa17.s)),1)
)
```

As with the mean of age earlier, we can see that the weighted and unweighted point estimates of the proportion of respondents  significantly interested in politics  differ, even if  slightly, and that weighted point estimates  do not differ irrespective of the  survey design being accounted for.

Let us now examine the confidence intervals of these proportions. Traditional statistical software usually compute these without telling  us about the underlying computations going on.  By contrast, doing this in R requires more coding, but in the process we gain a better understanding of what is actually estimated. 

Confidence intervals for proportion of categorical variables are usually computed as a sequence   of binomial/dichotomic estimations -- ie one for each category. In R this needs to be specified explicitly via the `svyciprop()` and `I()` functions. The former actually computes the proportion and its confidence interval (by default 95%), whereas the latter allows us to define the category we are focusing on (in case of non dichotomic variable).


```{r cipoltab}
svyciprop(~I(Politics.s=="Significantly"),bsa17.s)
round(100*
        c(prop.table(svytable(~Politics.s,bsa17.s))[2],
attr(svyciprop(~I(Politics.s=="Significantly"),bsa17.s),"ci")),1
)
```                     

**Question 4**

What is the proportion of respondents aged 17-34 in the sample, as well as its 95% confidence interval? You can use `RAgecat5`

### 5. Domain (ie subpopulation) estimates
Computing estimates for specific groups of a sample (for example the average age of people who reported being interested in politics) is not much more difficult than  doing it for the sample as a whole. However doing it as part of an inferential analysis requires some caution. Calculating weighted estimates  for  a subpopulation, amounts to computing second order estimates ie an estimate for a group whose size needs to be estimated first. Therefore, attempting this while  leaving out of  the rest of the sample might  yield incorrect results. This is why using survey design informed functions is particularly recommended in such cases.

The `survey` package function`svyby()` makes such domain estimation relatively straightforward. For instance, if we would like to compute the mean age of BSA respondents by Government Office Regions, we need to specify:

- The outcome variable whose estimate we want to compute: ie `RAgeE`
- The grouping variable(s) `GOR_ID`
- The estimate function we are going to use here: `svymean`, the same as  we used before
- And the type of type of variance estimation we would like to see displayed ie standard errors or confidence interval  

```{r bygor}
bsa17$gor.f<-as_factor(bsa17$GOR_ID)
bsa17.s<-svydesign(ids=~Spoint, strata=~StratID, weights=~WtFactor,data=bsa17)

round(svyby(~RAgeE,by=~gor.f,svymean,design=bsa17.s,vartype = "ci")[-1],1)
```

 *Note:* we used `[-1]` from the object created by `svyby()` in order to remove a column with alphanumeric values (the region names), so that we could round the results without getting an error.
 
 Our inference seem to suggest that the population in  London is among the youngest in the country, and that those in the South West are among the oldest -- their respective 95% confidence intervals do not overlap. We should not feel  so confident about differences between London and the South East for example, as the CIs partially overlap.  

 We can follow a similar approach with proportions: we just need to specify the category of the variable we are interested in as an outcome, for instance respondents who are significantly interested in politics, and replace `svymean` by `svyciprop`.


```{r bygorprop}
round(
      100*
      svyby(~I(Politics.s=="Significantly"),
            by=~gor.f,
            svyciprop,
            design=bsa17.s,
            vartype = "ci")[-1],
            1)
```

 **Question 5**
 
What is the 95% confidence interval for the proportion of people interested in politics in the South West? Is the proportion likely to be different in London? In what way? What is the region of the UK for which the precision of the estimates is likely to be the smallest?

When using `svyby()`, we can  define domains or subpopulations with several variables, not just one. For example, we could have looked at gender differences in political affiliations by regions. However, as the size of subgroups decrease, so does the precision of the estimates as their confidence interval widens, to a point where their substantive interest is not meaningful anymore. 

 **Question 6**
 
Using interest in politics as before, and three category age 
`RAgecat5` (which you may want to recode as a factor in order to improve  display clarity): 

- Produce a table of results showing the proportion of respondents significantly interested in Politics by age group


- Assess whether the age difference in interest for politics is similar for each gender?

- Based on the data, is it fair to say that men aged under 35 tend to  be more likely to declare  themselves  interested  in politics  than women aged 55 and above?


### Answers

**Question 1**
The 2017 BSA is a three stage stratified random survey, with postcode sectors, adresses and individuals as the units selected at each stage. Primary sampling units were furthermore stratified according to geographies (sub regions), population density, and proportion of owner-occupiers. Sampling rate was proportional to the size of postcode sectors (ie number of addresses)

**Question 2**
From the Data Dictionary it appears that the primary sampling units (sub regions) are identified by```Spoint``` and the strata by```StratID```. The weights variable is```WtFactor```. Addresses are not provided but could be approximated with a household identifier. 


**Question 3**
Not using weights would make us overestimate the mean age in the population (of those aged 16+) by about 4 years. This is likely to be due to the fact that older respondents are more likely to take part to surveys. Using  survey design variables does not alter the value of the estimated population mean. However, not accounting for them would lead us to overestimate the precision/underestimate the uncertainty of our estimate with a narrower confidence interval -- by about plus and minus 2 months . 


**Question 4**
The proportion of 17-25 year old in the sample is `r round(100*as.numeric(svyciprop(~I(RAgecat5 == 1),bsa17.s)[1]),1)` and its 95%confidence interval `r round(100*attr(svyciprop(~I(RAgecat5 == 1),bsa17.s),"ci"),1)`

**Question 5** 
The 95% confidence interval for the proportion of people interested in politics in the South West is `r round(100*svyby(~I(Politics.s=="Significantly"),by=~gor.f,svyciprop,design=bsa17.s,vartype = "ci")[-1],1)[9,c(2,3)]`. By contrast, it is likely to be `r round(100*svyby(~I(Politics.s=="Significantly"),by=~gor.f,svyciprop,design=bsa17.s,vartype = "ci")[-1],1)[7,c(2,3)]` in London. The region with the lowest precision of estimates (ie the widest confidence interval) is Wales, with a  `r as.numeric(round(100*svyby(~I(Politics.s=="Significantly"),by=~gor.f,svyciprop,design=bsa17.s,vartype = "ci")[-1],1)[10,3]-round(100*svyby(~I(Politics.s=="Significantly"),by=~gor.f,svyciprop,design=bsa17.s,vartype = "ci")[-1],1)[10,2])`  percentage point difference between the upper and lower bounds of the confidence interval.


**Question 6**


```{r 6.1}
bsa17$RAgecat5.f<-as_factor(bsa17$RAgecat5)
bsa17$Rsex.f<-as_factor(bsa17$Rsex)


bsa17.s<-svydesign(ids=~Spoint, strata=~StratID, weights=~WtFactor,data=bsa17)

round(
      100*
      svyby(~I(Politics.s=="Significantly"),
            by=~RAgecat5.f+Rsex.f,
            svyciprop,
            design=bsa17.s,
            vartype = "ci")[c(-8,-4),c(-2,-1)],
      1)
``` 

Older respondents both male and female tend to be more involved in politics than younger ones.

The confidence interval for the proportion of men under 35 and women above 55 interested in politics overlaps; it is unlikely that they  differ in the population.


<!-- # Find number of obs per type -->
<!-- rcat<-wtd.table(bsa17$RSexAge2,bsa17$WtFactor)$sum.of.weights -->
<!-- n=length(bsa17$RSexAge2) -->
<!-- # Get the proportion -->
<!-- p_hat = rcat[1]/n -->
<!-- alpha=.95 -->
<!-- # Calculate the critical z-score -->
<!-- z = qnorm(1-alpha/2) -->

<!-- # Compute the CI -->
<!-- 100*(p_hat + c(-1,1)*z*sqrt(p_hat*(1-p_hat)/n)) -->

<!-- svyby(~RAgeE,by=~GOR_ID+Rsex,svymean,design=bsa17.s,vartype = "ci") -->

<!-- svyby(~I(Politics.s=="Not"),by=~GOR_ID+Rsex,svyciprop,design=bsa17.s,vartype="ci") -->

<!-- round(100*prop.table(ftable(svytable(~GOR_ID+Rsex+Politics.s,design=bsa17.s)),1),1) -->

<!-- round(100*prop.table(ftable(svytable(~GOR_ID+Rsex+Politics.s,design=bsa17.s)),1),1) -->