---
title: "Basic population estimates with British Social Attitudes Survey data using R"
author: "UK Data Service"
date: last-modified
date-format: "MMMM YYYY"
mainfont: "Arial"
title-block-banner: "white"
title-block-banner-color: "#742082"
format:
# docx: default
html:
toc: true
smooth-scroll: true
toc-location: left
css: ukds.css
execute:
warning: false
editor:
markdown:
wrap: sentence
---
This exercise is part of the ['Introduction to the British Social Attitudes Survey (BSA)'](https://trainingmodules.ukdataservice.ac.uk/attitudes/#/){target="_blank" rel="noopener"} online module.
In the exercise, we examine data from the 2020 British Social Attitudes survey to find out:
- what proportion of respondents said they voted Remain in the EU referendum;
- whether people think the government should raise taxes and spend more, or reduce taxes and cut social expenditure;
- how much people think they will get from the State Pension.
Answers to the questions asked throughout the exercise can be found at the end of the page.
### Getting started
Data can be downloaded from the [UK Data Service website](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=9005){target="_blank" rel="noopener"} following [registration](https://ukdataservice.ac.uk/help/registration/registration-login-faqs/){target="_blank" rel="noopener"}. Download the compressed folder, unzip and save it somewhere accessible on your computer.
The examples below assume that the dataset has been saved in a new folder named *UKDS* on your Desktop (Windows computers). The path would typically be `C:\Users\YOUR_USER_NAME\Desktop\UKDS`. Feel free to change it to the location that best suits your needs.
We begin by loading the R packages needed for the exercise and setting the working directory.
```{r eval=F}
library(dplyr) ### Data manipulation functions
library(haven) ### Functions for importing data from
               ### commercial packages
library(Hmisc) ### Extra statistical functions

### Setting up the working directory
### Change the setwd() command to match the location
### of the data on your computer if required
### Note: R expects forward slashes (or doubled
### backslashes) in Windows paths
setwd("C:/Users/YOUR_USER_NAME/Desktop/UKDS")
getwd()

# Opening the BSA dataset in SPSS format
bsa20<-read_spss(
  'UKDA-9005-spss/spss/spss25/bsa2020_archive.sav'
)
```
`
[1] "C:/Users/YOUR_USER_NAME/Desktop/UKDS"
`
```{r files, echo=F,output=F}
library(dplyr) ### Data manipulation functions
library(haven) ### Importing stata/SPSS files
library(Hmisc) ### Extra statistical functions
### Setting up the working directory
### Change the setwd() command to match the location
### of the data on your computer if required
### setwd(`C:\Users\Your_Username_here\`)
setwd("~/Documents/UKDS/data")
getwd()
# Opening the BSA dataset in SPSS format
bsa20<-read_spss(
'UKDA-9005-spss/spss/spss25/bsa2020_archive.sav'
)
```
### 1. Explore the dataset
Start by getting an overall feel for the data.
Either inspect variables and cases in the data editor or use the code below to produce a summary of all the variables in the dataset.
```{r desc}
### Gives the number of rows (observations)
### and columns (variables)
dim(bsa20)
### List variable names in their actual
### order in the dataset
names(bsa20)
### Displays the first six
### lines of a data frame
### Beware, the output might be lengthy!
head(data.frame(bsa20))
```
**Questions**
1. What is the overall sample size?
2. How many variables are there in the dataset?
Now, focus on the three variables we will use.
**Note** In traditional statistical software packages such as SPSS or Stata, categorical variables are coded as arbitrary numbers, to which value labels are attached that describe the substantive meaning of these values.
R, on the other hand, can either deal directly with the values themselves as alphanumeric variables, or with its own representation of categorical variables, known as 'factors'.
There is no completely straightforward way to convert SPSS or Stata labelled categorical variables into R factors.
The approach followed by the `haven` package that we use here consists of preserving the original numeric values in the data and adding attributes that can be manipulated separately.
Attributes are a special type of R object that have a name and can be read using the `attr()` function.
Each variable has a 'label' and a 'labels' attribute: the former is the variable description, the latter the value labels.
Alternatively, haven-imported numeric variables can be converted into factors whose levels (ie categories) reflect the SPSS or Stata value labels, but whose underlying numeric codes may differ from the original values.
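To make this concrete, here is a minimal, self-contained sketch of how haven represents a labelled variable. The values and labels below are invented for illustration and are not taken from the BSA:

```{r eval=F}
library(haven)   # provides labelled() and as_factor()

# A small labelled vector, mimicking how haven imports an SPSS variable
x <- labelled(c(1, 2, 2, 1, 3),
              labels = c(Remain = 1, Leave = 2, `Don't know` = 3),
              label = "Hypothetical EU referendum vote")

attr(x, "label")    # the variable description
attr(x, "labels")   # the value labels
as_factor(x)        # conversion to a factor using the labels
```

Note that `as_factor(x)` discards the original numeric codes: the factor levels follow the order of the underlying values, but their internal codes need not match them.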
Let's examine the original variable description and value labels.
```{r attr}
# We can do this variable by variable...
attr(bsa20$TAXSPEND,"label")
# Or all at once
t(bsa20|>select(TAXSPEND,EUVOTWHO,PenExp2)
|>summarise_all(attr,"label"))
# The same holds with value labels
attr(bsa20$TAXSPEND,"labels")
attr(bsa20$EUVOTWHO,"labels")
```
**Question 3** What do the variables measure and how?
### 2. Missing values
Let's now examine the distribution of our three variables.
We can temporarily convert `EUVOTWHO` and `TAXSPEND` into factors using `mutate()` for a more meaningful output that includes their value labels.
Review the frequency tables, examining the 'not applicable' and 'don't know' categories.
```{r summ}
bsa20 %>% select(EUVOTWHO,TAXSPEND) %>%
  mutate(across(everything(), as_factor)) %>%
summary()
summary(bsa20$PenExp2)
```
**Question 4** Why are there so many system missing values (NA) for EUVOTWHO and PenExp2?
Note, you can use the documentation to check if needed.
What does this mean when it comes to interpreting the percentages?
When analysing survey data, it is sometimes convenient to set all item nonresponses such as 'Don't know' and 'Prefer not to say' to system missing so that they do not appear in the results. However, the BSA treated 'don't know' and refusals as valid responses when the weights were computed. Therefore, with the BSA we can only convert item non-response to system missing as long as we are not planning to use the data to make inferences about the British population; otherwise we might get biased results.
For the record, the code below shows how to recode the missing values to system missing (NA) using separate variables. For ease of interpretation, we also convert the original numeric variables into labelled factors using `as_factor()`, so that they directly display the value labels.
```{r missing}
bsa20<-bsa20%>%mutate(
TAXSPEND.r=factor(as_factor(TAXSPEND,"labels"),
exclude = c("Prefer not to answer",
"Don't know")),
EUVOTWHO.r=factor(as_factor(EUVOTWHO,"labels"),
exclude = c("Prefer not to answer",
"I Don't remember","Not applicable",NA)),
PenExp2.r=ifelse(PenExp2==-1 | PenExp2>=9998,NA,PenExp2)
)
### Value labels need to be truncated as they are rather lengthy!
levels(bsa20$TAXSPEND.r)<-substr(levels(bsa20$TAXSPEND.r),1,14)
levels(bsa20$EUVOTWHO.r)<-substr(levels(bsa20$EUVOTWHO.r),1,6)
levels(bsa20$TAXSPEND.r)
levels(bsa20$EUVOTWHO.r)
```
### 3. Compare unweighted and weighted proportions
Let's examine the unweighted responses first.
In order to ensure coherence with the remainder of this exercise, we use `xtabs()` for categorical variables and `summary()` for continuous ones.
Unlike some other surveys, the BSA retained observations with 'Don't know' and 'Does not apply' responses when the weights were computed. Any univariate analysis aiming to make inferences about the British population therefore needs to retain these observations, otherwise the estimates might be incorrect. The code below accordingly retains them.
```{r test}
bsa20<-bsa20%>%mutate(
  TAXSPEND.f=as_factor(TAXSPEND,"labels"),
  EUVOTWHO.f=as_factor(EUVOTWHO,"labels"),
  PenExp2.f=ifelse(PenExp2==-1 | PenExp2>=9998,NA,PenExp2)
)
# As before, we can truncate factor levels for a more human-friendly output
levels(bsa20$TAXSPEND.f)<-substr(levels(bsa20$TAXSPEND.f),1,14)
levels(bsa20$EUVOTWHO.f)<-substr(levels(bsa20$EUVOTWHO.f),1,6)

round(                        ### Rounds the results to one decimal
  100*                        ### Converts proportions to %
  prop.table(                 ### Computes proportions
    xtabs(~TAXSPEND.f,bsa20,  ### Computes frequencies...
      drop.unused.levels = T) ### ...leaving out levels with 0 observations
  ),
1)
round(100*prop.table(xtabs(~EUVOTWHO.f,bsa20,drop.unused.levels = T)),1)
summary(bsa20$PenExp2.f)
```
What is the (unweighted) percentage of respondents who say they voted remain in the EU referendum?
About `r round(100*prop.table(xtabs(~EUVOTWHO.r,bsa20)))[1]` percent of sample members who voted in the referendum said they voted to remain.
This figure seems a bit high (though people do not always report their vote accurately).
Let's compare with the weighted frequencies.
We will use the `wtd.table()` function from the `Hmisc` package.
In the command below, the weights are specified after the variable for which we request the frequencies.
```{r weight}
# Compute the weighted frequencies
wtd.table(bsa20$EUVOTWHO.f,weights=bsa20$BSA20_wt_new)
# Store them for further use...
euv.w<-wtd.table(bsa20$EUVOTWHO.f,weights=bsa20$BSA20_wt_new)
# ... and convert them into rounded percentages
euv.wp<-round(
  100*
    prop.table(
      euv.w$sum.of.weights),
  1)
euv.wp
# We can easily improve the output
cbind(euv.w$x,"Weighted %"=euv.wp)
```
Now, what proportion say they voted remain in the EU referendum?
It is about `r round(100*prop.table(wtd.table(bsa20$EUVOTWHO.f,weights=bsa20$BSA20_wt_new)$sum.of.weights))[1]` percent, lower than the unweighted proportion and closer to the actual referendum results.
Do you have an idea as to why this might be the case?
### 4. Confidence intervals
So far, we have just computed point estimates without worrying about their precision. The precision (or uncertainty) of an estimate matters insofar as it determines how wide the range is within which the 'true' population value is likely to lie. These ranges are also known as the *confidence intervals* of our estimates.
In this exercise, we will be computing confidence intervals 'by hand' and ignoring the survey design (ie whether clustering or stratification were used when collecting the sample), as this information is not available in this edition of the BSA. This amounts to assuming that the sample was collected using simple random sampling - which was not the case - and is likely to make our confidence intervals narrower than they should be.
We will explore the more reliable survey design functions provided by the `survey` package in the next exercise.
The `Hmisc` package provides `binconf()`, a handy function for computing confidence intervals for proportions.
Although this is usually left under the hood by traditional statistical packages, estimating confidence intervals for the proportions of a categorical variable requires looking at each category individually. In other words, we need to compute one set of confidence intervals for each of the categories of `TAXSPEND.f` and `EUVOTWHO.f` separately.
We can then gather them into a table of results using `rbind()`.
We need to provide `binconf()` with two parameters: the frequency for which we would like a confidence interval, and the total number of non-missing observations.
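As an illustration of these two parameters, here is a small example with made-up figures (not taken from the BSA): 45 'Remain' answers out of 110 valid responses, and 65 'Leave' answers out of the same 110.

```{r eval=F}
library(Hmisc)   # provides binconf()

remain.ci <- binconf(45, 110)   # frequency, then number of valid observations
leave.ci  <- binconf(65, 110)

# Gather both sets of confidence intervals into one table
rbind(remain.ci, leave.ci)
```

By default `binconf()` returns the point estimate together with the lower and upper bounds of a 95% Wilson confidence interval.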
```{r ci}
### Raw confidence interval for the proportion of 'Remain'
### answers in EUVOTWHO, unweighted
binconf(sum(bsa20$EUVOTWHO.f=="Remain",na.rm=TRUE),
        sum(!is.na(bsa20$EUVOTWHO.f)))
### We can convert the output into rounded percentages for better readability.
round(100*
  binconf(sum(bsa20$EUVOTWHO.f=="Remain",na.rm=TRUE),
          sum(!is.na(bsa20$EUVOTWHO.f))),
1)
```
We can adapt the syntax above to make it work with weighted frequencies:
```{r ciprop1}
round(100*
binconf(wtd.table(bsa20$EUVOTWHO.f,weights=bsa20$BSA20_wt_new)$sum.of.weights[2],
sum(wtd.table(bsa20$EUVOTWHO.f,weights=bsa20$BSA20_wt_new)$sum.of.weights)),
1)
```
What are the differences between weighted and unweighted confidence intervals for the proportion of people who voted remain?
Let us now do the same with people's views about government tax and spending.
```{r ciprop2}
w.n<-sum(wtd.table(bsa20$TAXSPEND.f,weights=bsa20$BSA20_wt_new)$sum.of.weights)
ciprop<-cbind(levels(bsa20$TAXSPEND.f),
round(100*
binconf(wtd.table(bsa20$TAXSPEND.f,weights=bsa20$BSA20_wt_new)$sum.of.weights,w.n),
1)
)
ciprop
```
When computing confidence intervals for means, two steps are usually needed, whether embedded in a single line of code or not: first compute the mean (or any other estimate), then the confidence interval itself.
We also use the `round()` function in order to remove unneeded decimal places.
**Question 5.** What proportion think government should increase taxes and spend more on health, education and social benefits?
Several R packages offer functions for computing confidence intervals and standard errors of means.
Here again, we privilege doing things by hand in order to properly understand what is happening in the background.
Under the assumption of simple random sampling, a 95% confidence interval for the mean is defined as the mean plus or minus 1.96 times its standard error.
The standard error of the mean is the standard deviation (that is, the square root of the variance) divided by the square root of the sample size.
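In formula terms, under simple random sampling the 95% confidence interval of a mean $\bar{x}$ is:

$$\bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}$$

where $s$ is the standard deviation of the variable and $n$ the sample size.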
Since we have functions for computing weighted means and variance in R, we can compute:
```{r cimean}
m.p<-wtd.mean(bsa20$PenExp2.r,weights=bsa20$BSA20_wt_new)
se.p<-sqrt(wtd.var(bsa20$PenExp2.r,weights=bsa20$BSA20_wt_new))
n<-sum(bsa20$BSA20_wt_new[!is.na(bsa20$PenExp2.r)])
ci<-c(m.p,m.p-1.96*(se.p/sqrt(n)),m.p+1.96*(se.p/sqrt(n)))
round(ci,1)
```
**Question 6** How much do people think they will get at state pension age?
### Answers
1. There are `r nrow(bsa20)` cases in the dataset.
2. The total number of variables is `r ncol(bsa20)`.
3. `TAXSPEND` records responses to the question of whether the government should reduce, maintain or increase levels of taxation and spending; there are three possible responses.
`EUVOTWHO` records responses to the question 'Did you vote to "remain a member of the EU" or to "leave the EU"?'; the responses are 'Remain' or 'Leave'.
`PenExp2` contains responses to the question 'How much do you think someone who reaches State Pension age today would receive in pounds per week?'; responses are numeric.
4. There are two reasons for the large number of 'Not applicable' responses:
- Routing: the question was only asked of those who said yes to a previous question (EURefV2).
- Split sample: the BSA uses a split sample, and this question was only asked in versions 5 and 6 of the questionnaire.
5. Between `r ciprop[3,3]`% and `r ciprop[3,4]`% of the population say the government should increase taxes and spend more.
6. The estimated average (ie mean) amount people think they will get at State Pension age is around £`r round(ci[1])` per week, with a 95% confidence interval ranging from £`r round(ci[2])` to £`r round(ci[3])`.