---
title: "Basic population estimates with British Social Attitudes Survey data using R"
author: "UK Data Service"
date: last-modified
date-format: "MMMM YYYY"
mainfont: "Arial"
title-block-banner: "white"
title-block-banner-color: "#742082"
format:
# docx: default
html:
toc: true
smooth-scroll: true
toc-location: left
css: ukds.css
execute:
warning: false
editor:
markdown:
wrap: sentence
---
This exercise is part of the ['Introduction to the British Social Attitudes Survey (BSA)'](https://trainingmodules.ukdataservice.ac.uk/attitudes/#/){target="_blank" rel="noopener"} online module.
In the exercise, we examine data from the 2020 British Social Attitudes survey to find out:
- what proportion of respondents said they voted Remain in the EU referendum;
- whether people think the government should raise taxes and spend more, or reduce taxes and cut social expenditure;
- how much people think they will get from the State Pension.
Answers to the questions asked throughout the exercise can be found at the end of the page.
### Getting started
Data can be downloaded from the [UK Data Service website](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=9005){target="_blank" rel="noopener"} following [registration](https://ukdataservice.ac.uk/help/registration/registration-login-faqs/){target="_blank" rel="noopener"}. Download the compressed folder, unzip and save it somewhere accessible on your computer.
The examples below assume that the dataset has been saved in a new folder named *UKDS* on your Desktop (Windows computers). The path would typically be `C:\Users\YOUR_USER_NAME\Desktop\UKDS`. Feel free to change it to the location that best suits your needs.
We begin by loading the R packages needed for the exercise and setting the working directory.
```{r eval=F}
library(dplyr) ### Data manipulation functions
library(haven) ### Functions for importing data from
               ### commercial packages
library(Hmisc) ### Extra statistical functions

### Setting up the working directory
### Change the setwd() command to match the location
### of the data on your computer if required
### Note: R expects forward slashes (or doubled
### backslashes) in Windows paths
setwd("C:/Users/YOUR_USER_NAME/Desktop/UKDS")
getwd()

# Opening the BSA dataset in SPSS format
bsa20<-read_spss(
  'UKDA-9005-spss/spss/spss25/bsa2020_archive.sav'
)
```
`
[1] "C:/Users/YOUR_USER_NAME/Desktop/UKDS"
`
```{r files, echo=F,output=F}
library(dplyr) ### Data manipulation functions
library(haven) ### Importing stata/SPSS files
library(Hmisc) ### Extra statistical functions
### Setting up the working directory
### Change the setwd() command to match the location
### of the data on your computer if required
### setwd(`C:\Users\Your_Username_here\`)
setwd("~/Documents/UKDS/data")
getwd()
# Opening the BSA dataset in SPSS format
bsa20<-read_spss(
'UKDA-9005-spss/spss/spss25/bsa2020_archive.sav'
)
```
### 1. Explore the dataset
Start by getting an overall feel for the data.
Either inspect variables and cases in the data editor or use the code below to produce a summary of all the variables in the dataset.
```{r desc}
### Gives the number of rows (observations)
### and columns (variables)
dim(bsa20)
### List variable names in their actual
### order in the dataset
names(bsa20)
### Displays the first six
### lines of a data frame
### Beware, the output might be lengthy!
head(data.frame(bsa20))
```
**Questions**
1. What is the overall sample size?
2. How many variables are there in the dataset?
Now, focus on the three variables we will use.
**Note** In traditional statistical software packages such as SPSS or Stata, categorical variables are coded as arbitrary numbers, to which value labels are attached that describe the substantive meaning of these values.
R, on the other hand, can either deal directly with the values themselves as alphanumeric variables, or with its own representation of categorical variables, known as 'factors'.
There is no completely straightforward way to convert SPSS or Stata labelled categorical variables into R factors.
The approach followed by the `haven` package that we use here consists of preserving the original numeric values in the data and adding attributes that can be manipulated separately.
Attributes are a special type of R object that have a name and can be read using the `attr()` function.
Each variable has a 'label' and a 'labels' attribute: the former is the variable description, the latter the value labels.
Alternatively, haven-imported numeric variables can be converted into factors whose levels (ie categories) reflect the SPSS or Stata value labels, but whose underlying numeric codes may differ from the original values.
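To make this concrete, here is a minimal, self-contained sketch of how haven represents a labelled variable. The values and labels below are invented for illustration and are not taken from the BSA:

```{r eval=F}
library(haven)   # provides labelled() and as_factor()

# A small labelled vector, mimicking how haven imports an SPSS variable
x <- labelled(c(1, 2, 2, 1, 3),
              labels = c(Remain = 1, Leave = 2, `Don't know` = 3),
              label = "Hypothetical EU referendum vote")

attr(x, "label")    # the variable description
attr(x, "labels")   # the value labels
as_factor(x)        # conversion to a factor using the labels
```

Note that `as_factor(x)` discards the original numeric codes: the factor levels follow the order of the underlying values, but their internal codes need not match them.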
Let's examine the original variable description and value labels.
```{r attr}
# We can do this variable by variable...
attr(bsa20$TAXSPEND,"label")
# Or all at once
t(bsa20|>select(TAXSPEND,EUVOTWHO,PenExp2)
|>summarise_all(attr,"label"))
# The same holds with value labels
attr(bsa20$TAXSPEND,"labels")
attr(bsa20$EUVOTWHO,"labels")
```
**Question 3** What do the variables measure and how?
### 2. Missing values
Let's now examine the distribution of our three variables.
We can temporarily convert `EUVOTWHO` and `TAXSPEND` into factors using `mutate()` for a more meaningful output that includes their value labels.
Review the frequency tables, examining the 'not applicable' and 'don't know' categories.
```{r summ}
bsa20 %>% select(EUVOTWHO,TAXSPEND) %>%
  mutate(across(everything(), as_factor)) %>%
summary()
summary(bsa20$PenExp2)
```
**Question 4** Why are there so many system missing values (NA) for EUVOTWHO and PenExp2?
Note, you can use the documentation to check if needed.
What does this mean when it comes to interpreting the percentages?
When analysing survey data, it is sometimes convenient to set all item nonresponses such as 'Don't know' and 'Prefer not to say' to system missing so that they do not appear in the results. However, the BSA treated 'don't know' and refusals as valid responses when the weights were computed. Therefore, with the BSA we can only convert item non-response to system missing as long as we are not planning to use the data to make inferences about the British population; otherwise we might get biased results.
For the record, the code below shows how to recode the missing values to system missing (NA) using separate variables. For ease of interpretation, we also convert the original numeric variables into labelled factors using `as_factor()`, so that they directly display the value labels.
```{r missing}
bsa20<-bsa20%>%mutate(
TAXSPEND.r=factor(as_factor(TAXSPEND,"labels"),
exclude = c("Prefer not to answer",
"Don't know")),
EUVOTWHO.r=factor(as_factor(EUVOTWHO,"labels"),
exclude = c("Prefer not to answer",
"I Don't remember","Not applicable",NA)),
PenExp2.r=ifelse(PenExp2==-1 | PenExp2>=9998,NA,PenExp2)
)
### Value labels need to be truncated as they are rather lengthy!
levels(bsa20$TAXSPEND.r)<-substr(levels(bsa20$TAXSPEND.r),1,14)
levels(bsa20$EUVOTWHO.r)<-substr(levels(bsa20$EUVOTWHO.r),1,6)
levels(bsa20$TAXSPEND.r)
levels(bsa20$EUVOTWHO.r)
```
### 3. Compare unweighted and weighted proportions
Let's examine the unweighted responses first.
In order to ensure coherence with the remainder of this exercise, we use `xtabs()` for categorical variables and `summary()` for continuous ones.
Unlike some other surveys, the BSA retained observations with 'Don't know' and 'Does not apply' responses when the weights were computed. Any univariate analysis aiming to make inferences about the British population therefore needs to retain these observations, otherwise the estimates might be incorrect. The code below accordingly retains them.
```{r test}
bsa20<-bsa20%>%mutate(
  TAXSPEND.f=as_factor(TAXSPEND,"labels"),
  EUVOTWHO.f=as_factor(EUVOTWHO,"labels"),
  PenExp2.f=ifelse(PenExp2==-1 | PenExp2>=9998,NA,PenExp2)
)
# As before, we can truncate factor levels for a more human-friendly output
levels(bsa20$TAXSPEND.f)<-substr(levels(bsa20$TAXSPEND.f),1,14)
levels(bsa20$EUVOTWHO.f)<-substr(levels(bsa20$EUVOTWHO.f),1,6)

round(                        ### Rounds the results to one decimal
  100*                        ### Converts proportions to %
  prop.table(                 ### Computes proportions
    xtabs(~TAXSPEND.f,bsa20,  ### Computes frequencies...
      drop.unused.levels = T) ### ...leaving out levels with 0 observations
  ),
1)
round(100*prop.table(xtabs(~EUVOTWHO.f,bsa20,drop.unused.levels = T)),1)
summary(bsa20$PenExp2.f)
```
What is the (unweighted) percentage of respondents who say they voted remain in the EU referendum?
About `r round(100*prop.table(xtabs(~EUVOTWHO.r,bsa20)))[1]` percent of sample members who voted in the referendum said they voted to remain.
This figure seems a bit high (though people do not always report their vote accurately).
Let's compare with the weighted frequencies.
We will use the `wtd.table()` function from the `Hmisc` package.
In the command below, the weights are specified after the variable for which we request the frequencies.
```{r weight}
# Compute the weighted frequencies
wtd.table(bsa20$EUVOTWHO.f,weights=bsa20$BSA20_wt_new)
# Store them for further use...
euv.w<-wtd.table(bsa20$EUVOTWHO.f,weights=bsa20$BSA20_wt_new)
# ... and convert them into rounded percentages
euv.wp<-round(
  100*
    prop.table(
      euv.w$sum.of.weights),
  1)
euv.wp
# We can easily improve the output
cbind(euv.w$x,"Weighted %"=euv.wp)
```
Now, what proportion say they voted remain in the EU referendum?
It is about `r round(100*prop.table(wtd.table(bsa20$EUVOTWHO.f,weights=bsa20$BSA20_wt_new)$sum.of.weights))[1]` percent, lower than the unweighted proportion and closer to the actual referendum results.
Do you have an idea as to why this might be the case?
### 4. Confidence intervals
So far, we have just computed point estimates without worrying about their precision. The precision (or uncertainty) of an estimate matters insofar as it determines how wide the range is within which the 'true' population value is likely to lie. These ranges are also known as the *confidence intervals* of our estimates.
In this exercise, we will be computing confidence intervals 'by hand' and ignoring the survey design (ie whether clustering or stratification were used when collecting the sample), as this information is not available in this edition of the BSA. This amounts to assuming that the sample was collected using simple random sampling - which was not the case - and is likely to make our confidence intervals narrower than they should be.
We will explore the more reliable survey design functions provided by the `survey` package in the next exercise.
The `Hmisc` package provides `binconf()`, a handy function for computing confidence intervals for proportions.
Although this is usually left under the hood by traditional statistical packages, estimating confidence intervals for the proportions of a categorical variable requires looking at each category individually. In other words, we need to compute one set of confidence intervals for each of the categories of `TAXSPEND.f` and `EUVOTWHO.f` separately.
We can then gather them into a table of results using `rbind()`.
We need to provide `binconf()` with two parameters: the frequency for which we would like a confidence interval, and the total number of non-missing observations.
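As an illustration of these two parameters, here is a small example with made-up figures (not taken from the BSA): 45 'Remain' answers out of 110 valid responses, and 65 'Leave' answers out of the same 110.

```{r eval=F}
library(Hmisc)   # provides binconf()

remain.ci <- binconf(45, 110)   # frequency, then number of valid observations
leave.ci  <- binconf(65, 110)

# Gather both sets of confidence intervals into one table
rbind(remain.ci, leave.ci)
```

By default `binconf()` returns the point estimate together with the lower and upper bounds of a 95% Wilson confidence interval.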
```{r ci}
### Raw confidence interval for the proportion of 'Remain'
### answers in EUVOTWHO, unweighted
binconf(sum(bsa20$EUVOTWHO.f=="Remain",na.rm=TRUE),
        sum(!is.na(bsa20$EUVOTWHO.f)))
### We can convert the output into rounded percentages for better readability.
round(100*
  binconf(sum(bsa20$EUVOTWHO.f=="Remain",na.rm=TRUE),
          sum(!is.na(bsa20$EUVOTWHO.f))),
1)
```
We can adapt the syntax above to make it work with weighted frequencies:
```{r ciprop1}
round(100*
binconf(wtd.table(bsa20$EUVOTWHO.f,weights=bsa20$BSA20_wt_new)$sum.of.weights[2],
sum(wtd.table(bsa20$EUVOTWHO.f,weights=bsa20$BSA20_wt_new)$sum.of.weights)),
1)
```
What are the differences between weighted and unweighted confidence intervals for the proportion of people who voted remain?
Let us now do the same with people's views about government tax and spending.
```{r ciprop2}
w.n<-sum(wtd.table(bsa20$TAXSPEND.f,weights=bsa20$BSA20_wt_new)$sum.of.weights)
ciprop<-cbind(levels(bsa20$TAXSPEND.f),
round(100*
binconf(wtd.table(bsa20$TAXSPEND.f,weights=bsa20$BSA20_wt_new)$sum.of.weights,w.n),
1)
)
ciprop
```
When computing confidence intervals for means, two steps are usually needed, whether embedded in a single line of code or not: first compute the mean (or any other estimate), then the confidence interval itself.
We also use the `round()` function in order to remove unneeded decimal places.
**Question 5.** What proportion think government should increase taxes and spend more on health, education and social benefits?
Several R packages offer functions for computing confidence intervals and standard errors of means.
Here again, we privilege doing things by hand in order to properly understand what is happening in the background.
Under the assumption of simple random sampling, a 95% confidence interval for the mean is defined as the mean plus or minus 1.96 times its standard error.
The standard error of the mean is the standard deviation (that is, the square root of the variance) divided by the square root of the sample size.
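In formula terms, under simple random sampling the 95% confidence interval of a mean $\bar{x}$ is:

$$\bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}$$

where $s$ is the standard deviation of the variable and $n$ the sample size.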
Since we have functions for computing weighted means and variance in R, we can compute:
```{r cimean}
m.p<-wtd.mean(bsa20$PenExp2.r,weights=bsa20$BSA20_wt_new)
se.p<-sqrt(wtd.var(bsa20$PenExp2.r,weights=bsa20$BSA20_wt_new))
n<-sum(bsa20$BSA20_wt_new[!is.na(bsa20$PenExp2.r)])
ci<-c(m.p,m.p-1.96*(se.p/sqrt(n)),m.p+1.96*(se.p/sqrt(n)))
round(ci,1)
```
**Question 6** How much do people think they will get at state pension age?
### Answers
1. There are `r nrow(bsa20)` cases in the dataset.
2. The total number of variables is `r ncol(bsa20)`.
3. `TAXSPEND` records responses to the question of whether the government should reduce, maintain or increase levels of taxation and spending; there are three possible responses.
`EUVOTWHO` records responses to the question 'Did you vote to "remain a member of the EU" or to "leave the EU"?'; the responses are 'Remain' or 'Leave'.
`PenExp2` contains responses to the question 'How much do you think someone who reaches State Pension age today would receive in pounds per week?'; responses are numeric.
4. There are two reasons for the large number of 'Not applicable' responses:
- Routing: the question was only asked of those who said yes to a previous question (EURefV2).
- Split sample: the BSA uses a split sample, and this question was only asked in versions 5 and 6 of the questionnaire.
5. Between `r ciprop[3,3]`% and `r ciprop[3,4]`% of the population say the government should increase taxes and spend more.
6. The estimated average (ie mean) amount people think they will get at State Pension age is around £`r round(ci[1])` per week, with a 95% confidence interval ranging from £`r round(ci[2])` to £`r round(ci[3])`.