/
infer_w_survey_design_usingR.rmarkdown
341 lines (220 loc) · 17.3 KB
/
infer_w_survey_design_usingR.rmarkdown
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
---
title: "Survey design-informed inference with British Social Attitudes Survey data using R"
author: "UK Data Service"
date: last-modified
date-format: "MMMM YYYY"
mainfont: "Arial"
title-block-banner: "white"
title-block-banner-color: "#742082"
format:
# docx: default
html:
toc: true
smooth-scroll: true
toc-location: left
css: ukds.css
execute:
warning: false
---
This exercise is part of the ['Introduction to the British Social Attitudes Survey (BSA)'](https://trainingmodules.ukdataservice.ac.uk/attitudes/#/){target="_blank" rel="noopener"} online module.
In this exercise, we will practice statistical inference with data from the [British Social Attitudes Survey (BSA) 2017](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8450){target="_blank" rel="noopener"} using weights and survey design variables.
Please note that at the time of writing this document only some of the BSA editions include survey design variables. For more information about inference from social surveys, including cases where weights and/or survey design variables are not available, please consult [our guidelines](https://ukdataservice.ac.uk/learning-hub/survey-data/){target="_blank" rel="noopener"}.
Answers to the questions asked throughout the exercise can be found at the end of the page.
### Getting started
Data can be downloaded from the [UK Data Service website](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8450){target="_blank" rel="noopener"} following [registration](https://ukdataservice.ac.uk/help/registration/registration-login-faqs/){target="_blank" rel="noopener"}. Download the compressed folder, unzip and save it somewhere accessible on your computer.
The examples below assume that the dataset has been saved in a new folder named *UKDS* on your Desktop (Windows computers). The path would typically be `C:\Users\YOUR_USER_NAME\Desktop\UKDS`. Feel free to change it to the location that best suits your needs
The code below will need to be adjusted in order to match the location of the data on your computer.
We begin by loading the R packages needed for the exercise and set the working directory.
```{r eval=F}
library(dplyr) ### Data manipulation functions
library(haven) ### Functions for importing data from commercial packages
library(Hmisc) ### Extra statistical functions
library(survey) ### Survey design functions
### Setting up the working directory
### Change the setwd() command to match the location of the data on your computer
### if required
setwd("C:\Users\Your_Username_here\")
getwd()
# Opening the BSA dataset in SPSS format
bsa17<-read_spss("data/UKDA-8450-spss/spss/spss25/bsa2017_for_ukda.sav")
```
`
[1] C:\Users\Your_Username_here\
`
```{r files, echo=F,output=F}
library(dplyr) ### Data manipulation functions
library(haven) ### Importing stata/SPSS files
library(Hmisc) ### Extra statistical functions
library(survey) ### Survey design functions
### Setting up the working directory
### Change the setwd() command to match the location of the data on your computer if required
### setwd(`C:\Users\Your_Username_here\`)
setwd("~/Documents/UKDS/data")
getwd()
# Opening the BSA dataset in SPSS format
bsa17<-read_spss("UKDA-8450-spss/spss/spss25/bsa2017_for_ukda.sav")
```
### 1. Identifying the survey design and variables
We first need to find out about the survey design that was used in the BSA 2017, and the design variables available in the dataset. Such information can usually be found in the documentation that comes together with the data under the `mrdoc/pdf` folder or in the data catalogue pages for the data on the [UK Data Service website](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8450#!/documentation){target="_blank" rel="noopener"}.
**Question 1**
What is the design that was used in this survey (i.e. how many sampling stages were there, and what were the units sampled). What were the primary sampling units; the strata (if relevant)?
Now that we are a bit more familiar with the way the survey was designed, we need to try and identify the design variables we can include when producing estimates. The information can usually be found in the data documentation or the data dictionary available in the BSA documentation.
**Question 2**
What survey design variables are available? Are there any that are missing – if so which ones? What is the name of the weights variables?
### 2. Specifying the survey design
We need to tell R about the survey design. In practice this often means specifying the units selected at the initial sampling stage ie the *Primary Sampling Units*, as well as the strata. This is achieved with the `svydesign()` command. In effect this command creates a copy of the dataset with the survey design information attached, that can then subsequently be used for further estimation.
```{r svy}
bsa17.s<-svydesign(ids=~Spoint, strata=~StratID, weights=~WtFactor,data=bsa17)
class(bsa17.s)
summary(bsa17.s) ### Warning: very long output
```
### 3. Mean age and its 95% confidence interval
We can now produce a first set of estimates using this information and compare them with those we would have got without accounting for the survey design. We will compute the average (ie mean) age of respondents in the sample. We will need to use `svymean()`
```{r mean}
svymean(~RAgeE,bsa17.s)
```
By default `svymean()` computes the standard error of the mean. We need to
embed it within `confint()` in order to get a confidence interval.
```{r ci}
confint(svymean(~RAgeE,bsa17.s)) ### Just the confidence interval...
round(
c(
svymean(~RAgeE,bsa17.s),
confint(svymean(~RAgeE,bsa17.s))
),
1) ### ... Or both, rounded
```
*What difference would it make to the estimates and 95% CI to compute respectively, an unweighted mean, as well as a weighted mean without accounting for the survey design?*
There are different ways of computing 'naive estimates' in R. Below we demonstrate how to do it ´by hand' for greater transparency.
Base R provides a function for computing the variance of a variable: `var()`. Since we know that:
- The standard deviation of the mean is the square root of its variance
- The standard error of a sample mean is its standard deviation divided by the square root of the sample size
- A 95% confidence interval is the sample mean respectively minus and plus 1.96 times its standard error.
It is then relatively straightforward to compute unweighted and 'casually weighted' confidences intervals for the mean.
```{r ci_uw}
### Unweighted means and CI
u.m<- mean(bsa17$RAgeE)
u.se<-sqrt(var(bsa17$RAgeE))/sqrt(length(bsa17$RAgeE))
u.ci<-c(u.m - 1.96*u.se,u.m + 1.96*u.se)
round(c(u.m,u.ci),1)
### Weighted means and CI without survey design
w.m<- wtd.mean(bsa17$RAgeE,bsa17$WtFactor)
w.se<-sqrt(wtd.var(bsa17$RAgeE,bsa17$WtFactor))/sqrt(length(bsa17$RAgeE))
w.ci<-c(w.m - 1.96*w.se,w.m + 1.96*w.se)
round(c(w.m,w.ci),1)
```
**Question 3**
What are the consequences of not accounting for the sample design; not using weights and accounting for the sample design when:
- inferring the mean value of the population age?
- inferring the uncertainty of our estimate of the population age?
### 4. Computing a proportion and its 95% confidence interval
We can now similarly estimate the distribution of a categorical variable in the population by computing proportions (or percentages), for instance, the proportion of people who declare themselves interested in politics. This is the `Politics` variable. It has five categories that we are going to recode into 'Significantly' (interested) and 'Not' (significantly), for simplicity. The BSA regards 'don't know' and 'refusal' responses as valid but in this case there is only one 'don't know' and no 'refusal' responses, so we can safely ignore these categories.
```{r polrecode}
attr(bsa17$Politics,"label") ### Phrasing of the question
table(as_factor(bsa17$Politics)) ### Sample distribution
bsa17$Politics.s<-ifelse(bsa17$Politics==1 | bsa17$Politics==2,
"Significantly",NA)
bsa17$Politics.s<-ifelse(bsa17$Politics>=3 & bsa17$Politics<=5,
"Not Interested",bsa17$Politics.s)
bsa17$Politics.s<-as.factor(bsa17$Politics.s)
rbind(table(bsa17$Politics.s),
round(100*prop.table(table(bsa17$Politics.s)),1)
)
```
Changes in a data frame are not automatically transferred into `svydesign` objects used for inferences. We therefore need to recreate it each time we create or recode a variable.
```{r poltab}
rbind(round(wtd.table(bsa17$Politics.s,bsa17$WtFactor)$sum.of.weights,1),
round(100*prop.table(wtd.table(bsa17$Politics.s,bsa17$WtFactor)$sum.of.weights),1)
)
bsa17.s<-svydesign(ids=~Spoint, strata=~StratID, weights=~WtFactor,data=bsa17)
rbind(round(svytable(~Politics.s,bsa17.s),1),
round(100*prop.table(svytable(~Politics.s,bsa17.s)),1)
)
```
As with the mean of age earlier, we can see that the weighted and unweighted point estimates of the proportion of respondents significantly interested in politics differ, even if slightly, and that weighted point estimates do not differ irrespective of the survey design being accounted for.
Let us now examine the confidence intervals of these proportions. Traditional statistical software usually compute these without telling us about the underlying computations going on. By contrast, doing this in R requires more coding, but in the process we gain a better understanding of what is actually estimated.
Confidence intervals for proportion of categorical variables are usually computed as a sequence of binomial/dichotomic estimations -- ie one for each category. In R this needs to be specified explicitly via the `svyciprop()` and `I()` functions. The former actually computes the proportion and its confidence interval (by default 95%), whereas the latter allows us to define the category we are focusing on (in case of non dichotomic variable).
```{r cipoltab}
svyciprop(~I(Politics.s=="Significantly"),bsa17.s)
round(100*
c(prop.table(svytable(~Politics.s,bsa17.s))[2],
attr(svyciprop(~I(Politics.s=="Significantly"),bsa17.s),"ci")),1
)
```
**Question 4**
What is the proportion of respondents aged 17-34 in the sample, as well as its 95% confidence interval? You can use `RAgecat5`
### 5. Domain (ie subpopulation) estimates
Computing estimates for specific groups of a sample (for example the average age of people who reported being interested in politics) is not much more difficult than doing it for the sample as a whole. However doing it as part of an inferential analysis requires some caution. Calculating weighted estimates for a subpopulation, amounts to computing second order estimates ie an estimate for a group whose size needs to be estimated first. Therefore, attempting this while leaving out of the rest of the sample might yield incorrect results. This is why using survey design informed functions is particularly recommended in such cases.
The `survey` package function`svyby()` makes such domain estimation relatively straightforward. For instance, if we would like to compute the mean age of BSA respondents by Government Office Regions, we need to specify:
- The outcome variable whose estimate we want to compute: ie `RAgeE`
- The grouping variable(s) `GOR_ID`
- The estimate function we are going to use here: `svymean`, the same as we used before
- And the type of type of variance estimation we would like to see displayed ie standard errors or confidence interval
```{r bygor}
bsa17$gor.f<-as_factor(bsa17$GOR_ID)
bsa17.s<-svydesign(ids=~Spoint, strata=~StratID, weights=~WtFactor,data=bsa17)
round(svyby(~RAgeE,by=~gor.f,svymean,design=bsa17.s,vartype = "ci")[-1],1)
```
*Note:* we used `[-1]` from the object created by `svyby()` in order to remove a column with alphanumeric values (the region names), so that we could round the results without getting an error.
Our inference seem to suggest that the population in London is among the youngest in the country, and that those in the South West are among the oldest -- their respective 95% confidence intervals do not overlap. We should not feel so confident about differences between London and the South East for example, as the CIs partially overlap.
We can follow a similar approach with proportions: we just need to specify the category of the variable we are interested in as an outcome, for instance respondents who are significantly interested in politics, and replace `svymean` by `svyciprop`.
```{r bygorprop}
round(
100*
svyby(~I(Politics.s=="Significantly"),
by=~gor.f,
svyciprop,
design=bsa17.s,
vartype = "ci")[-1],
1)
```
**Question 5**
What is the 95% confidence interval for the proportion of people interested in politics in the South West? Is the proportion likely to be different in London? In what way? What is the region of the UK for which the precision of the estimates is likely to be the smallest?
When using `svyby()`, we can define domains or subpopulations with several variables, not just one. For example, we could have looked at gender differences in political affiliations by regions. However, as the size of subgroups decrease, so does the precision of the estimates as their confidence interval widens, to a point where their substantive interest is not meaningful anymore.
**Question 6**
Using interest in politics as before, and three category age
`RAgecat5` (which you may want to recode as a factor in order to improve display clarity):
- Produce a table of results showing the proportion of respondents significantly interested in Politics by age group
- Assess whether the age difference in interest for politics is similar for each gender?
- Based on the data, is it fair to say that men aged under 35 tend to be more likely to declare themselves interested in politics than women aged 55 and above?
### Answers
**Question 1**
The 2017 BSA is a three stage stratified random survey, with postcode sectors, adresses and individuals as the units selected at each stage. Primary sampling units were furthermore stratified according to geographies (sub regions), population density, and proportion of owner-occupiers. Sampling rate was proportional to the size of postcode sectors (ie number of addresses)
**Question 2**
From the Data Dictionary it appears that the primary sampling units (sub regions) are identified by```Spoint``` and the strata by```StratID```. The weights variable is```WtFactor```. Addresses are not provided but could be approximated with a household identifier.
**Question 3**
Not using weights would make us overestimate the mean age in the population (of those aged 16+) by about 4 years. This is likely to be due to the fact that older respondents are more likely to take part to surveys. Using survey design variables does not alter the value of the estimated population mean. However, not accounting for them would lead us to overestimate the precision/underestimate the uncertainty of our estimate with a narrower confidence interval -- by about plus and minus 2 months .
**Question 4**
The proportion of 17-25 year old in the sample is `r round(100*as.numeric(svyciprop(~I(RAgecat5 == 1),bsa17.s)[1]),1)` and its 95%confidence interval `r round(100*attr(svyciprop(~I(RAgecat5 == 1),bsa17.s),"ci"),1)`
**Question 5**
The 95% confidence interval for the proportion of people interested in politics in the South West is `r round(100*svyby(~I(Politics.s=="Significantly"),by=~gor.f,svyciprop,design=bsa17.s,vartype = "ci")[-1],1)[9,c(2,3)]`. By contrast, it is likely to be `r round(100*svyby(~I(Politics.s=="Significantly"),by=~gor.f,svyciprop,design=bsa17.s,vartype = "ci")[-1],1)[7,c(2,3)]` in London. The region with the lowest precision of estimates (ie the widest confidence interval) is Wales, with a `r as.numeric(round(100*svyby(~I(Politics.s=="Significantly"),by=~gor.f,svyciprop,design=bsa17.s,vartype = "ci")[-1],1)[10,3]-round(100*svyby(~I(Politics.s=="Significantly"),by=~gor.f,svyciprop,design=bsa17.s,vartype = "ci")[-1],1)[10,2])` percentage point difference between the upper and lower bounds of the confidence interval.
**Question 6**
```{r 6.1}
bsa17$RAgecat5.f<-as_factor(bsa17$RAgecat5)
bsa17$Rsex.f<-as_factor(bsa17$Rsex)
bsa17.s<-svydesign(ids=~Spoint, strata=~StratID, weights=~WtFactor,data=bsa17)
round(
100*
svyby(~I(Politics.s=="Significantly"),
by=~RAgecat5.f+Rsex.f,
svyciprop,
design=bsa17.s,
vartype = "ci")[c(-8,-4),c(-2,-1)],
1)
```
Older respondents both male and female tend to be more involved in politics than younger ones.
The confidence interval for the proportion of men under 35 and women above 55 interested in politics overlaps; it is unlikely that they differ in the population.
<!-- # Find number of obs per type -->
<!-- rcat<-wtd.table(bsa17$RSexAge2,bsa17$WtFactor)$sum.of.weights -->
<!-- n=length(bsa17$RSexAge2) -->
<!-- # Get the proportion -->
<!-- p_hat = rcat[1]/n -->
<!-- alpha=.95 -->
<!-- # Calculate the critical z-score -->
<!-- z = qnorm(1-alpha/2) -->
<!-- # Compute the CI -->
<!-- 100*(p_hat + c(-1,1)*z*sqrt(p_hat*(1-p_hat)/n)) -->
<!-- svyby(~RAgeE,by=~GOR_ID+Rsex,svymean,design=bsa17.s,vartype = "ci") -->
<!-- svyby(~I(Politics.s=="Not"),by=~GOR_ID+Rsex,svyciprop,design=bsa17.s,vartype="ci") -->
<!-- round(100*prop.table(ftable(svytable(~GOR_ID+Rsex+Politics.s,design=bsa17.s)),1),1) -->
<!-- round(100*prop.table(ftable(svytable(~GOR_ID+Rsex+Politics.s,design=bsa17.s)),1),1) -->