/
22_model-data.Rmd
157 lines (129 loc) · 6.67 KB
/
22_model-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
## Data Sources {#sec:model-data}
```{r stopper, eval = FALSE, cache = FALSE, include = FALSE}
knitr::knit_exit()
```
```{r knitr-02-2_model-data, include = FALSE, cache = FALSE}
source(here::here("assets-bookdown", "knitr-helpers.R"))
```
```{r 02-data-packages, cache = FALSE}
library("here")
library("magrittr")
library("tidyverse")
library("scales")
```
<!-- in chapter: model -->
### Response data {#sec:model-y}
```{r poll-data}
polls_raw <- here("data", "polls-clean", "megapoll.RDS") %>%
read_rds() %>%
ungroup() %>%
print()
```
```{r clean-polls}
polls_raw %>%
mutate(
names = map(cleaned_data, names)
) %>%
unnest(names)
polls_raw %>%
mutate(
party = map(cleaned_data, ~ count(.x, party))
) %>%
unnest(party)
polls <- polls_raw %>%
mutate(
slim = map(
.x = cleaned_data,
.f = ~ {
.x %>%
select(domain, item_code)
}
)
) %>%
select(-cleaned_data) %>%
unnest(slim) %>%
distinct() %>%
mutate(
domain = case_when(
domain == "immigratioin" ~ "immigration",
item_code == "econ_env.jobs" ~ "environment",
item_code == "abort_20wk.ban" ~ "abortion",
TRUE ~ domain
),
big_domain = case_when(
domain %in% c("budget", "econ", "health") ~ "econ: economics/budget/healthcare",
domain %in% c("abortion", "gender", "lgbtq") ~ "gender: sex/gender/reproduction",
domain %in% c("race", "law", "guns") ~ "social: race/police/drugs/guns",
domain %in% c("immigration", "defense") ~ "imm: immigration/defense",
domain %in% c("environment") ~ "env: environment"
)
) %>%
print()
items <- polls %>%
select(contains("domain"), item_code) %>%
distinct() %>%
print()
items %>% count(item_code) %>% print(n = nrow(.))
items %>% count(domain)
items %>% count(domain, big_domain)
```
```{r calculate-responses}
Ns <- polls_raw %>%
mutate(
N_respondents = map_dbl(
.x = cleaned_data,
.f = ~ .x %>%
filter(party %in% c(1, 2)) %>%
filter(state_abb %in% c(state.abb)) %>%
count(caseid) %>%
nrow()
),
N_groups = map_dbl(
.x = cleaned_data,
.f = ~ .x %>%
filter(party %in% c(1, 2)) %>%
filter(state_abb %in% c(state.abb)) %>%
count(state_abb, district_num, party) %>%
nrow()
),
N_responses = map_dbl(cleaned_data, nrow)
) %>%
print()
```
```{r calculate-cases}
n_group_items <- polls_raw %>%
unnest(cleaned_data) %>%
filter(party %in% c(1, 2)) %>%
filter(state_abb %in% c(state.abb)) %>%
select(state_abb, district_num, party, item_code) %>%
distinct() %>%
nrow() %>%
print()
```
```{r calculate-domains}
domains <- items %>%
count(big_domain) %>%
split(.$big_domain) %>%
set_names(~ str_split(., pattern = ":", simplify = TRUE)[,1]) %>%
print()
```
Survey response data are drawn from the Cooperative Congressional Election Study (CCES) waves 2012, 2014, 2016, and 2018, and the 2016 wave of the American National Election Study (ANES).
I restrict the data to include only respondents who identify with the Republican and Democratic Parties and who reside in one of the 435 congressional districts.
I combine the responses to policy questions in these surveys into a single dataset, commonly called a "megapoll" [@kastellec-et-al:2019:mrp-primer].
The megapoll contains `r nrow(polls)` policy items across all of the surveys it contains.
Some items in different surveys are either worded identically or similarly enough to be considered as the same item, resulting in `r (n_items <- items %>% count(item_code) %>% nrow())` unique items used for scaling.
There are `r domains$econ$n` items about economics, the federal budget, and health care; `r domains$env$n` items about the environment and climate change; `r domains$gender$n` items on sex/gender equality and reproductive rights; `r domains$imm$n` on immigration and national defense; and `r domains$social$n` items on other social issues such as race, policing, drug laws, and gun rights.
If an item contained a response scale instead of a binary response option, I collapse responses into a binary coding to facilitate estimation with the probit model.
A table describing the text and data sources of all items is included in Appendix \@ref(ch:appendix-model).
Each survey contained responses from between `r min(Ns$N_respondents) %>% comma()` and `r max(Ns$N_respondents) %>% comma()` respondents, totaling `r sum(Ns$N_responses) %>% comma()` individual item responses across all surveys, respondents, and items.
After aggregating respondents into district-party groups, the data contain `r comma(n_group_items)` observations at the group-item level, which averages to approximately `r number(n_group_items / (435 * 2), accuracy = 1)` item responses per group, or `r number(n_group_items / (n_items), accuracy = 1)` group-level observations per item.
### Covariates {#sec:model-covariates}
I draw district-level demographic data from @foster-molina:2016:data, whose study of district demographics and congressional voting includes a publicly released time-series of congressional district demographics.
I use the data from the 2012 election cycle, which is the first general election of the districting cycle for which I estimate ideal points.
I measure the median income, median age, Gini coefficient, and the percent of the district population that was White, had a college degree, was born outside the U.S., and was unemployed.
State covariate data were included from the Correlates of State Policy Project [@jordan-grossman:2017:CSPP].
I include a state measure of the percent of the population that is Evangelical Christian, a measure that has been shown to predict aggregate opinion well [@lax-phillips:2009:mrp; @buttice-hightin:2013:MRP], a predictor that I don't measure at the district level.
I also measure per capita income at the state level and the percent of the state population that is non-White, which are similar to variables I include at the district level but may capture different patterns at different levels of aggregation [e.g. @gelman-et-al:2007:red-state-qjps].
Although researchers commonly worry that correlation among model covariates inflates the variance of coefficient estimates is not of particular concern for this model, because covariates are included to improve the prediction of ideal points, not for inference on the regression coefficients themselves.
Furthermore, prior distributions for the regression parameters serve a similar function as a ridge penalty, regularizing coefficients against noisy or weakly-identified solutions [@bishop:2006:pattern-rec].
To facilitate the specification of weakly regularizing priors across all covariates, I scale all covariates to have mean zero and variance $1$ before estimation.