/
README.Rmd
325 lines (211 loc) · 11 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# relfeas
The goal of relfeas is to allow researchers to use reliability reported in test-retest studies to make assessments of the feasibility of new study designs. This R package accompanies the preprint: [We need to talk about reliability: Making better use of test-retest studies for study design and interpretation](https://www.biorxiv.org/content/early/2018/03/06/274894).
**Note:** The package is still a bit rough around the edges on some functions. If you find anything wrong, ugly, or that I've not included something obviously useful which would follow from anything presented, please don't be shy about letting me know in the Issues or by mail at mathesong[at]gmail.com
## Installation
You can install relfeas from github with:
```{r gh-installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("mathesong/relfeas")
```
### Examples from the paper
Below, I detail the calculations performed in the manuscript, and show the code used from the package. Please refer to the manuscript for the context
```{r, message=F, warning=F}
library(relfeas)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(pwr)
library(knitr)
```
#### Example 1
**Summary:** this example shows how the reliability of a test-retest validation study with low inter-individual variance and low reliability can be approximated for an applied study with higher inter-individual variance. This shows that despite the low reliability estimated in the test-retest study, this measurement demonstrates high reliability for the research question in the applied study.
```{r}
sd2extrapRel(sd=0.32, icc_original = 0.32,
sd_original = 0.10)
```
Even if the measurement error were to double due to the use of partial volume effect correction in the applied study (an extreme assumption), this conclusion would still hold.
```{r}
sd2extrapRel(sd = 0.32,icc_original = 0.32,
sd_original = 0.10, tau = 2)
```
#### Example 2
**Summary:** this example shows how the reliability of measurements has enormous implications for power analysis when planning a study. In this example, both measures show reliability $\geq$ 0.7, but when the reliability of these measures is taken into consideration during power analysis, the minimal required sample size for this research question (more details in the paper) is nearly doubled.
```{r}
r_attenuation(0.8, 0.7)
pwr::pwr.r.test(r=sqrt(0.3), power=0.8)
pwr::pwr.r.test(r=sqrt(0.3)*r_attenuation(0.8, 0.7), power=0.8)
```
#### Example 3
**Summary:** reliability of individual measurements, or showing the robustness of within-individual effects is often, mistakenly, taken as a proxy for good reliability of assessing between-individual differences in within-individual effects ([see the excellent paper by Hedge, Powell and Sumner 2017](https://link.springer.com/article/10.3758/s13428-017-0935-1)). Here we demonstrate this for this particular application.
Characteristics of the sample:
```{r}
# Sample characteristics
meanbp <- 1.91
delta_mean <- -0.12 # Mean percentage change in difference study
delta_sd <- 0.1 # SD of percentage change in difference study
sem <- icc2sem(icc = 0.8, sd = 0.22)
delta_sd_trt <- 0.073 # SD of percentage change in test-retest study
# Characteristics across two measurements
sd = delta_sd*(meanbp)
(delta_icc <- sem2icc(sem*2, sd))
```
Within-individual effects:
```{r}
# Smallest detectable difference: individual
(sdd_indiv <- 100*((sem*1.96*sqrt(2))/meanbp))
# Smallest detectable difference: group
samplesize <- 2
(sdd_group <- 100*(((sem/sqrt(samplesize))*1.96*sqrt(2))/meanbp))
# Within-individual power analysis
(power_n_within <- pwr::pwr.t.test(d= delta_mean/delta_sd_trt , sig.level = 0.05,
type = "paired", alternative = "less",
power=0.8))
```
Between-individual assessment of within-individual changes
```{r}
ss_total <- sumStat_total(n1 = 20, mean1 = abs(delta_mean*meanbp), sd1 = delta_sd*meanbp,
n2 = 20, mean2 = abs(2.5*delta_mean*meanbp), sd2=delta_sd*meanbp)
# Estimation of reliability
(delta_icc_patcntrl <- sem2icc(sem*2, ss_total$sd_total))
(d_true <- ss_total$d)
# Using this estimated reliability for estimation of attenuation
(d_meas <- d_attenuation(rel_total = delta_icc_patcntrl, d = d_true))
# Increase in the number of required participants after taking reliability into account
(pwr_increase_unpaired <- pwr::pwr.t.test(d=d_meas, power = 0.8, alternative = "greater")$n /
pwr::pwr.t.test(d=d_true, power = 0.8, alternative = "greater")$n)
(pwr_increase_paired <- pwr::pwr.t.test(d=d_meas, power = 0.8, type = "paired", alternative = "greater")$n /
pwr::pwr.t.test(d=d_true, power = 0.8, type = "paired", alternative = "greater")$n)
```
#### Example 4
**Summary:** A measurement outcome with high reliability but also high variance (e.g. PBR28 for translocator protein, TSPO) is less well suited for assessing small proportional changes within individuals than a measurement outcome with low variance, even if it has relatively low reliability (e.g. AZ10419369 for frontal cortex serotonin 1B receptors). However, larger proportional within-individual changes are more likely in the former due to the larger variance.
```{r}
tspo_hab_10es <- 10/42
ser1b_10es <- 10/6
(tspo_n <- round(pwr::pwr.t.test(d = tspo_hab_10es, power = 0.8)$n))
(ser1b_n <- round(pwr::pwr.t.test(d = ser1b_10es, power = 0.8)$n))
```
#### Example 5
Variance reduction strategies (see paper) for PBR28 lead to new outcomes with poor reliability (see paper by [Matheson et al., 2017](https://ejnmmires.springeropen.com/articles/10.1186/s13550-017-0304-1)). Very large differences between groups are required before
these new outcome measures begin to be reliable for applied research questions.
One can conceptualise the required Cohen's D in various ways
to get an idea of whether or not such a large effect is or is not reasonable for a particular research question. ()
* **d** Cohen's D
* **u3** Cohen's U3
* **overlap** Distributional overlap
* **cles** Common Language Effect Size
(more details in the paper, or [here](http://rpsychologist.com/d3/cohend/))
```{r}
#########
# Basic #
#########
(sd_basic <- extrapRel2sd(0.7, 0.5))
d_basic <- sdtot2mean2(sd_total = sd_basic, n1 = 20, n2=20, mean1 = 1)
(es_basic <- cohend_convert(d=d_basic$d))
##############
# Acceptable #
##############
(sd_acceptable <- extrapRel2sd(0.8, 0.5))
d_acceptable <- sdtot2mean2(sd_total = sd_acceptable, n1 = 20, n2=20, mean1 = 1)
(es_acceptable <- cohend_convert(d=d_acceptable$d))
############
# Clinical #
############
(sd_clin <- extrapRel2sd(0.9, 0.5))
d_clin <- sdtot2mean2(sd_total = sd_clin, n1 = 20, n2=20, mean1 = 1)
(es_clin <- cohend_convert(d=d_clin$d))
```
We can also plot these effects to get a better idea of how they would look
```{r EffectSizes, fig.width = 10, fig.height = 3}
graphtheme <- theme(axis.text.x=element_blank(),
axis.text.y=element_blank())
d_fig <- grid.arrange(
plot_difference(d = d_basic$d) +
labs(x='Lowest Acceptable Reliability (0.7)', title=NULL, y='Probability') +
graphtheme,
plot_difference(d = d_acceptable$d) +
labs(x='Acceptable Reliability (0.8)', title=NULL, y='Probability') +
graphtheme,
plot_difference(d = d_clin$d) +
labs(x='Clinical Reliability (0.9)', title=NULL, y='Probability') +
graphtheme,
nrow=1)
```
Note that the figure above, as well as the function to generate these figures and these alternative explanation metrics describing effect sizes, are
inspired by the work of Kristoffer Magnusson, a.k.a. [R Psychologist](https://twitter.com/krstoffr), and his [interactive Cohen's D figure](http://rpsychologist.com/d3/cohend/) as well as his description of where
[Cohen was wrong about overlap](http://rpsychologist.com/cohen-d-proportion-overlap).
### Other Examples
Not all the functions are used in the paper. Here a demonstrate a couple of others.
#### Effect Size Comparisons
It is extremely important to consider effect sizes when performing study design. For this purpose, I have included plotting functions such as in example 5 above. I have also included a conversion function for Cohen's D alternatives for converting between them if one feels more intuitive than another.
```{r}
## If one wishes to know the true size of what a particular Cohen's D
## represents (e.g. as a Common Language Effect Size)
cohend_convert(d = 1)
## If one wishes to know what the corresponding Cohen's D would be for
## an application for which the effect size can be more intuitively understood
## as a Common Language Effect Size for example
cohend_convert(cles = 0.8)
```
#### Performing test-retest analysis
I have also included a test-retest analysis function which includes all the usual metrics which are reported in PET studies, as well as a few others which are useful. It also produces a tidy output, so it can be used with tidy data within R pipelines. I will use some data from the *agRee* package.
First we prepare the data into tidy format
```{r}
data("petVT", package = 'agRee')
Region <- map_df(petVT, ~nrow(.)) %>%
gather(ROI, participants) %>%
group_by(ROI) %>%
map2(.x = .$ROI, .y = .$participants,.f = ~ rep(x = .x, times=.y)) %>%
do.call('c', .)
petVT <- do.call(rbind, petVT) %>%
as_tibble() %>%
rename(Measurement1=V1, Measurement2=V2) %>%
mutate(Region = Region) %>%
mutate(Participant = 1:n()) %>%
gather(Measurement, Outcome, -Participant, -Region)
```
Now the data looks as follows:
```{r}
head(petVT)
```
Then we can perform a test-retest analysis for each separate study/outcome/region.
```{r}
calc_trt <- petVT %>%
group_by(Region) %>%
nest() %>%
mutate(outcomes = map(data, ~ trt( data=.x, values='Outcome',
cases = 'Participant', rater = 'Measurement' )))
trt_out <- map_df(calc_trt$outcomes, 'tidy') %>%
mutate(Region = calc_trt$Region) %>%
select(Region, everything())
kable(trt_out, digits = 2)
```
The table lists the following for each sample:
##### Distribution
* __mean__: Mean
* __sd__: Standard deviation
* __cov__: Coefficient of variation
* __skew__: Skew
* __kurtosis__: Kurtosis
##### Reliability
* __icc__: ICC
* __icc_l__: Lower bound of the 95% confidence interval of ICC
* __icc_u__: Upper bound of the 95% confidence interval of ICC
##### Measurement Imprecision
* __wscv__: Within-subject coefficient of variation: relative imprecision
* __sdd__: Smallest detectable difference: absolute imprecision (95% confidence interval)
##### Variation
* __absvar__: Absolute variability: the average absolute percentage change between measurements within individuals
* __signvar__: Signed variability: same as above but signed. This is a check for bias between measurements
* __signvar_sd__: Standard deviation of the signed variance values. This is useful for power analysis for within-subjects designs.