
Leave-one-out ISC permutations/bootstraps not valid #466

Open
cbaldassano opened this issue Jun 5, 2020 · 18 comments

@cbaldassano

When using permutation_isc and bootstrap_isc with pairwise=False, the permutations are carried out by applying sign flips, group exchanges, or bootstrap resampling to the correlation values of individual subjects with the rest of the group. For example, in the one-sample permutation each subject's LOO value is randomly flipped or not flipped:

```python
sign_flipper = prng.choice([-1, 1],
```

I don't think this is correct. When using LOO ISC, each subject's data contributes to the ISC values of all the other subjects (as part of the "group" average), so flipping the sign of one subject's data would impact every subject's LOO value, not just the LOO value of that specific subject. I believe this problem applies to both the 1-sample and 2-sample permutation and bootstrap tests as they are currently coded.

The Chen2016 reference, as far as I can tell, only describes permutations and bootstraps in the pairwise case, and does not describe a valid way of doing this for the leave-one-out case. I can't think of a way to do this that would be valid, so my proposed solution would be to have these functions throw an error if they are called with LOO ISCs.
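
To make the concern concrete, here's a small sketch (helper names are mine, not BrainIAK's) showing that negating a single subject's time series changes every subject's LOO value, so flipping one LOO correlation in isolation doesn't correspond to any valid operation on the underlying data:

```python
import numpy as np

def loo_isc(data):
    """Leave-one-out ISC: correlate each subject with the mean of the others.
    data: (n_TRs, n_subjects) array; returns n_subjects correlations."""
    n = data.shape[1]
    return np.array([
        np.corrcoef(data[:, s], data[:, np.arange(n) != s].mean(axis=1))[0, 1]
        for s in range(n)
    ])

rng = np.random.default_rng(0)
signal = rng.standard_normal((300, 1))
data = signal + rng.standard_normal((300, 10))  # 10 subjects sharing one signal

flipped = data.copy()
flipped[:, 2] *= -1  # flip the sign of subject 2's time series

print(loo_isc(data))     # original LOO values
print(loo_isc(flipped))  # every value changes, not just subject 2's
```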

Sam, let me know if I am misunderstanding how these functions are working.

@snastase

snastase commented Jun 16, 2020

@cbaldassano thanks for the comment! I'm trying to understand your question here... In the one-sample permutation test, the sign-flipping is applied to the correlation values resulting from the ISC computation (for both the pairwise and leave-one-out approaches). If you have N = 30 subjects, at each permutation you'll get a 30-long vector of -1s and +1s (and for low N, we compute the exact set of all 2**N sign-flipping permutations)—for both approaches.

In the pairwise approach, the ISC computation results in an N x N correlation matrix (N*(N - 1)/2 unique correlation values) and the sign-flipping is applied to the rows/columns of the matrix to achieve a subject-wise (not pair- or element-wise) permutation. In the leave-one-out approach, the ISC computation results in N correlation values and the sign-flipping is applied to these N values for subject-wise permutation. You're correct that in the leave-one-out approach, each of these N correlation values corresponds to a left-out subject's correlation with the average of the N - 1 other subjects.

But there's no flipping applied to the actual time series data feeding into the ISC computation (or to components of the ISC computation like averaging the N - 1 subjects)... So I'm not sure I see in principle why this would cause a problem (but it might!). Do you find that this behaves strangely in practice? (No one has really done the work to see how well these different methods control false positive rates for the leave-one-out approach, as was done for the pairwise approach in Gang Chen's papers.)
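
In pseudo-numpy terms, the subject-wise sign-flipping described above looks roughly like this (a simplified sketch, not the actual BrainIAK code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 30

# One random sign per subject per permutation (for small N, all 2**N sign
# patterns can be enumerated exactly)
sign_flipper = rng.choice([-1, 1], size=n_subjects)

# Leave-one-out: N observed correlations, one sign applied per subject
loo_isc = rng.uniform(0, 0.4, size=n_subjects)          # stand-in for observed ISCs
loo_null = loo_isc * sign_flipper

# Pairwise: N x N correlation matrix; signs are applied to rows and columns,
# so each pair's value gets the product of its two subjects' signs
pairwise_isc = rng.uniform(0, 0.4, size=(n_subjects, n_subjects))
pairwise_isc = (pairwise_isc + pairwise_isc.T) / 2       # stand-in, symmetric
pairwise_null = sign_flipper[:, None] * pairwise_isc * sign_flipper[None, :]
```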

In general, the time-series averaging step of the leave-one-out ISC computation causes issues. This is also why I only apply phase randomization and circular time-shift randomization to the left-out subject—because applying these randomizations prior to averaging will generally yield an average time series that is simply flat or noise (and comparing a left-out subject in each randomization iteration to a flat time series doesn't seem like a fair comparison).
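
Schematically, applying the randomization to the left-out subject only looks something like this (a simplified sketch using circular shifts; not the actual phaseshift_isc/timeshift_isc code):

```python
import numpy as np

def timeshift_loo_null(data, n_perms=1000, seed=0):
    """Null LOO ISCs built by circularly shifting only the left-out subject;
    the average of the remaining N - 1 subjects is left intact.
    data: (n_TRs, n_subjects) array."""
    rng = np.random.default_rng(seed)
    n_trs, n = data.shape
    null = np.empty((n_perms, n))
    for p in range(n_perms):
        for s in range(n):
            others = data[:, np.arange(n) != s].mean(axis=1)
            shifted = np.roll(data[:, s], rng.integers(1, n_trs))
            null[p, s] = np.corrcoef(shifted, others)[0, 1]
    return null
```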

@cbaldassano

> But there's no flipping applied to the actual time series data feeding into the ISC computation (or components of the ISC computation like averaging the N - 1 subjects)

Right, this is my concern. In general for a permutation test (or bootstrapping), we want to do something independently to each of the exchangeable units in the dataset, which in this case are the subjects. Chen2016 talks explicitly about this, saying that we want to flip the signs of the subjects and not the individual pairwise correlation values.

The problem is that in the LOO case, there is no way to flip the sign of say subject #2, since flipping subject #2's sign would impact the LOO values for all subjects. Flipping just subject #2's LOO value means that this result no longer corresponds to a valid dataset - there is no operation we could do to the original data that would result in this set of LOO values. In Gang Chen terms, the subject #2 LOO value is not an exchangeable unit that can be permuted independently of the other LOO values.

So I think this is definitely a problem in principle. To get a sense for whether this is actually a problem in practice, I wrote a little test script that generates permuted datasets by flipping the (zero-mean) timecourses of random subjects. In the pairwise ISC case, this corresponds exactly to the permutation test in brainiak. I ran this on the univariate angular gyrus timecourses from Sherlock (16 subjects).
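
In case the linked script isn't handy, the permutation I'm describing is roughly this (a simplified sketch assuming zero-mean data of shape (n_TRs, n_subjects)):

```python
import numpy as np

def loo_isc(data):
    # data: (n_TRs, n_subjects); one correlation per left-out subject
    n = data.shape[1]
    return np.array([
        np.corrcoef(data[:, s], data[:, np.arange(n) != s].mean(axis=1))[0, 1]
        for s in range(n)
    ])

def flip_subjects_null(data, n_perms=1000, seed=0):
    """Flip the sign of each (zero-mean) subject's entire time series at random,
    recompute the full LOO ISC on the permuted dataset, and keep the mean."""
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    null_means = np.empty(n_perms)
    for p in range(n_perms):
        signs = rng.choice([-1, 1], size=n)
        null_means[p] = loo_isc(data * signs).mean()
    return null_means
```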

Here is the permutation distribution for randomly flipping subjects and then computing ISC ("LOO") and for the current brainiak version in which we flip individual subject LOO values ("LOO Flip1").

[image: null distributions for "LOO" vs "LOO Flip1"]

Code to reproduce is here

These do seem meaningfully different to me, though the biggest difference is in the lower tail, which we usually don't care about.

It's up to you what you think makes the most sense here - one option would be to just add some more explanation in the documentation about what the LOO permutations/bootstraps are doing and the potential caveats, and let users decide on their own (since I don't think there is a better way to do this given just the LOO values).

@mihaic mihaic added this to the v0.11 milestone Jun 29, 2020
@tristansyates

tristansyates commented Jul 16, 2020

Hi, sorry to interrupt, but I wanted to ask about this! I've been analyzing some of my data using the leave-one-out ISC approach and previously determined significance through BrainIAK's bootstrap_isc function. It does look to me like, going by the Gang Chen 2017 paper, the current implementation is not the correct way to do it. However, I wonder if this proposed version might be too conservative.

As an example, this is what the leave-one-out ISC values look like for the lateral occipital cortex of two different populations of subjects (watching a very short, 90-TR video). Flipping random subjects (using the code above) gives a p-value of 0.02 for population 1 and a p-value of 0.09 for population 2. The BrainIAK permutation and bootstrapping functions call them both significant, p < 0.001. Maybe that shouldn't be the case for population 2, but I do think the ISC values for population 1 look pretty reliable. Do you have thoughts on this?

(As an alternative, I tried randomly shifting the time course for a subset of participants, and it seems to give about the same results as the flipping version, though it is more liberal when the number of TRs is increased.)

[image: leave-one-out ISC values for lateral occipital cortex in the two populations]

@cbaldassano

This is actually a great example of why the current implementation is misleading. The leave-one-out values in pop1 look tightly clustered around 0.2, so it seems like it should be obvious that the population mean here is >0. But these values are not independent samples - having a few very good subjects in this population will pull all of these points higher. So running a true permutation test (flipping random subject timecourses) shows that the confidence interval is still relatively wide (though 98% of it is above 0).

@snastase

Hmm, I'm still unsure what to do about this. As far as I know, nobody has done the work to see how well various randomization tests will control false positive rates (FPRs) for the leave-one-out approach (in the style of Gang Chen's paper(s) on the pairwise approach). I'm fine with putting a disclaimer on these functions that they may not control FPRs (and point users to Gang Chen's work), but that disclaimer should apply to basically all of these statistical tests (because they haven't been thoroughly evaluated yet) as well as the subject-wise bootstrap and permutation tests (in certain contexts) recommended by Gang Chen. These are all imperfect solutions. In this sense, I'm not sure the software is wrong per se, but maybe we can do a better job of noting the limitations and preventing people from using it wrongly.

@cbaldassano, in your example, it seems like the permutation-based sign-flipping approach currently implemented in BrainIAK yields a wider—i.e. more conservative—null distribution (orange) than your approach (blue). I would be more inclined to stick with the more conservative approach, as it will yield lower FPRs.

Again, one possible reason the null distribution generated by your approach is asymmetric, clustered at zero, with very few values below zero, is that during the averaging step of the ISC computation the flipped time series collapse to a roughly flat or noise-like average; in most permutations, the left-out subject is then compared to this artificially flat time series, resulting in zero-ish correlation. This is also true of the phase randomization and circular time-shifting approaches if you randomize before the averaging step of the ISC computation. Averaging N - 1 subjects that have been randomized in any way also arguably results in an invalid dataset. This is why in the phaseshift_isc and timeshift_isc functions, I apply the randomization to the left-out subject only, so that at least the comparison is made to a valid average time series.

@tristansyates—I'm surprised that @cbaldassano's proposed method is more conservative in your example. Isn't this inconsistent with the distributions in Chris's example? Just eyeballing those strip plots, I would guess that both populations should yield a significant group effect (mean > 0).

My reading of Gang Chen's work is that the leave-one-out approach (in particular the averaging—i.e. smoothing—operation) is irredeemable in general... We can invent all sorts of ways of randomizing the input time series, but pinning down the statistical implications for these inventions will be difficult. (For example, what is the null hypothesis specified by randomly inverting subject time series? I have no idea.) All of these approaches may introduce oddities into the data that don't quite match the original data. For posterity, Chen has an appendix in the third installment of that series demonstrating that the leave-one-out approach also yields inflated FPRs (unsurprisingly), but I can't glean much from it: https://www.sciencedirect.com/science/article/pii/S1053811919310651#appsec1

@snastase

@cbaldassano @tristansyates if you're interested (or know anyone else who might be), we could put our heads together and systematically evaluate how well these models control FPRs with simulated data using the leave-one-out approach (in the style of Gang Chen's work). Could turn into a nice little paper...

@snastase

Starting to look into this simulation—will share soon.

@tristansyates

Yes, it is inconsistent with the simulation above, which is confusing to me (and I don't see any bugs in my code!). These are the distributions I'm getting for the two different methods with that population 1 data:

[image: null distributions for the two methods on the population 1 data]

Is it possible that the number of TRs is making a difference here? At any rate, I'd be interested in looking into this more systematically!

@snastase

snastase commented Jul 17, 2020

Okay, here's a first pass at getting FPRs for different tests using simulated data: https://github.com/snastase/isc-tests. The notebook should render interactively via Binder (link; takes a while to build). Feel free to fork, contribute, etc. I haven't implemented @cbaldassano's proposed method yet.

I'm seeing FPRs at ~.10–.15 for a nominal significance level of p = .05 across all tests, but I'm only running 100 simulations for the sake of time—we'll need to see what this looks like at 1000 iterations. I'm having a little trouble conceptually linking this to Gang Chen's simulation approach because here we want to simulate time series, whereas they're simulating correlation pairs (with varying rho, where rho is the correlation between two correlation pairs pivoting on a shared subject)... but I'm not sure we need to stick too closely to their framework.

The correlated_data function should be useful for seeing how these null distributions shift as we vary the number of TRs, number of subjects, and correlation level. It might also be worth re-tooling that a bit to inject correlation in a handful of subjects, rather than uniformly across subjects, to see if there are pathologies that emerge when e.g. only two subjects are highly correlated but others are not (specified by the cov variable in multivariate_normal). Later on, maybe we can pull in actual fMRI data for a real-life example, or use resting-state data for evaluating FPRs.
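
As a rough outline of the kind of simulation I mean (my toy re-implementation here; the repo's correlated_data function may differ in details): draw null time series from a multivariate normal with a specified between-subject correlation, compute leave-one-out ISCs, and count how often a test rejects at p < .05:

```python
import numpy as np
from scipy.stats import ttest_1samp

def simulate_subjects(n_trs=300, n_subjects=30, rho=0.0, rng=None):
    """Time series for n_subjects with pairwise correlation rho between all
    subjects (compound-symmetry covariance); rho = 0 gives null data."""
    rng = rng or np.random.default_rng()
    cov = np.full((n_subjects, n_subjects), rho)
    np.fill_diagonal(cov, 1.0)
    return rng.multivariate_normal(np.zeros(n_subjects), cov, size=n_trs)

def loo_isc(data):
    n = data.shape[1]
    return np.array([
        np.corrcoef(data[:, s], data[:, np.arange(n) != s].mean(axis=1))[0, 1]
        for s in range(n)
    ])

# False positive rate for a naive parametric t-test as a baseline;
# swap in any of the nonparametric tests at the marked line
rng = np.random.default_rng(0)
n_sims, alpha = 100, 0.05
rejections = 0
for _ in range(n_sims):
    iscs = loo_isc(simulate_subjects(rho=0.0, rng=rng))
    p_value = ttest_1samp(iscs, 0.0).pvalue  # <- replace with a permutation/bootstrap test
    rejections += p_value < alpha
print(rejections / n_sims)
```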

@snastase

snastase commented Jul 17, 2020

@tristansyates—interesting! I'm curious how phaseshift_isc and timeshift_isc would measure up here as well (you think more similar to Chris's blue distribution?).

@tristansyates

I've never heard of Binder before -- it's super cool! I've forked the repository and added @cbaldassano's proposed method (here). The FPR is 0.05, demonstrating that it is a more conservative test.

The actual fMRI data I've been working with also has some missing time points (we gaze-code our participants and remove times when their eyes are not on the screen). (This is one of the reasons we use the leave-one-out approach: we can nanmean across the group.) Along with simulating how correlations vary by non-uniformity, looking into the impact of missing time points may be something interesting to pursue as well.
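
For context, the nanmean trick is roughly this (a sketch, not my actual analysis code):

```python
import numpy as np

def loo_isc_with_nans(data):
    """data: (n_TRs, n_subjects) with NaNs where a subject's gaze was off-screen.
    Average the other N - 1 subjects with nanmean, then correlate using only
    time points where both the left-out subject and the average are defined."""
    n = data.shape[1]
    iscs = np.empty(n)
    for s in range(n):
        others = np.nanmean(data[:, np.arange(n) != s], axis=1)
        valid = ~np.isnan(data[:, s]) & ~np.isnan(others)
        iscs[s] = np.corrcoef(data[valid, s], others[valid])[0, 1]
    return iscs
```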

@snastase I'll check how phaseshift and timeshift measure up in this sample and let you know! I'm currently thinking it'll be a narrower distribution like the blue one above.

@tristansyates

Oops, yep, something was wrong with my first test. This is the actual distribution, and it mirrors what Chris showed:

[image: corrected null distributions]

@cbaldassano

Thanks for setting up that notebook, Sam! I created a pull request to add the flipping-timeseries test for comparison, and I just now realized that Tristan has a similar fork. Both Tristan's version and mine get FPR = 0.05, while the current brainiak methods yield higher FPRs. I'm not sure how to square this with the plots that Tristan and I were making in our own simulations, though - maybe the relative shapes of these nulls (across methods) are different when there really is vs. is not a real signal in the data?

@mihaic

mihaic commented Aug 24, 2020

You may have noticed I tagged this issue for the upcoming BrainIAK release. Do you think you will converge on a solution in the next few weeks?

@cbaldassano

This will probably take a little more discussion to get a complete fix. I do think that my pull request (consistent with what Sam and Tristan were seeing) demonstrates that there is a problem with the way things are currently implemented in brainiak. I'm not sure whether we want to just say that the current way has some bias and flag this in the documentation, or if we think it is worth removing/deprecating the LOO brainiak functions. The solution I'm proposing would be done outside of the ISC function itself, so I don't know whether it is in scope for the brainiak package. I'll leave it up to Sam about what to do, but my vote would be to have the isc code raise either a warning or an error if someone tries to get p-values from LOO ISC values.

@snastase

snastase commented Sep 4, 2020

Sorry—lost track of this for a few weeks. I agree with @cbaldassano that this will take a bit more work to fully resolve, so I'm not sure we should hold up the release (I don't have a ton of bandwidth to work on it in the next few weeks).

I don't think we should remove or deprecate the nonparametric tests for leave-one-out ISCs—I'm seeing FPRs of .05–.10 across all of these methods (at n_subjects = 30, n_TRs = 300), which is not ideal, but it's also not terrible; all of them are (marginally) better than the baseline parametric t-test (FPR = .11)—see the updated notebook here. However, we need to more systematically explore whether things misbehave at different parameter settings. But I think it's definitely a good idea to add a disclaimer to the documentation noting that these nonparametric tests may not strictly control FPRs at the nominal level.

I also pulled in @cbaldassano's time-series flipping approach, adapted it to the style of the other tests, and tested a version where the sign-flipping is applied only to left-out subjects (to avoid this issue of averaging flipped or randomized time series; this yields FPR = .06). I'm still not entirely sure how to think about this approach. I'm not sure it makes sense to flip the time series around zero (even if it's mean-centered), because I'm not sure BOLD time series have symmetric variance around zero (i.e., peaks and troughs in the time series aren't necessarily similarly shaped or distributed, and don't have the same physiological interpretation); and I don't know what null hypothesis is embodied in this null distribution.
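
Concretely, the left-out-only sign-flipping variant is roughly this (a simplified sketch, not the notebook code):

```python
import numpy as np

def flip_leftout_null(data, n_perms=1000, seed=0):
    """Randomly sign-flip only the left-out subject's (mean-centered) time
    series before correlating it with the intact average of the other N - 1
    subjects. data: (n_TRs, n_subjects)."""
    rng = np.random.default_rng(seed)
    n_trs, n = data.shape
    data = data - data.mean(axis=0)  # center each subject so flipping is around zero
    null = np.empty((n_perms, n))
    for p in range(n_perms):
        signs = rng.choice([-1, 1], size=n)
        for s in range(n):
            others = data[:, np.arange(n) != s].mean(axis=1)
            null[p, s] = np.corrcoef(signs[s] * data[:, s], others)[0, 1]
    return null
```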

@mihaic mihaic removed this from the v0.11 milestone Sep 4, 2020
@cbaldassano

I agree; for now maybe we can just add a documentation line mentioning that the LOO permutation test is not guaranteed to strictly control FPR < alpha.

Applying sign-flipping to only the held-out subject is interesting - shouldn't that give large negative (not 0) correlations if there is a real signal? I guess that doesn't matter for null hypothesis testing, though (where we only care about false positives in the case where there is no signal). Another alternative to sign-flipping, maybe less disruptive, would be circular time-shifts, which have worked well in my lab (at least for relatively long timecourses), but we can continue this discussion in the isc-tests repo.
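
For reference, the circular time-shift null we've used looks roughly like this (a sketch; the minimum shift and other details vary by study):

```python
import numpy as np

def circular_shift_null(data, n_perms=1000, min_shift=15, seed=0):
    """Circularly shift each subject's time series by an independent random
    offset (at least min_shift TRs) to break temporal alignment across
    subjects, then recompute the mean LOO ISC as the null statistic."""
    rng = np.random.default_rng(seed)
    n_trs, n = data.shape
    null_means = np.empty(n_perms)
    for p in range(n_perms):
        shifted = np.column_stack([
            np.roll(data[:, s], rng.integers(min_shift, n_trs - min_shift))
            for s in range(n)
        ])
        loo = [np.corrcoef(shifted[:, s],
                           shifted[:, np.arange(n) != s].mean(axis=1))[0, 1]
               for s in range(n)]
        null_means[p] = np.mean(loo)
    return null_means
```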

@eglerean

eglerean commented Sep 5, 2020

Joining the conversation: circular shifting has also worked well for us, but you need to make sure you shift by enough time points (rule of thumb is one HRF tail, ~30 seconds). I've also used block time shifting, which helps a bit when time series are not that long. First you chunk the time series into blocks of random size (but long enough to preserve autocorrelation, e.g. minimum length ~50 seconds), then you shuffle the blocks. There's a reference for this method; I can link it once I'm not on mobile.
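
Something like this, roughly (a sketch of the block-shuffling idea; the exact block-length rules are in the reference I'll link):

```python
import numpy as np

def block_shuffle(ts, min_block=25, max_block=50, rng=None):
    """Cut a single subject's time series into consecutive blocks of random
    length (long enough to preserve autocorrelation, e.g. ~50 s worth of TRs),
    then shuffle the block order to build a surrogate time series."""
    rng = rng or np.random.default_rng()
    n_trs = len(ts)
    blocks, start = [], 0
    while start < n_trs:
        length = int(rng.integers(min_block, max_block + 1))
        blocks.append(ts[start:start + length])
        start += length
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[i] for i in order])
```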
