Leave-one-out ISC permutations/bootstraps not valid #466
@cbaldassano thanks for the comment! I'm trying to understand your question here... In the one-sample permutation test, the sign-flipping is applied to the correlation values resulting from the ISC computation (for both the pairwise and leave-one-out approaches). In general, the time-series averaging step of the leave-one-out ISC computation causes issues. This is also why I only apply phase randomization and circular time-shift randomization to the left-out subject—because applying these randomizations prior to averaging will generally yield an average time series that is simply flat or noise (and comparing a left-out subject in each randomization iteration to a flat time series doesn't seem like a fair comparison).
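For reference, the leave-one-out ISC computation under discussion can be sketched as follows. This is a minimal illustration, not the BrainIAK implementation: `leave_one_out_isc` is a hypothetical name, and data is assumed to be shaped (time points, subjects) for a single voxel/ROI.

```python
import numpy as np

def leave_one_out_isc(data):
    """Leave-one-out ISC: correlate each subject's time series with the
    average of the remaining N-1 subjects.

    data: array of shape (n_timepoints, n_subjects) for one voxel/ROI.
    Returns an array of n_subjects correlation values.
    """
    n_trs, n_subjects = data.shape
    iscs = np.zeros(n_subjects)
    for s in range(n_subjects):
        left_out = data[:, s]
        # Average the remaining N-1 subjects; this smoothing step is
        # what complicates subject-level randomization
        others = np.delete(data, s, axis=1).mean(axis=1)
        iscs[s] = np.corrcoef(left_out, others)[0, 1]
    return iscs
```

The averaging step is the crux: every subject's time series enters every other subject's "group average," which is why the resulting LOO values are not independent.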
Right, this is my concern. In general for a permutation test (or bootstrapping), we want to do something independently to each of the exchangeable units in the dataset, which in this case are the subjects. Chen2016 talks explicitly about this, saying that we want to flip the signs of the subjects and not the individual pairwise correlation values. The problem is that in the LOO case, there is no way to flip the sign of, say, subject #2, since flipping subject #2's sign would impact the LOO values for all subjects. Flipping just subject #2's LOO value means that this result no longer corresponds to a valid dataset - there is no operation we could do to the original data that would result in this set of LOO values. In Gang Chen terms, the subject #2 LOO value is not an exchangeable unit that can be permuted independently of the other LOO values. So I think this is definitely a problem in principle. To get a sense for whether this is actually a problem in practice, I wrote a little test script that generates permuted datasets by flipping the (zero-mean) timecourses of random subjects. In the pairwise ISC case, this corresponds exactly to the permutation test in brainiak. I ran this on the univariate angular gyrus timecourses from Sherlock (16 subjects). Here is the permutation distribution for randomly flipping subjects and then computing ISC ("LOO") and for the current brainiak version in which we flip individual subject LOO values ("LOO Flip1"). These do seem meaningfully different to me, though the biggest difference is on the lower tail which we usually don't care about. It's up to you what you think makes the most sense here - one option would be to just add some more explanation in the documentation about what the LOO permutations/bootstraps are doing and the potential caveats, and let users decide on their own (since I don't think there is a better way to do this given just the LOO values).
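The subject-level flipping test described above might be sketched like this. It is illustrative only: `flip_subject_permutation_test` is a hypothetical name (not the script referenced above), and data is assumed to be shaped (time points, subjects).

```python
import numpy as np

def flip_subject_permutation_test(data, n_perms=1000, seed=0):
    """Build a null distribution for the mean LOO ISC by randomly
    flipping the sign of whole (zero-mean) subject timecourses BEFORE
    computing leave-one-out ISC, so each exchangeable unit (the
    subject) is permuted independently."""
    rng = np.random.default_rng(seed)
    data = data - data.mean(axis=0)  # zero-mean so sign-flipping is valid
    n_subjects = data.shape[1]

    def mean_loo_isc(d):
        iscs = []
        for s in range(d.shape[1]):
            others = np.delete(d, s, axis=1).mean(axis=1)
            iscs.append(np.corrcoef(d[:, s], others)[0, 1])
        return np.mean(iscs)

    observed = mean_loo_isc(data)
    null = np.empty(n_perms)
    for i in range(n_perms):
        # Flip entire subject timecourses, then recompute all LOO ISCs
        flips = rng.choice([-1.0, 1.0], size=n_subjects)
        null[i] = mean_loo_isc(data * flips)
    # One-tailed p-value (mean ISC > 0), counting the observed value
    p = (np.sum(null >= observed) + 1) / (n_perms + 1)
    return observed, null, p
```

Because each flip changes the group average that every left-out subject is compared against, every permutation corresponds to an actual (sign-flipped) dataset, unlike flipping the LOO values themselves.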
Hi, sorry to interrupt, but I wanted to ask about this! I've been analyzing some of my data using the leave-one-out ISC approach and previously determined significance through BrainIAK's bootstrap_isc function. It does look to me like the Gang Chen 2017 paper doesn't consider the current implementation the correct way to do it. However, I wonder if this proposed version might be too conservative. As an example: this is what the leave-one-out ISC values look like for the lateral occipital cortex of two different populations of subjects (watching a very short, 90-TR video). Flipping random subjects (using the code above) gives a p-value of 0.02 for population 1 and a p-value of 0.09 for population 2. The BrainIAK permutation and bootstrapping functions put them both as significant, p < 0.001. Maybe that shouldn't be the case for population 2, but I do think the ISC values for population 1 look pretty reliable. Do you have thoughts on this? (As an alternative, I tried randomly shifting the time course for a subset of participants; it seems to give about the same results as the flipping version, though it is more liberal when the number of TRs is increased.)
This is actually a great example of why the current implementation is misleading. The leave-one-out values in pop1 look tightly clustered around 0.2, so it seems like it should be obvious that the population mean here is >0. But these values are not independent samples - having a few very good subjects in this population will pull all of these points higher. So running a true permutation test (flipping random subject timecourses) shows that the confidence interval is still relatively wide (though 98% of it is above 0).
Hmm, I'm still unsure what to do about this. As far as I know, nobody has done the work to see how well various randomization tests will control false positive rates (FPRs) for the leave-one-out approach (in the style of Gang Chen's paper(s) on the pairwise approach). I'm fine with putting a disclaimer on these functions that they may not control FPRs (and point users to Gang Chen's work), but that disclaimer should apply to basically all of these statistical tests (because they haven't been thoroughly evaluated yet) as well as the subject-wise bootstrap and permutation tests (in certain contexts) recommended by Gang Chen. These are all imperfect solutions. In this sense, I'm not sure the software is wrong per se, but maybe we can do a better job of noting the limitations and preventing people from using it wrongly. @cbaldassano, in your example, it seems like the permutation-based sign-flipping approach currently implemented in BrainIAK yields a wider—i.e. more conservative—null distribution (orange) than your approach (blue). I would be more inclined to stick with the more conservative approach, as it will yield lower FPRs. Again, one possible reason the null distribution generated by your approach is asymmetric, clustered at zero, and with very few values below zero is because during the averaging step of the ISC computation, the flipped time series will collapse to a roughly flat or noise time series; in most permutations, the left-out subject will be compared to this artificially flat time series, resulting in zero-ish correlation. This is also true of the phase randomization and circular time-shifting approaches if you randomize before the averaging step of the ISC computation. Averaging N–1 subjects that have been randomized in any way also arguably results in an invalid dataset. This is why, in the current implementation, these randomizations are applied only to the left-out subject. @tristansyates—I'm surprised that @cbaldassano's proposed method is more conservative in your example.
Isn't this inconsistent with the distributions in Chris's example? Just eyeballing those strip plots, I would guess that both populations should yield a significant group effect (mean > 0). My reading of Gang Chen's work is that the leave-one-out approach (in particular the averaging—i.e. smoothing—operation) is irredeemable in general... We can invent all sorts of ways of randomizing the input time series, but pinning down the statistical implications for these inventions will be difficult. (For example, what is the null hypothesis specified by randomly inverting subject time series? I have no idea.) All of these approaches may introduce oddities into the data that don't quite match the original data. For posterity, Chen has an appendix in the third installment of that series demonstrating that the leave-one-out approach also yields inflated FPRs (unsurprisingly), but I can't glean much from it: https://www.sciencedirect.com/science/article/pii/S1053811919310651#appsec1
@cbaldassano @tristansyates if you're interested (or know anyone else who might be), we could put our heads together and systematically evaluate how well these models control FPRs with simulated data using the leave-one-out approach (in the style of Gang Chen's work). Could turn into a nice little paper... |
Starting to look into this simulation—will share soon. |
Yes, it is inconsistent with the simulation above, which is confusing to me (and I don't see any bugs in my code!). These are the distributions I'm getting for the two different methods with that population 1 data: Is it possible that the number of TRs is making a difference here? At any rate, I'd be interested in looking into this more systematically!
Okay here's a first pass at getting FPRs for different tests using simulated data: https://github.com/snastase/isc-tests. The notebook should render interactively via Binder (link; takes a while to build). Feel free to fork, contribute, etc. I haven't implemented @cbaldassano's proposed method yet. I'm seeing FPRs at ~.10–.15 for a nominal significance level of p = .05 across all tests, but I'm only running 100 simulations for the sake of time—we'll need to see what this looks like at 1000 iterations. I'm having a little trouble conceptually linking this to Gang Chen's simulation approach because here we want to simulate time series, whereas they're simulating correlation pairs (with varying …)
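The core of this kind of FPR simulation might look like the sketch below: generate pure-noise data (no shared signal), run the test on the LOO ISC values, and count how often it rejects at the nominal alpha. This is a hypothetical sketch of the style of simulation in the isc-tests repo, not its actual code; `simulate_fpr` and all parameters are made-up names.

```python
import numpy as np

def simulate_fpr(n_sims=1000, n_subjects=16, n_trs=300,
                 alpha=0.05, n_perms=1000, seed=0):
    """Estimate the false positive rate of sign-flipping LOO ISC
    *values* (the approach under debate) under null data with no
    shared signal across subjects."""
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(n_sims):
        # Null data: independent noise, so true mean ISC is zero
        data = rng.standard_normal((n_trs, n_subjects))
        iscs = np.array([
            np.corrcoef(data[:, s],
                        np.delete(data, s, axis=1).mean(axis=1))[0, 1]
            for s in range(n_subjects)])
        observed = iscs.mean()
        # Sign-flip the LOO values themselves, per subject
        flips = rng.choice([-1.0, 1.0], size=(n_perms, n_subjects))
        null = (flips * iscs).mean(axis=1)
        p = (np.sum(null >= observed) + 1) / (n_perms + 1)
        false_positives += p < alpha
    return false_positives / n_sims
```

A well-calibrated test should return an FPR close to alpha; values meaningfully above it indicate the inflation being discussed.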
@tristansyates—interesting! I'm curious how …
I've never heard of Binder before -- it's super cool! I've forked the directory and added @cbaldassano's proposed method (here). The FPR is 0.05, demonstrating that it is a more conservative test. The actual fMRI data I've been working with also has some missing time points (we gaze-code our participants and remove times when their eyes are not on the screen). (This is one of the reasons why we use the leave-one-out approach—because we can nanmean the group.) Along with simulating how correlations vary by non-uniformity, looking into the impact of missing time points may be something interesting to pursue as well. @snastase I'll check how phaseshift and timeshift measure up in the simulation and let you know! Currently thinking that it'll be a narrower distribution like the blue one above
Thanks for setting up that notebook Sam! I created a pull request to add the flipping-timeseries test for comparison, and I just now realized that Tristan has a similar fork. Both Tristan's and mine are getting FPR=0.05 while the current brainiak methods are yielding higher FPRs. I'm not sure how to square this with the plots that Tristan and I were making in our own simulations though - maybe the relative shapes of these nulls (across methods) are different when there really is vs. is not a real signal in the data? |
You may have noticed I tagged this issue for the upcoming BrainIAK release. Do you think you will converge on a solution in the next few weeks? |
This will probably take a little more discussion to get a complete fix. I do think that my pull request (consistent with what Sam and Tristan were seeing) demonstrates that there is a problem with the way things are currently implemented in brainiak. I'm not sure whether we want to just say that the current way has some bias and flag this in the documentation, or if we think it is worth removing/deprecating the LOO brainiak functions. The solution I'm proposing would be done outside of the ISC function itself, so I don't know whether it is in scope to be part of the brainiak package itself. I'll leave it up to Sam about what to do, but my vote would be to have the isc code raise either a warning or an error if someone tries to get p-values from LOO ISC values.
Sorry—lost track of this for a few weeks. I agree with @cbaldassano that this will take a bit more work to fully resolve, so I'm not sure we should hold up the release (I don't have a ton of bandwidth to do it in the next weeks). I don't think we should remove or deprecate the nonparametric tests for leave-one-out ISCs—I'm seeing FPRs .05–.10 across all of these methods (at …)
I agree, for now maybe we can just add a documentation line mentioning that the LOO permutation test is not guaranteed to strictly control FPR < alpha. Applying sign-flipping to only the held-out subject is interesting - shouldn't that give large negative (not 0) correlations if there is a real signal? I guess that doesn't matter for null hypothesis testing, though (where we only care about false positives in the case where there is no signal). Another alternative to sign-flipping that is maybe less disruptive could be doing circular time-shifts, which has worked well in my lab (at least for relatively long timecourses), but we can continue this discussion in the isc-tests repo.
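A circular time-shift null along these lines might be sketched as follows (illustrative only; `circular_shift_null` is a hypothetical name, data assumed shaped time × subjects). Rotating the left-out subject's timecourse preserves its autocorrelation while breaking its temporal alignment with the group average.

```python
import numpy as np

def circular_shift_null(data, n_shifts=1000, seed=0):
    """Null distribution for the mean LOO ISC: on each iteration,
    circularly rotate each left-out subject's timecourse by a random
    offset before correlating it with the average of the others."""
    rng = np.random.default_rng(seed)
    n_trs, n_subjects = data.shape
    null = np.empty(n_shifts)
    for i in range(n_shifts):
        iscs = []
        for s in range(n_subjects):
            shift = int(rng.integers(1, n_trs))
            shifted = np.roll(data[:, s], shift)  # wrap-around rotation
            others = np.delete(data, s, axis=1).mean(axis=1)
            iscs.append(np.corrcoef(shifted, others)[0, 1])
        null[i] = np.mean(iscs)
    return null
```

Note that with slow autocorrelated signals, shifts should be large enough that the rotated series is genuinely misaligned (hence the rule of thumb about shifting by more than an HRF tail mentioned below).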
Joining the conversation: circular shifting has also worked well for us, but you need to make sure you shift by enough time points (rule of thumb: one HRF tail, ~30 seconds). I've also used block time-shifting, which helps a bit when time series are not that long. First you chunk the time series into blocks of random size (but long enough to preserve autocorrelation, e.g. min length ~50 seconds), then you shuffle the blocks. There's a reference for this method; I can link it once I'm not on mobile.
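The block-shuffling idea above might look something like this sketch (hypothetical helper; block lengths in samples rather than seconds, with `min_block` standing in for the ~50-second minimum):

```python
import numpy as np

def block_shuffle(ts, min_block=50, seed=0):
    """Shuffle a timecourse in contiguous blocks of random length
    (each at least min_block samples) to preserve local autocorrelation
    while destroying temporal alignment across subjects."""
    rng = np.random.default_rng(seed)
    ts = np.asarray(ts)
    n = len(ts)
    # Pick random cut points so every block is at least min_block long
    cuts = [0]
    while n - cuts[-1] >= 2 * min_block:
        cuts.append(cuts[-1] +
                    int(rng.integers(min_block, n - cuts[-1] - min_block + 1)))
    cuts.append(n)
    blocks = [ts[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]
    # Shuffle block order and reassemble
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[i] for i in order])
```

Each shuffled series contains exactly the original samples, just with blocks reordered, so within-block autocorrelation structure survives the randomization.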
When using permutation_isc and bootstrap_isc with pairwise=False, the permutations are carried out by applying signflips, group exchanges, or bootstrap resampling to the correlation values of individual subjects with the rest of the group. For example, in the one-sample permutation each subject's LOO value is randomly flipped or not flipped:
brainiak/brainiak/isc.py
Line 938 in 69a0c93
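Schematically, this value-level sign-flipping amounts to something like the following sketch (not BrainIAK's actual code at that line; `signflip_loo_values` is a made-up name for illustration):

```python
import numpy as np

def signflip_loo_values(iscs, n_perms=1000, seed=0):
    """One-sample permutation test that independently flips the sign of
    each subject's LOO ISC *value* (not their underlying timecourse),
    then compares the observed mean against the resulting null."""
    rng = np.random.default_rng(seed)
    iscs = np.asarray(iscs)
    observed = iscs.mean()
    # Random +/-1 flips applied directly to the correlation values
    flips = rng.choice([-1.0, 1.0], size=(n_perms, len(iscs)))
    null = (flips * iscs).mean(axis=1)
    # Two-tailed p-value, counting the observed value itself
    p = (np.sum(np.abs(null) >= np.abs(observed)) + 1) / (n_perms + 1)
    return observed, null, p
```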
I don't think this approach is valid. When using LOO ISC, each subject's data is used to compute every other subject's ISC value (as part of the "group" average), so flipping the sign of a subject's data would impact the LOO values for all subjects, not just the LOO value of that specific subject. I believe this problem holds for both the one-sample and two-sample permutation and bootstrap tests as they are currently coded.
The Chen2016 reference, as far as I can tell, only describes permutations and bootstraps in the pairwise case, and does not describe a valid way of doing this for the leave-one-out case. I can't think of a way to do this that would be valid, so my proposed solution would be to have these functions throw an error if they are called with LOO ISCs.
Sam, let me know if I am misunderstanding how these functions are working.