
In neuroscience, results are often based on fMRI studies that lack statistical power.  To save costs and effort while preserving sufficient power to detect important effects, we presented a method to predict power for different sample sizes.  While other methods for power calculations in fMRI require the estimation of many parameters that are often difficult to estimate \citep{Mumford2008,Desmond2002}, our method relies only on peak values, for which the computation of $p$-values requires Random Field Theory assumptions.  {\color{Cyan} On the other hand, the absence of a specific hypothesis about location comes with the disadvantage of diminished local specificity: the power analysis is performed for \textbf{the average} activation in the brain (or mask).}

Our results indicate that the method works well when the activated brain regions are reasonably large.  When the activated region is small, the presented method substantially underestimates the power, and applying a restrictive mask improves the estimation.  This can be seen in the simulations, where the bias is larger when only 2\% of the brain is active than when 6\% is active.  In the HCP data, we also see larger bias for the motor tasks, contrasts known for larger effect sizes but very local effects.  This further indicates that bias arises when the activated region is small, irrespective of the effect size.  Applying the method to small activated brain regions yields mostly conservative sample size calculations, i.e. an overestimation of the required sample size.  However, to use the method with confidence, it is best to restrict the search region (i.e. apply a ROI mask {\color{Cyan}based on previous findings or anatomical masks, independent of the results of the pilot study}), advice that applies to fMRI analysis in general.

Furthermore, the sample size of the pilot study should be sufficiently large for the power analyses to work.  In the supplementary materials, we show the results for a smaller sample size ($n=10$), which leads to larger biases for small effect sizes and small activation foci.  {\color{Cyan}Although a sample size of 15 is much larger than that of the pilot data often used to validate an experiment, it is crucial to provide sufficient degrees of freedom, not only for this power analysis, but for any type of power analysis.}

In our validations, we find upward bias when estimating the prevalence of activation, and downward bias when estimating the effect sizes.  A possible source of this bias is the smoothing of the data.  In our simulations, we generate maps with a strict separation between null and non-null voxels.  However, the spatial smoothing of the data averages over null and effect voxels.  This lowers the effect sizes in the activated peaks, especially close to the border of the activated region and for small effect sizes.  \citet{Cheng2015} denoted these voxels as the transition region.  This could be an alternative explanation for why our procedure performs worse for small activated regions, where voxels are always spatially close to the border between null and non-null, and for small effect sizes.
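
The dilution described above can be made explicit in a simplified setting.  The following is only a sketch, and its notation is introduced here for illustration and is not part of our simulation setup: $A$ denotes the set of truly active voxels, $k$ a non-negative smoothing kernel with $\sum_w k(w) = 1$, $\mu(w)$ the unsmoothed mean at voxel $w$, and $\tilde{\mu}(v)$ the mean at voxel $v$ after smoothing.  With a strict separation between null and non-null voxels, $\mu(w) = \mu_1$ for $w \in A$ and $\mu(w) = 0$ otherwise, so that
\[
\tilde{\mu}(v) = \sum_{w} k(v-w)\,\mu(w) = \mu_1 \sum_{w \in A} k(v-w) \leq \mu_1 .
\]
Deep inside a large activated region nearly all of the kernel mass falls within $A$ and $\tilde{\mu}(v) \approx \mu_1$.  For a voxel on a straight border of $A$ and a symmetric kernel, roughly half of the kernel mass falls outside $A$, so that $\tilde{\mu}(v) \approx \mu_1/2$.  If the noise is assumed to be stationary, its standard deviation after smoothing is the same at every voxel, and the standardised effect at $v$ is attenuated by the same factor $\sum_{w \in A} k(v-w)$ relative to the interior of the region.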

We focus on peak-level inference for several reasons.  First, peak-level inference is increasingly used in the fMRI literature.  Often, peak heights are the only reported measure that can be related to a standardised effect size.  Automated paper extraction tools such as NeuroSynth\footnote{\url{http://www.neurosynth.org}} and BrainSpell\footnote{\url{http://www.brainspell.org}} have large databases with peak data, which can in turn be used for meta-analyses.  Our power procedure, while not directly applicable to reported maxima, is a first step towards power analysis using reported effect sizes.  Second, we have shown in previous work that the assumption of a uniform distribution of the $p$-values under the null holds for peak $p$-values, but not for cluster $p$-values \citep{Durnez2014}.  As this assumption is crucial for the procedure presented here, we opt for peak inference rather than cluster inference.  Moreover, problems with localisation and stability have been reported with cluster inference \citep{Roels2014,Woo2014,Eklund2016}.  However, when a user wants to infer power for cluster inference, this procedure on peaks can be used as a lower bound, as the power of cluster inference should generally be higher than that of peak inference \citep{Friston2007}.  Lastly, we did not create a voxelwise power analysis tool, because power analyses for voxelwise inference have already been developed \citep{Hayasaka2007,Mumford2008}.
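
The role of the uniformity assumption mentioned above can be illustrated with a generic two-component mixture.  This is a sketch with notation introduced here for illustration only: $\pi_0$ denotes the proportion of null peaks and $f_1$ the density of the $p$-values of non-null peaks.  If the $p$-values of null peaks are uniform on $[0,1]$, the density of all observed peak $p$-values is
\[
f(p) = \pi_0 + (1-\pi_0)\, f_1(p), \qquad 0 \leq p \leq 1,
\]
so that any departure of the observed distribution from a flat density can be attributed to non-null peaks.  If the null $p$-values are not uniform, as reported for cluster $p$-values \citep{Durnez2014}, the first term is no longer constant and the null and non-null contributions cannot be separated in this way.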
%% However, there is no reason why the presented method would not be applicable to voxels.  RFT FWER inferences for peaks should give similar results as voxelwise inference, as the FWER significance procedure controls the probability that the maximum of the field exceeds the significance threshold.  The maximum voxel will by definition be also the highest peak in the field.  As such, when controlling the FWER, peak or voxel inferences result in the same significance threshold.  This fact makes our power predictions relevant for that domain as well.

% This procedure is only for sample size calculations for mass univariate fMRI analyses.  We did not mention Bayesian analysis as the primary outcome for Bayesian analysis is the estimation of the posterior distribution rather than significance testing.  And while the Bayes factor allows an alternative for significance testing, the controlled error rate is different.  Whereas the Bayes factor optimises the rate between false positives and false negatives, in univariate fMRI analysis the focus is on false positive rate control.  For that same reason, we did not consider predictive machine learning approaches that aim at a maximal prediction accuracy.
{\color{Cyan}In this paper, we focused on power analysis for null hypothesis significance testing (NHST).  However, in the field of neuroimaging, other analysis strategies such as machine learning and Bayesian analysis are increasingly being used for signal localisation.  For those analysis types, the question of power is as relevant as it is for NHST: can we detect what we aim to detect?  Because the quantities that these methods measure and/or optimise (prediction accuracy, the Bayes factor) differ from those in NHST, our method cannot be used for analysis modalities other than NHST.  Nevertheless, for all procedures, increasing the sample size will result in a better separation of the null and alternative hypotheses, but the rate at which this happens depends on the goal of the analysis, whether this is optimising the prediction accuracy, controlling the Bayes factor or controlling the false positive rate.  }

We have evaluated the procedure using simulated data.  The data represent a simplified fMRI experiment, but we still vary a number of parameters, like the effect size and the thresholding procedure, to ensure that the findings are generalizable to a range of different possible fMRI experiments.  In our simulations, we have used a constant effect size of activation over different subjects.  We have not applied subject-specific effect sizes, as we believe this would not alter the average effect size, but would rather inflate the total variance, leading to a smaller normalized effect size.
Thus we have considered varying only the average effect size $\mu_1$ and not, separately, the between-subject variance.
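
This argument can be sketched under a simple random-effects model, which is assumed here for illustration and was not part of our simulations.  Suppose subject $i$ has effect $\mu_{1i}$ with mean $\mu_1 > 0$ and between-subject variance $\tau^2$, observed with independent within-subject noise of variance $\sigma^2$ (the symbols $\mu_{1i}$, $\tau^2$, $\sigma^2$ and $\delta$ are introduced here).  The subject-level estimates then have mean $\mu_1$ and variance $\sigma^2 + \tau^2$, so that the normalized effect size at the group level is
\[
\delta = \frac{\mu_1}{\sqrt{\sigma^2 + \tau^2}} \leq \frac{\mu_1}{\sigma},
\]
with equality only when $\tau^2 = 0$.  For example, $\tau^2 = \sigma^2$ reduces the normalized effect size by a factor $1/\sqrt{2} \approx 0.71$.  In terms of the normalized effect size, adding between-subject variance in this sketch therefore has the same consequence as simulating a constant effect with a smaller $\mu_1$.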

% A note should be made on the influence of the screening threshold $u$, required by our  model.  This procedure will only model peaks above the screening threshold $u$ and can therefore only predict the detection power for all activation above the screening threshold $u$.  As such, the final interpretation of average peak power is also conditional on $u$: the procedure predicts the probability to detect active peaks above $u$ among all active peaks above $u$, $P(Z^u_j > z_\alpha)$. The power predictions are a function of the number of peaks detected by the inference procedure over the number of truly activated peaks. With higher threshold, the number of truly activated peaks will be underestimated, therefore the power may be overestimated. Consequently, the number of participants needed according to the power predictions can be seen as a lower bound for absolute average power $P(Z_j > z_\alpha)$. The higher the screening threshold $u$, the larger the deviance between the conditional power and the absolute power, but a lower screening threshold $u$ violates the random field theory assumptions that we implicitly make.  Our validations with $u=2.3$, which is reasonably low and the standard value in FSL, show good results.  Therefore our recommendation is to set this thresold to $u=2.3$ if the inference strategy is without a screening threshold.  Another consequence of this effect is that the pilot dataset should be reasonably large, to ensure that the peak statistics in activated regions exceeds the screening threshold.  We explain the interaction between pilot sample size and screening threshold in more details in the supplementary materials.

This method is only a first step towards better prediction of the power of fMRI studies.  Many extensions are possible.  First, a testing procedure could be developed that allows the pilot data to be used in the final study without harming the false positive rate (see, e.g., similar ideas in genetics, \citealt{Skol2006}).  Second, the estimated effect size could incorporate other characteristics besides sample size, like intrasubject variance or scan time \citep{Mumford2008}.  These additional parameters would allow the optimization of future studies without the restriction that all characteristics are identical to those of the pilot study.

Although the evaluation of this method was performed on whole-brain analyses, the method can also be applied to only part of the brain when a region of interest is specified.

We have made the procedure available to the community in a toolbox that is publicly available at \url{www.neuropowertools.org}; its code can be found at \url{https://github.com/neuropower/neuropower}.  All code used for the validations and the example in this paper is available online at \url{http://github.com/jokedurnez/neuropower-validation/}.

\subsection*{Acknowledgements}
We would like to thank Dr. Deanna Barch and Dr. Greg Burgess for their kind help in harvesting the HCP data and for their comments.
This work was partially supported by the Laura and John Arnold Foundation.  Jasper Degryse was supported by the Fund for Scientific Research-Flanders (FWO-V).  Joke Durnez has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 706561.
The computational resources (STEVIN Supercomputer Infrastructure) and services used in this work were kindly provided by Ghent University, the Flemish Supercomputer Center (VSC), the Hercules Foundation and the Flemish Government and department EWI.
We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that have contributed to these research results.
The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu
Lastly, we would like to thank the reviewers for their helpful comments.