Adding robustica option to ICA decomposition to achieve consistent results #1013
Conversation
I'm really excited to see this @BahmanTahayori. I'm going to start looking at it now. Also, some of your style issues are due to us recently allowing Python 3.12, which has a few more stringencies. Addressing the following might get the style check passing:
tedana/decomposition/ica.py (Outdated)

```python
        rica.S_all, rica.clustering.labels_, rica.signs_, rica.orientation_
    )
except:
```
Just starting to look into this, but I wanted to highlight that one of the style failures is that this `except` is bare, and the style rules we're following require some sort of label like `except AttributeError`.

Looking at this code, it seems like you're running RobustICA with the `DBSCAN` method and, if that doesn't converge, you're running it again with `AgglomerativeClustering`. Perhaps you can use an `if/else` statement to look for clustering more directly, and add an `LGR.warning` to note when this happens.
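A rough sketch of what that restructure could look like, with the RobustICA call abstracted into a `fit_func` callable (all names here are illustrative, not tedana's actual code):

```python
import logging

LGR = logging.getLogger("GENERAL")


def fit_with_fallback(fit_func, methods=("DBSCAN", "AgglomerativeClustering")):
    """Run fit_func(method) for each clustering method until one succeeds.

    fit_func stands in for the RobustICA fit call; a named exception replaces
    the bare `except:` that the style check flags.
    """
    for method in methods:
        try:
            return fit_func(method)
        except ValueError:
            LGR.warning(f"Clustering with {method} failed; trying the next method.")
    raise ValueError("No clustering method converged.")
```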
Relevant: BahmanTahayori#4.
```diff
@@ -0,0 +1,6 @@
"""Setting default values for ICA decomposition."""
```
I'd lean against making this a config file. Anyone who installs tedana using `pip install tedana` won't have easy access to editing or viewing this config file, and it would be overwritten every time they reinstall or someone pulls a new version of the code from GitHub. It would be one thing if tedana looked for a config file in some other default location (definitely worth considering), but I think having it with the code will cause more problems than benefits.

I'd lean towards keeping our current method of setting defaults with input options so it's clearer to the end user what the used options are.
I'm noticing that the config file variables get used in both the argument parser and the variable defaults for `tedana_workflow`. Given we've had a couple of bugs where those diverged, having them defined in one central place might be useful. I don't have a strong opinion on whether a `config.py` or global variables at the top of `tedana.py` make more sense.
If we're concerned about defaults not matching across functions, we can just make all parameters require the parameter name (e.g., `func(*, param1, param2)`) and drop defaults from our internal functions. The only places where there could be a default mismatch at that point are the CLI argument parser and the main workflow function. I think that's more straightforward than adding a config file or using global variables.
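For illustration, a minimal sketch of that keyword-only pattern (the function name is made up):

```python
def fit_ica(*, n_components, seed):
    """The bare `*` makes every argument keyword-only; with no defaults,
    internal callers must pass each value explicitly by name, so there are
    no positional defaults to drift out of sync."""
    return n_components, seed


# fit_ica(10, 42) would raise TypeError: positional arguments are not allowed
components, seed = fit_ica(n_components=10, seed=42)
```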
Without necessarily pushing for one solution or the other code-structure-wise, I would say that if this kind of solution is used, perhaps `defaults.py` would be better than `config.py`?

Also, while removing defaults from function arguments would reduce the problem, it might not remove it completely. Sometimes for a command-line option, you want to be able to state the default behaviour or value within the help string. While it's easy to cross-reference within the `add_argument()` call what's quoted in the help string vs. what's set as the default, having both read from the same symbol is IMO more robust.
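One way that could look, sketched with a hypothetical `DEFAULT_N_ROBUST_RUNS` constant (argparse can also interpolate `%(default)s` into help strings, which achieves the same single-source effect):

```python
import argparse

# In a hypothetical defaults.py: the single source of truth
DEFAULT_N_ROBUST_RUNS = 30

parser = argparse.ArgumentParser()
parser.add_argument(
    "--n_robust_runs",
    type=int,
    default=DEFAULT_N_ROBUST_RUNS,
    # help text and behavior read the same symbol, so they cannot diverge
    help=f"Number of robustica runs (default: {DEFAULT_N_ROBUST_RUNS}).",
)


def tedana_workflow(n_robust_runs=DEFAULT_N_ROBUST_RUNS):
    """Workflow default reads the same constant as the CLI parser."""
    return n_robust_runs
```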
```diff
@@ -27,6 +27,7 @@ dependencies = [
   "nibabel>=2.5.1",
   "nilearn>=0.7",
   "numpy>=1.16",
+  "robustica>=0.1.3",
```
Adding this dependency makes a lot of sense, but I want to highlight to @tsalo & others that this will also install two additional dependencies: scikit-learn-extra & https://tqdm.github.io/

We've had issues with adding modestly supported modules in the past (😱 duecredit 😱), so we'll want to keep an eye on all three of these, particularly in relation to #934.

Also, `tqdm` is a progress bar module. If we're going to require it, we might want to consider whether there are other parts of the code where it could be useful.
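For anyone unfamiliar, wrapping an iterable in `tqdm` is all it takes to get a progress bar; a toy sketch (the loop body just stands in for a slow per-voxel fit):

```python
try:
    from tqdm import tqdm
except ImportError:  # degrade gracefully if tqdm is not installed
    def tqdm(iterable, **kwargs):
        return iterable


def fit_voxels(voxel_data):
    """Loop over voxels with a progress bar; the doubling here is a
    stand-in for a slow curve fit."""
    results = []
    for voxel in tqdm(voxel_data, desc="curvefit", unit="voxel"):
        results.append(voxel * 2)
    return results
```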
It looks like there's a problem with `scikit-learn-extra` in the Python 3.12 tests.
I have an opinion here - I had a branch where I played with putting the progress bar in for `curvefit` (I think) because it is slow and can be confusing when it just sits there. I think it worked nicely, but I got distracted. Once this is pulled in, I'll look into doing that again. I don't think it makes sense for loglin, but it could be added anyway (being forward-looking, maybe it will be slow someday on some data?)
I like that idea. I've used tqdm before without issue (e.g., in NiMARE), so I'm all for adding progress bars. Not sure about applicability to loglin though. Don't we only loop over unique adaptive mask values there, rather than voxels?
Good point - yeah, makes no sense there. I hadn't even checked it. Curve fit can use it, maybe figure creation? Honestly, once it punches through the ICA steps it's hustling, so it may not be needed elsewhere.
tedana/workflows/tedana.py (Outdated)

```
The applied ICA method. If set to fastica the FastICA from sklearn
library will be run once with the seed value. 'robustica' will run
'FastICA' n_robust_runs times and uses clustering methods to overcome
the randomness of the FastICA algorithm. When set to fastica n_robust_runs
will not be effective.
```
Suggested change:

```diff
-The applied ICA method. If set to fastica the FastICA from sklearn
-library will be run once with the seed value. 'robustica' will run
-'FastICA' n_robust_runs times and uses clustering methods to overcome
-the randomness of the FastICA algorithm. When set to fastica n_robust_runs
-will not be effective.
+The applied ICA method. fastica runs FastICA from sklearn
+once with the seed value. 'robustica' will run
+'FastICA' n_robust_runs times and uses clustering methods to overcome
+the randomness of the FastICA algorithm.
+FastICA was the default in tedana version 23 and earlier.
+robustica will be slower
```
tedana/tests/test_integration.py (Outdated)

```diff
@@ -277,6 +278,44 @@ def test_integration_five_echo(skip_integration):
     check_integration_outputs(fn, out_dir)


+def test_integration_robustica_five_echo(skip_integration):
```
The integration tests are the slowest tests. Our goal for testing is for every line of code to be run somewhere in the tests. You've added integration tests for all three datasets. Are they testing different sections of the new robust ICA code, or can we just run an integration test using robustICA on only one dataset? We might also be able to swap one of the 3 datasets so that two are run using fastICA and one is run using robustICA.

The above is particularly true since these integration tests aren't checking whether the outputted values are consistent, just that they fully run. If it would be useful to test the consistency of robust ICA related functions, that might be better done with tests of specific functions rather than through an integration test.
For an integration test or function-specific test, is there a way to make sure DBSCAN doesn't converge, so that the (current) exception path that runs AgglomerativeClustering is exercised instead?
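One hedged sketch of how a function-level test could force that path, using a mock whose first call fails (the `run_robustica` helper here is illustrative, not tedana's actual function):

```python
from unittest import mock


def run_robustica(fit):
    """Illustrative stand-in for the DBSCAN-then-Agglomerative fallback."""
    try:
        return fit("DBSCAN")
    except ValueError:
        return fit("AgglomerativeClustering")


def test_fallback_to_agglomerative():
    # side_effect: first call raises (simulating DBSCAN non-convergence),
    # second call returns a fake mixing matrix
    fake_fit = mock.Mock(side_effect=[ValueError("no clusters"), "mmix"])
    assert run_robustica(fake_fit) == "mmix"
    assert fake_fit.call_args_list[1] == mock.call("AgglomerativeClustering")
```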
tedana/decomposition/ica.py (Outdated)

```python
no_outliers = np.count_nonzero(rica.clustering.labels_ == -1)
if no_outliers:
    LGR.info(
        "The DBSCAN clustering algorithm detected outliers when clustering "
```
Since the code might run AgglomerativeClustering instead of DBSCAN, this message should account for that.
This specific concern would be addressed by the restructure in BahmanTahayori#4.
@BahmanTahayori I've gone through the entire PR and made some comments. I generally like your approach, but most of my feedback is focused on making sure I understand what's happening and the messages to end users are clear.
I already noticed an issue with the 3-echo integration test, where robust ICA outputs fewer than the specified number of components.
The one big thing I'd still want to do before approving is to run this on a bunch of my own datasets to make sure it's stable and giving consistently plausible results.
tedana/decomposition/ica.py (Outdated)

```python
    robust_method="DBSCAN",
)

s, mmix = rica.fit_transform(data)
```
I'm trying to make sure I understand this whole section, and I'm seeing something that looks odd. When I run `test_integration_three_echo`, `n_components`==69 and the resulting `mmix` is 75x69. When I run `test_integration_robustica_three_echo`, `n_components`==69, but `mmix` is 75x37 here and 75x36 by the time `mmix` is returned. The 3-echo dataset we use for integration testing is truncated, so weird things happen, but it's still not great that RobustICA is non-trivially reducing the total number of expected components without a warning or message. Is this expected behavior for Robust ICA, or is something wrong?
Thinking about this more, the robust ICA output is appropriate. The truncated dataset has only 75 volumes. MA-PCA with the AIC criterion is estimating 69 components, but that's too many. I'm not surprised that MA-PCA is failing in this case, because this is a weird truncated dataset. That RobustICA can only find 36 stable ICA components from 75 volumes makes sense.
tedana/workflows/tedana.py (Outdated)

```python
),
default=42,
choices=range(5, 500),
```
The way we've set up the argument parser, it outputs all choices when someone types `tedana --help`. I just noticed that, for this option, it outputs the following. Not sure if there's a way to alter this behavior within the parser.

```
--n_robust_runs {5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499}
The number of times robustica will run. This is only effective when ica_method is set to robustica. (default: 30)
```
Suggested change:

```diff
 choices=range(5, 500),
+metavar="[5-500]",
```

`metavar` can be used to set a displayed range that's different from `choices` (based on https://stackoverflow.com/questions/25295487/python-argparse-value-range-help-message-appearance). This solved the above issue.
While `metavar` addresses the help page issue, it's nevertheless unusual to be using `choices` for an integer input. Would it not suffice to just have an upfront error on too few / a warning on too many, plus corresponding suggestions in the help text?

PS. One of the things I need to do for MRtrix3/mrtrix3#2678 is implement a callable to be provided to `argparse.add_argument`'s `type` kwarg that would permit setting bounds on numeric arguments. So it's something that I've looked into recently. A design like this would work.
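A minimal sketch of that `type`-callable approach (the `bounded_int` factory name is made up; the bounds mirror the current `choices=range(5, 500)`):

```python
import argparse


def bounded_int(lo, hi):
    """Return an argparse `type` callable enforcing lo <= value <= hi."""
    def parse(text):
        value = int(text)
        if not lo <= value <= hi:
            raise argparse.ArgumentTypeError(
                f"value must be between {lo} and {hi}, got {value}"
            )
        return value
    return parse


parser = argparse.ArgumentParser()
parser.add_argument(
    "--n_robust_runs",
    type=bounded_int(5, 500),
    default=30,
    metavar="[5-500]",  # keeps the --help output short
    help="The number of times robustica will run (default: %(default)s).",
)
```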
@BahmanTahayori and I talked yesterday. In addition to going through the above review comments, I wanted to highlight one other part of our discussion to other developers (@tsalo @eurunuela @dowdlelt @n-reddy @javiergcas @goodalse2019 @NaomiGaggi @marco7877 @CesarCaballeroGaudes).

Robust ICA seems to always output fewer components than initially requested. This is reasonable. If you have 100 time points and request 90 components, it should not be able to find 90 stable components. This is both a problem and a possible solution for our current issues with PCA dimensionality estimation. On the problem side, even if we get a perfect dimensionality estimate to input into RobustICA, RobustICA will still output fewer ICA components. On the solution side, I don't really care about how many components are estimated in the PCA step; I care that our ICA components reliably estimate as much of the total structured variance of the data as possible. This means we can potentially give RobustICA an initial number of components that is too large.

I've been testing this out on a few datasets and it's promising, but not yet perfect. For example, I tested it on a dataset with 160 time points, with RobustICA set to run FastICA for 30 iterations. Ideally, the calculated number of stable components should be the same whether we initialize with 65 or 55. That's not happening, but it is reasonably stable. One other thing that's noteworthy is that I'm running this with a block-design flashing checkerboard task, which almost always results in a high-variance component with weights in primary visual cortex and a time series that follows the block-design task. I am NOT seeing that when I initialized with 94 components, and those final results were qualitatively noisier. This might be a sign that if one overestimates the initial number of components too much, the method breaks down.

What might be happening is that the algorithm tries to successfully run FastICA 30 times, and when a run fails to converge it tries again. For 94 components, it failed 20 times (running FastICA a total of 50 times). For 70 components, it failed to converge only 8 times. In addition to being computationally expensive, this means trying to get ICA to converge with way too many components is producing stranger results when it does converge.

I'm going to keep playing with this and encourage others to do so as well. I'm also playing with more iterations of FastICA.

For reference, the manuscript describing this robust ICA approach is https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05043-9
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
* RobustICA: Restructure code loop over robust methods
* Addressing the issue with try/except

Co-authored-by: Bahman <tahayori@gmail.com>
Just wanted to jump in and officially register my excitement for this - I've checked this out and am currently running it on some troublesome data from Neurostars. I think it will be enlightening.

I am also seeing a lot of failures to converge (~550 timepoints, requested 180, ~65% variance explained), and the phrase "exponential back-off" crossed my mind. It's not quite applicable here, but the central idea is that if something fails, you don't just keep trying the same thing (back-off = more waiting). Here, at the risk of adding another layer of complexity (ugh), perhaps an option should be that too many failures (?) to converge get logged, and the entire loop is restarted with N fewer components. Not exponential, but maybe based on the number of timepoints. If someone shows up with 1500 timepoints and starts with 500, it might not be nice to step through 499, 498, etc. But if they have 200 and start with 66, it might be smart to do 65, 64, etc.

In any case, I'm looking forward to seeing where this goes.
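A rough sketch of that restart-with-fewer-components idea, assuming a `fit` callable standing in for the RobustICA call and a step size scaled by the number of timepoints (all of this is a hypothetical heuristic, not tedana code):

```python
import logging

LGR = logging.getLogger("GENERAL")


def robust_ica_with_backoff(fit, n_components, n_timepoints, max_restarts=3):
    """Retry the whole robust-ICA loop with fewer components after repeated
    convergence failures; larger datasets step down in bigger increments."""
    step = max(1, n_timepoints // 100)
    for _ in range(max_restarts + 1):
        try:
            return fit(n_components)
        except ValueError:
            LGR.warning(
                f"Too many convergence failures with {n_components} components; "
                f"retrying with {n_components - step}."
            )
            n_components -= step
    raise ValueError("Robust ICA failed even after reducing components.")
```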
tedana/resources/references.bib (Outdated)

```diff
@@ -313,3 +313,13 @@ @misc{sochat2015ttoz
   url = {https://doi.org/10.5281/zenodo.32508},
   year = 2015
 }

+@Article{Anglada:2022,
```
Should this be `anglada2022`?
Got the following near the end of a run:

```
File "/Users/logan/local_code/tedana/tedana/reporting/html_report.py", line 41, in _cite2html
    first_author = bibliography.entries[key].persons["author"][0]
File "/Users/logan/miniconda3/envs/py310/lib/python3.10/site-packages/pybtex/utils.py", line 163, in __getitem__
    return self._dict[key.lower()]
KeyError: 'anglada2022'
```

And I notice that all other entries in `references.bib` are in that lowercase format with no ':'. Maybe the case is handled, but it seems like the colon is not? Or should line 122 in ica.py have the colon?
Hi @dowdlelt. Yes, that is anglada2022. I am currently working on finalising this PR and all these issues will be resolved soon.
awesome news, looking forward to it!
In this commit, some of the comments from Daniel Handwerker and Robert Smith were incorporated.
* Fixing the problem of the argument parser for n_robust_runs.
* Removing unnecessary tests from test_integration. There are 3 echo tests as before, but the ica_method is robustica for the five- and three-echo tests and fastica for the four-echo test.
Conflicts: tedana/workflows/tedana.py
Given that the `FastICA` method depends on its seed value, in some cases the generated output can be substantially different. To overcome this issue, we added `robustica` (https://github.com/CRG-CNAG/robustica/tree/main/robustica), a Python library based on Icasso, as an option for the ICA decomposition. `robustica` clusters many iterations of `FastICA` to achieve consistent results.

Changes proposed in this pull request:

- `ica_method` is added as an option (robustica or fastica) and the user can select between the two methods. The number of runs when robustica is used can be set by the user through `n_robust_runs`. The default is `robustica` with 30 robust runs. This default value was selected based on our experience using robustica on two different datasets with over 250 subjects' data.
- The DBSCAN clustering algorithm has been implemented in a hard-coded manner.