Adding robustica option to ICA decomposition to achieve consistent results #1013

Draft · wants to merge 20 commits into main
Conversation

BahmanTahayori

Given that the FastICA method depends on its seed value, the generated output can in some cases be substantially different between runs. To overcome this issue, we added robustica (https://github.com/CRG-CNAG/robustica/tree/main/robustica), a Python library based on Icasso, as an option for the ICA decomposition. robustica clusters the components from many FastICA iterations to achieve consistent results.
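For illustration, here is a minimal sketch of the robustica usage this PR builds on (shapes and values are placeholders; parameter names follow the robustica documentation):

```python
import numpy as np
from robustica import RobustICA

# Placeholder data: 1000 samples (e.g., voxels) x 75 observations (e.g., volumes)
data = np.random.randn(1000, 75)

# Run FastICA 30 times and cluster the resulting components with DBSCAN,
# keeping only components that are stable across runs.
rica = RobustICA(
    n_components=50,         # requested dimensionality (illustrative)
    robust_runs=30,          # number of FastICA iterations to cluster
    robust_method="DBSCAN",  # clustering algorithm
)
s, mmix = rica.fit_transform(data)  # stable sources and mixing matrix
```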

Changes proposed in this pull request:

  • ica_method is added as an option (robustica or fastica) so the user can select between the two methods. When robustica is used, the number of runs can be set through n_robust_runs. The default is robustica with 30 robust runs; this default was selected based on our experience using robustica on two different datasets with over 250 subjects' data.

  • The DBSCAN clustering algorithm is currently hard-coded.

@handwerkerd
Member

I'm really excited to see this @BahmanTahayori. I'm going to start looking at it now.
I asked on https://mattermost.brainhack.org/brainhack/channels/tedana and no one was definitely planning to attend our tedana dev call this Thursday/Friday so I cancelled that meeting. If you were getting this ready to discuss at that meeting, I'll either recreate the meeting, even if it's just the two of us, or we can schedule a meeting at another time that's slightly more convenient in Australia.

Also, some of your style issues are due to us recently allowing Python 3.12, which has a few more stringent checks. Addressing the following might get the style check passing:

tedana/decomposition/ica.py:143:5: E722 do not use bare 'except'
tedana/decomposition/pca.py:397:71: E226 missing whitespace around arithmetic operator
tedana/decomposition/pca.py:397:88: E231 missing whitespace after ','
tedana/selection/selection_nodes.py:324:38: E231 missing whitespace after ','

rica.S_all, rica.clustering.labels_, rica.signs_, rica.orientation_
)

except:
Member

Just starting to look into this, but I wanted to highlight that one of the style failures is that this except is bare, and the style rules we're following require some sort of label, like except AttributeError.

Looking at this code, it seems like you're running RobustICA with the DBSCAN method and, if that doesn't converge, you're running it again with AgglomerativeClustering. Perhaps you can use an if/else statement to check more directly whether clustering succeeded, and add an LGR.warning to note when the fallback happens, as in the sketch below.
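Something along these lines, as a sketch only (a labeled except is shown here; an explicit if/else over robust_method, as in BahmanTahayori#4, would also work — the helper name and the exact exception are assumptions):

```python
import logging

from robustica import RobustICA

LGR = logging.getLogger("GENERAL")


def robust_ica_with_fallback(data, n_components, n_robust_runs=30):
    """Hypothetical helper: try DBSCAN first, fall back with a warning."""
    try:
        rica = RobustICA(
            n_components=n_components,
            robust_runs=n_robust_runs,
            robust_method="DBSCAN",
        )
        s, mmix = rica.fit_transform(data)
    except AttributeError:
        # DBSCAN did not yield usable clusters; fall back and tell the user.
        LGR.warning(
            "DBSCAN clustering of the FastICA runs failed; "
            "falling back to AgglomerativeClustering."
        )
        rica = RobustICA(
            n_components=n_components,
            robust_runs=n_robust_runs,
            robust_method="AgglomerativeClustering",
        )
        s, mmix = rica.fit_transform(data)
    return s, mmix
```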


Relevant: BahmanTahayori#4.

@@ -0,0 +1,6 @@
"""Setting default values for ICA decomposition."""
Member

I'd lean against making this a config file. Anyone who installs tedana using pip install tedana won't have easy access to editing or viewing this config file, and it would be overwritten every time they reinstall or pull a new version of the code from GitHub. It would be one thing if tedana looked for a config file in some other default location (definitely worth considering), but I think having it with the code will cause more problems than benefits.

I'd lean towards keeping our current method of setting defaults with input options so it's clearer to the end user what the used options are.

Member

I'm noticing that the config file variables get used in both the argument parser and the variable defaults for tedana_workflow. Given we've had a couple of bugs where those diverged, having them defined in one central place might be useful. I don't have a strong opinion on whether a config.py or global variables at the top of tedana.py makes more sense.

Member

If we're concerned about defaults not matching across functions, we can just make all parameters require the parameter name (e.g., func(*, param1, param2)) and drop defaults from our internal functions, as sketched below. The only places where there could be a default mismatch at that point are the CLI argument parser and the main workflow function. I think that's more straightforward than adding a config file or using global variables.
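A generic illustration (not tedana code):

```python
# Internal function: every parameter is keyword-only and has no default,
# so defaults can only live in the CLI parser and the main workflow.
def run_ica(*, ica_method, n_robust_runs, seed):
    print(ica_method, n_robust_runs, seed)


# Callers must be explicit; a silent default mismatch can no longer creep in.
run_ica(ica_method="robustica", n_robust_runs=30, seed=42)
```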


Without necessarily pushing for one solution or the other code-structure-wise, I would say that if this kind of solution is used, perhaps defaults.py would be better than config.py?

Also, while removing defaults from function arguments would reduce the problem, it might not remove it completely. Sometimes for a command-line option, you want to be able to state what the default behaviour or value will be within the help string. While it's easy to cross-reference within the add_argument() call what's quoted in the help string vs. what's set as the default, having both read from the same symbol is IMO more robust.
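For instance (a sketch; DEFAULT_N_ROBUST_RUNS and the module layout are hypothetical):

```python
import argparse

# defaults.py (hypothetical): one central symbol for each default
DEFAULT_N_ROBUST_RUNS = 30

parser = argparse.ArgumentParser()
parser.add_argument(
    "--n_robust_runs",
    type=int,
    default=DEFAULT_N_ROBUST_RUNS,
    # The help string and the default read from the same symbol,
    # so they cannot diverge.
    help=f"Number of robustica runs (default: {DEFAULT_N_ROBUST_RUNS}).",
)
```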

@@ -27,6 +27,7 @@ dependencies = [
"nibabel>=2.5.1",
"nilearn>=0.7",
"numpy>=1.16",
"robustica>=0.1.3",
Member

Adding this dependency makes a lot of sense, but I want to highlight to @tsalo & others that this will also install two additional dependencies: scikit-learn-extra and tqdm (https://tqdm.github.io/).
We've had issues with adding modestly supported modules in the past (😱 duecredit 😱), so we'll want to keep an eye on all three of these, particularly in relation to #934.

Also, tqdm is a progress bar module. If we're going to require it, we might want to think about whether there are other parts of the code where it might be useful, as in the sketch below.
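For example, a slow voxelwise loop could be wrapped like this (illustrative only; fit_voxel is a hypothetical stand-in):

```python
from tqdm import tqdm

n_voxels = 10000  # placeholder


def fit_voxel(idx):
    """Hypothetical per-voxel fit."""
    pass


# tqdm wraps any iterable and renders a progress bar,
# so a slow loop no longer looks frozen.
for voxel_idx in tqdm(range(n_voxels), desc="Fitting decay model"):
    fit_voxel(voxel_idx)
```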

Member

It looks like there's a problem with scikit-learn-extra in the Python 3.12 tests.

Collaborator

I have an opinion here - I had a branch where I played with putting the progress bar in for curvefit (I think) because it is slow and can be confusing when it just sits there - I think it worked nicely, but I got distracted. Once this is pulled in, I'll look into doing that again. I don't think it makes sense for loglin, but we could add it anyway (being forward-looking, maybe it will be slow someday on some data?).

Member

I like that idea. I've used tqdm before without issue (e.g., in NiMARE), so I'm all for adding progress bars. Not sure about applicability to loglin though. Don't we only loop over unique adaptive mask values there, rather than voxels?

Collaborator

Good point - yeah, makes no sense there. I hadn't even checked it. Curve fit can use it, maybe figure creation? Honestly, once it punches through the ICA steps it's hustling, so it may not be needed elsewhere.

Comment on lines 420 to 424
The applied ICA method. If set to fastica the FastICA from sklearn
library will be run once with the seed value. 'robustica' will run
'FastICA' n_robust_runs times and uses clustering methods to overcome
the randomness of the FastICA algorithm. When set to fastica n_robust_runs
will not be effective.
Member

Suggested change
The applied ICA method. If set to fastica the FastICA from sklearn
library will be run once with the seed value. 'robustica' will run
'FastICA' n_robust_runs times and uses clustering methods to overcome
the randomness of the FastICA algorithm. When set to fastica n_robust_runs
will not be effective.
The applied ICA method. fastica runs FastICA from sklearn
once with the seed value. robustica runs
FastICA n_robust_runs times and uses clustering methods to overcome
the randomness of the FastICA algorithm.
fastica was the default in tedana version 23 and earlier.
robustica will be slower.

@@ -277,6 +278,44 @@ def test_integration_five_echo(skip_integration):
check_integration_outputs(fn, out_dir)


def test_integration_robustica_five_echo(skip_integration):
Member

The integration tests are the slowest tests. Our goal for testing is for every line of code to be run somewhere in the tests. You've added integration tests for all three datasets. Are they testing different sections of the new robust ICA code, or can we just run an integration test using robustICA on only one dataset? We might also be able to just swap one of the 3 datasets so that two are run using fastICA and one is run using robustICA.

The above is particularly true since these integration tests aren't checking whether the outputted values are consistent, just that they fully run. If it would be useful to test the consistency of robust-ICA-related functions, that might be better done with tests of specific functions rather than through an integration test.

Member

For an integration test or function-specific test, is there a way to make sure DBSCAN doesn't converge, so that the (current) fallback to AgglomerativeClustering is exercised instead?

no_outliers = np.count_nonzero(rica.clustering.labels_ == -1)
if no_outliers:
LGR.info(
"The DBSCAN clustering algorithm detected outliers when clustering "
Member

Since the code might run AgglomerativeClustering instead of DBSCAN, this message should account for that.


This specific concern would be addressed by the restructure in BahmanTahayori#4.

Member

@handwerkerd left a comment

@BahmanTahayori I've gone through the entire PR and made some comments. I generally like your approach, but most of my feedback is focused on making sure I understand what's happening and the messages to end users are clear.

I already noticed an issue with the 3-echo integration test, where robust ICA outputs fewer than the specified number of components.

The one big thing I'd still want to do before approving is to run this on a bunch of my own datasets to make sure it's stable and giving consistently plausible results.

robust_method="DBSCAN",
)

s, mmix = rica.fit_transform(data)
Member

I'm trying to make sure I understand this whole section, and I'm seeing something that looks odd. When I run test_integration_three_echo, n_components==69 and the resulting mmix is 75x69. When I run test_integration_robustica_three_echo, n_components==69, but mmix is 75x37 here and 75x36 by the time mmix is returned. The 3-echo dataset we use for integration testing is truncated, so weird things happen, but it's still not great that RobustICA is non-trivially reducing the total number of expected components without a warning or message. Is this expected behavior for RobustICA, or is something wrong?

Member

Thinking about this more, the robust ICA output is appropriate. The truncated dataset has only 75 volumes. MA-PCA with the AIC criterion is estimating 69 components, but that's too many. I'm not surprised that MA-PCA is failing in this case, because this is a weird truncated dataset. That RobustICA can only find 36 stable ICA components from 75 volumes makes sense.

),
default=42,
choices=range(5, 500),
Member

The way we've set up the argument parser, it outputs all choices when someone types tedana --help. I just noticed that, for this option, it outputs the following. Not sure if there's a way to alter this behavior within the parser.

--n_robust_runs {5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499}
                        The number of times robustica will run.This is only effective when ica_method is set to robustica. (default: 30)

Member

Suggested change
choices=range(5, 500),
choices=range(5, 500),
metavar="[5-500]",

metavar can be used to set a displayed range that's different from choices (based on https://stackoverflow.com/questions/25295487/python-argparse-value-range-help-message-appearance). This solves the above issue.


While metavar addresses the help page issue, it's nevertheless unusual to be using choices for an integer input. Would it not suffice to have an upfront error on too few / a warning on too many, plus corresponding suggestions in the help text?

PS. One of the things I need to do for MRtrix3/mrtrix3#2678 is implement a callable to be provided to argparse.add_argument's type kwarg that would permit setting bounds on numeric arguments. So it's something that I've looked into recently. A design like this would work.
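For example, a sketch of that kind of type callable (hypothetical, not the MRtrix3 implementation):

```python
import argparse


def bounded_int(lo, hi):
    """Return an argparse type callable enforcing lo <= value <= hi."""

    def check(value):
        ivalue = int(value)
        if not lo <= ivalue <= hi:
            raise argparse.ArgumentTypeError(
                f"value must be in [{lo}, {hi}], got {ivalue}"
            )
        return ivalue

    return check


parser = argparse.ArgumentParser()
parser.add_argument(
    "--n_robust_runs",
    type=bounded_int(5, 500),
    default=30,
    metavar="[5-500]",  # keeps the help page short
    help="The number of times robustica will run (default: %(default)s).",
)
```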

@handwerkerd
Member

@BahmanTahayori and I talked yesterday. In addition to going through the above review comments, I wanted to highlight one other part of our discussion to other developers (@tsalo @eurunuela @dowdlelt @n-reddy @javiergcas @goodalse2019 @NaomiGaggi @marco7877 @CesarCaballeroGaudes ). Robust ICA seems to always output fewer components than initially requested. This is reasonable. If you have 100 time points and request 90 components, it should not be able to find 90 stable components. This is both a problem and a possible solution for our current issues with PCA dimensionality estimation.

On the problem side, even if we get a perfect dimensionality estimate to input into RobustICA, RobustICA will still output fewer ICA components.

On the solution side, I don't really care how many components are estimated in the PCA step; I care that our ICA components reliably estimate as much of the total structured variance of the data as possible. This means we can potentially give RobustICA an initial number of components that is deliberately too large, such as min(1/3 of time points, number of components explaining 95% of total variance), and RobustICA will settle on a smaller, plausible number.
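In code, that heuristic would look something like this (example values only):

```python
n_timepoints = 160       # volumes in the run (example)
n_components_95var = 60  # PCA components explaining 95% of variance (example)

# Deliberately overshoot; RobustICA settles on a smaller stable set.
n_init = min(n_timepoints // 3, n_components_95var)
print(n_init)  # 53
```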

I've been testing this out on a few datasets and it's promising, but not yet perfect. For example, I tested it on a dataset with 160 time points, with RobustICA set to run FastICA for 30 iterations:

| Initial components | Calculated stable ICA components | Index quality |
| --- | --- | --- |
| 40 | 37 | 0.9483 |
| 45 | 42 | 0.9519 |
| 50 | 38 | 0.9464 |
| 55 | 43 | 0.9501 |
| 60 | 46 | 0.9456 |
| 65 | 45 | 0.9442 |
| 70 | 49 | 0.9448 |
| 75 | 48 | 0.9300 |
| 94 | 49 | 0.9287 |

Ideally, the calculated number of stable components should be the same whether we initialize to 65 or 55. That's not happening, but it is reasonably stable. One other thing that's noteworthy is that I'm running this with a block-design flashing checkerboard task, which almost always results in a high-variance component with weights in primary visual cortex and a time series that follows the block-design task. I am NOT seeing that when I initialized with 94 components, and those final results were qualitatively noisier. This might be a sign that if one overestimates the initial number of components too much, the method breaks down. What might be happening is that the algorithm tries to successfully run FastICA 30 times; when a run fails to converge, it tries again. For 94 components, it failed 20 times (running FastICA a total of 50 times). For 70 components, it failed to converge only 8 times. In addition to being computationally expensive, this means trying to get ICA to converge with way too many components produces stranger results when it does converge.

I'm going to keep playing with this and encourage others to do so as well. I'm also playing with more iterations of FastICA (the new n_robust_runs option in this PR) to see if that results in a more stable number of components across a wider range of inputs. This should definitely be a discussion at our January developers call. If anyone wants to actively look into this between now and then I'll find time to talk more and help you get started.

For reference, the manuscript describing this robust ICA approach is https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05043-9
In parallel, I asked @jefferykclam to work on the dimensionality estimation issue and he was converging on a similar approach based on this publication: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0094943

BahmanTahayori and others added 4 commits February 9, 2024 14:48
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
* RobustICA: Restructure code loop over robust methods

* Addressing the issue with try/except

---------

Co-authored-by: Bahman <tahayori@gmail.com>
@dowdlelt
Collaborator

Just wanted to jump in and officially register my excitement for this - I've checked this out and am currently running it on some troublesome data from Neurostars. I think it will be enlightening.

I am also seeing a lot of failures to converge (~550 timepoints, req 180, ~65% var explained), and the phrase exponential back-off crossed my mind. It's not quite applicable here, but the central idea is that if something fails, you don't just keep trying the same thing (back-off = more waiting). Here, at the risk of adding another layer of complexity (ugh), perhaps an option should be that too many failures (?) to converge get logged, and the entire loop is restarted with N fewer components? Not exponential, but maybe based on the number of timepoints. If someone wild shows up with 1500 timepoints and starts with 500, it might not be nice to step through 499, 498, etc. But if they have 200 and start with 66, it might be smart to do 65, 64, etc. A sketch of what I mean follows.
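A sketch of that back-off idea (all names hypothetical; run_robustica stands in for the RobustICA call and is assumed to report its convergence-failure count):

```python
import logging

LGR = logging.getLogger("GENERAL")


def run_robustica(data, n_components):
    """Hypothetical stand-in: run RobustICA, return convergence-failure count."""
    return 0


def run_with_backoff(data, n_components, n_timepoints, max_failures=10):
    # Larger datasets take bigger steps down, so we don't crawl
    # through 499, 498, ... one component at a time.
    step = max(1, n_timepoints // 200)
    while n_components > 1:
        n_failures = run_robustica(data, n_components)
        if n_failures <= max_failures:
            return n_components
        LGR.warning(
            f"{n_failures} convergence failures with {n_components} "
            f"components; retrying with {n_components - step}."
        )
        n_components -= step
    raise RuntimeError("RobustICA failed to converge at any dimensionality.")
```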

In any case, I'm looking forward to seeing where this goes.

@@ -313,3 +313,13 @@ @misc{sochat2015ttoz
url = {https://doi.org/10.5281/zenodo.32508},
year = 2015
}

@Article{Anglada:2022,
Collaborator

@dowdlelt Feb 20, 2024

Should this be anglada2022?

Got the following near the end of a run

File "/Users/logan/local_code/tedana/tedana/reporting/html_report.py", line 41, in _cite2html
    first_author = bibliography.entries[key].persons["author"][0]
  File "/Users/logan/miniconda3/envs/py310/lib/python3.10/site-packages/pybtex/utils.py", line 163, in __getitem__
    return self._dict[key.lower()]
KeyError: 'anglada2022'

And I notice that all other entries in references.bib are in that lowercase format with no ':'. Maybe the case is handled, but it seems the colon is not?

Or should line 122 in ica.py have the colon?

Author

Hi @dowdlelt. Yes, that is anglada2022. I am currently working on finalising this PR and all these issues will be resolved soon.

Collaborator

awesome news, looking forward to it!

@dowdlelt
Collaborator

One more quick comment - perhaps it is the data I'm looking through, but I'm seeing a bizarre effect - and I feel like it would have been noticed by now if it were general.

I have components that have vastly different timecourses, and yet the maps are just variously scaled versions of each other - practically pixel-identical, except in magnitude. I know it won't be easy to tell from the image - but there are about... I don't know, 70/100 components like this - effectively the same spatial map, only the scale changes.
[image omitted]

This is neurostars data, so perhaps some violence was done to it, but I can't make sense of it. I'll try and poke around to see if I can learn something.

In this commit, some of the comments from Daniel Handwerker and Robert
Smith were incorporated.
@BahmanTahayori BahmanTahayori marked this pull request as draft February 29, 2024 09:06
* Fixing the problem of argument parser for n_robust_runs.

* Removing unnecessary tests from test_integration. There are 3 echo
  tests as before, but the ica_method is robustica for the five- and
  three-echo tests and fastica for the four-echo test.
@Lestropie mentioned this pull request Mar 23, 2024
Conflicts:
	tedana/workflows/tedana.py