Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

-tedpca command-line interface #956

Closed
Lestropie opened this issue Jun 29, 2023 · 3 comments · May be fixed by #1066
Closed

-tedpca command-line interface #956

Lestropie opened this issue Jun 29, 2023 · 3 comments · May be fixed by #1066

Comments

@Lestropie
Copy link

Relates to discussion in #949.

Note that I am approaching this as someone with no experience with specifically TEDANA, but plenty of experience in research software.

In 23.0.x, there is a single command-line option "-tedpca" that controls how much data will pass from the PCA to the ICA. There are multiple ways in which this option can be used:

  1. Specify a string corresponding to one of a set of built-in rank selection methods.
  2. Specify a floating-point value corresponding to a fraction of variance to be preserved.
  3. Specify an integer value corresponding to a number of components to be preserved.

The relevant logic to tease these apart is here.

This however introduces a few problems:

  1. For a rank selection method, one cannot use the built-in argparse capability to enforce that the user input corresponds to one of a set of choices.
  2. The almost identical values 0.99999 and 1.0 will have polar opposite empirical behaviour. The former preserves all components; the latter rejects all but one component due to being interpreted as an integer.

From a software user interface perspective, the way that I would approach this personally would be an argument group, with three mutually exclusive options, eg.:

  1. -tedpca_method, of type choice
  2. -tedpca_rank, of type integer
  3. -tedpca_variance, of type float

This would preclude erroneous user input at the point of command-line parsing.

Further, in our particular circumstance (lead by @BahmanTahayori), where we want to bypass the PCA step here, that would be properly achieved:

  • In the short term using -tedpca_variance 1.0.
  • In the long term with a fourth option eg. -tedpca_skip to be added to this option group.
@handwerkerd
Copy link
Member

For better and worse, we are calling sklearn.decomposition.PCA and we are replicating their terminology ( https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html )

The actual call in tedana is here:
ppca = PCA(copy=False, n_components=algorithm, svd_solver="full")

I fully agree that this is a very confusing convention and shouldn't exist, but I'm not sure if using a different option to send the same parameter to another function might create a different type of confusion.

In general, this hold PCA method parsing step is a messy function with a lot of conditional logic and it's likely we'll want to add more options. I'm at OHBM now, but I'm very open to rethinking both how this section of the workflow is organized and what options can more clearly be presented.

Since this would be a breaking change for some workflows, I'd like to think enough to make a change only once so that we won't have multiple breaks.

@Lestropie
Copy link
Author

we are replicating their terminology

OK, that's useful to know the origin / justification for the interface. I suppose though there can be different demands between an API and a CLI, since with the latter the user technical skill floor is lower and everything is a string. Can optimise for the latter and massage back for the former.

it's likely we'll want to add more options

I'm at the end of my scope of knowledge of TE-dependent (P|I)CA as it is, so would be leaning entirely on the experts here.

Since this would be a breaking change for some workflows, I'd like to think enough to make a change only once

Completely agree.

From a user interface perspective, at least if one were to attempt to use -tedpca with an updated version, argparse should produce a command-line usage warning that that usage is ambiguous and print all possible matching options. Won't however help any workflow engines that wrap the software.

From a software engineering perspective, you may not want to have such backward-incompatible changes being merged to main interspersed with bug fixes and the like.

@Lestropie
Copy link
Author

In meeting 2024-03-21 it was decided to preserve the current command-line option, but refine the internal translation of a user-provided input into the internal representation. Discussion around that alternative solution will best occur in the relevant Pull Request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants