
Procedure #30

Open
tsalo opened this issue Oct 23, 2023 · 7 comments
tsalo commented Oct 23, 2023

  1. Initialize a YODA-style datalad dataset.
    datalad create -c yoda -D "Create superdataset for OpenNeuro dataset dsXXXXXX" "dsXXXXXX"
    cd dsXXXXXX
    
  2. Fork the BIDS raw dataset from OpenNeuroDatasets to ME-ICA.
    • Make sure to deselect "Copy the main branch only" so that all branches (including the git-annex branch) are forked.
  3. Clone the BIDS raw dataset from the ME-ICA GitHub repo.
    datalad clone -d . -D "Clone of OpenNeuro dataset. May be modified to work with fMRIPrep/AFNI and pushed to G-Node GIN." https://github.com/ME-ICA/dsXXXXXX.git inputs/data
    cd inputs/data
    datalad get .
    
  4. Create empty fmriprep and afni output subdatasets.
    datalad create -d . -D "fMRIPrep derivatives for dsXXXXXX." outputs/fmriprep
    datalad create -d . -D "AFNI derivatives for dsXXXXXX." outputs/afni
    
  5. Create a G-Node GIN mirror for the dataset.
    datalad create-sibling-gin --siblingname gin --access-protocol ssh --dataset . ME-ICA/dsXXXXXX_superdataset
    datalad push --to gin
    
  6. Create a code folder in the inputs/data subdataset with scripts to fix any issues in the dataset.
    mkdir -p inputs/data/code
    
  7. Make changes to the dataset.
    datalad run XXX.py
    datalad push --to gin
    
  8. Publish the updated dataset to G-Node GIN (I don't have write permissions to the OpenNeuro dataset).
  9. Create a derivatives datalad dataset.
    mkdir /path/to/derivatives
    cd /path/to/derivatives
    datalad create .
    
  10. Create a G-Node GIN mirror for the derivatives datasets.
    datalad create-sibling-gin --siblingname gin --access-protocol ssh --dataset outputs/fmriprep ME-ICA/dsXXXXXX_fmriprep
    datalad push --to gin outputs/fmriprep
    datalad create-sibling-gin --siblingname gin --access-protocol ssh --dataset outputs/afni ME-ICA/dsXXXXXX_afni
    datalad push --to gin outputs/afni
    
  11. Run the preprocessing pipeline of choice.
    datalad run run_<fmriprep|afni>.sh
    
  12. Publish the derivatives dataset to G-Node GIN as a separate dataset from the raw data.
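The datalad run call in step 7 can be made more reproducible by declaring inputs and outputs explicitly, so datalad knows what to fetch and unlock before the script runs and what to save afterwards. A minimal sketch, assuming a hypothetical fix_metadata.py script (the script path and commit message are placeholders):

```shell
# Hypothetical wrapper for step 7. The fix script path is a placeholder;
# --input/--output tell datalad what to fetch/unlock and what to save.
run_fix() {
    datalad run \
        -m "Fix BIDS metadata for fMRIPrep/AFNI compatibility" \
        --input "inputs/data" \
        --output "inputs/data" \
        "python inputs/data/code/fix_metadata.py"
}
```

Wrapping the call in a function keeps the sketch inert; in practice the datalad run command would be issued directly from the superdataset root.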
@mattcieslak

We may want to suggest that this is done YODA-style: you'd create a dataset first, add the input data as a subdataset, add the code, and datalad run it. Then you can add a GIN sibling and push to it.
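The suggested YODA-style sequence could be sketched roughly like this (a sketch only; the dataset ID, analysis script, and sibling name are placeholders):

```shell
# Hypothetical YODA-style setup, following the suggestion above.
# Dataset IDs, script names, and sibling names are placeholders.
yoda_setup() {
    datalad create -c yoda dsXXXXXX
    cd dsXXXXXX
    # Add the input data as a subdataset:
    datalad clone -d . https://github.com/OpenNeuroDatasets/dsXXXXXX.git inputs/data
    # Add code, then record the computation:
    datalad run -m "Run analysis" "bash code/analysis.sh"
    # Finally, add a GIN sibling and push:
    datalad create-sibling-gin --siblingname gin --dataset . ME-ICA/dsXXXXXX
    datalad push --to gin
}
```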

tsalo commented Oct 23, 2023

Copying the example tree @mattcieslak shared on Slack:

dsXXXXXX/
    inputs/
        data/  <-- subdataset from OpenNeuro
            code/  <-- code to fix BIDS issues in raw dataset
    outputs/
        fmriprep/  <-- subdataset of fMRIPrep derivatives
        afni/  <-- subdataset of AFNI derivatives
    code/  <-- code to run fMRIPrep and AFNI

tsalo commented Oct 23, 2023

@jsheunis what do you think of this approach? I was thinking that the multi-echo superdataset could then just point to inputs/data, outputs/fmriprep, and outputs/afni without retaining the YODA structure.

@jsheunis

In general this makes sense to me. Some specific notes:

  • My understanding of your suggestion is to create a new multi-echo superdataset whose immediate subdatasets are new "wrapper" datasets with the proposed structure, i.e.:
multi-echo-super/
    dsXXXXXX/
        inputs/
            data/  <-- subdataset from OpenNeuro
                code/  <-- code to fix BIDS issues in raw dataset
        outputs/
            fmriprep/  <-- subdataset of fMRIPrep derivatives
            afni/  <-- subdataset of AFNI derivatives
        code/  <-- code to run fMRIPrep and AFNI
    dsYYYYYY/
        inputs/
        outputs/
        code/
    ...

is that correct?

  • Is the code directory in the tree below just a directory, or is it also a subdataset? If the latter, it seems to me that it doesn't have to be nested under data and could move one level up. This is minor, though, and probably just personal preference.
dsXXXXXX/
    inputs/
        data/  <-- subdataset from OpenNeuro
            code/  <-- code to fix BIDS issues in raw dataset
  • If the content of code/ (the code to run fMRIPrep and AFNI) is a containerized pipeline that is applied in the same way across multiple datasets, it would also make sense to structure that as a subdataset, with parameterizations specific to individual datasets.

tsalo commented Oct 23, 2023

My understanding of your suggestion is to create a new multi-echo superdataset whose immediate subdatasets are new "wrapper" datasets with the proposed structure, i.e.:

That's definitely an option, but my first thought was something like this:

multi-echo-super/
    raw/
        dsXXXXXX/  <-- from dsXXXXXX/inputs/data/ in the open-multi-echo-data-generated workflow
        dsYYYYYY/  <-- from dsYYYYYY/inputs/data/ in the open-multi-echo-data-generated workflow
    derivatives/
        dsXXXXXX_fmriprep/  <-- from dsXXXXXX/outputs/fmriprep/ in the open-multi-echo-data-generated workflow
        dsXXXXXX_afni/  <-- from dsXXXXXX/outputs/afni/ in the open-multi-echo-data-generated workflow
        dsYYYYYY_fmriprep/  <-- from dsYYYYYY/outputs/fmriprep/ in the open-multi-echo-data-generated workflow
        dsYYYYYY_afni/  <-- from dsYYYYYY/outputs/afni/ in the open-multi-echo-data-generated workflow
    ...

Is the code directory in the tree below just a directory or also a subdataset?

I was leaning toward just a directory, but your point about containers makes sense. I would optimally have fMRIPrep and AFNI containers stored somewhere and referenced in the code directory.
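Storing the containers themselves under datalad control is possible with the datalad-container extension. A hedged sketch, assuming the extension is installed and using an assumed fMRIPrep image tag:

```shell
# Hypothetical sketch using the datalad-container extension
# (pip install datalad-container). The image tag is an assumption.
register_containers() {
    datalad containers-add fmriprep \
        --url docker://nipreps/fmriprep:23.2.0
    # The registered container can then be referenced from code/:
    # datalad containers-run -n fmriprep -m "Run fMRIPrep" ...
}
```

Using containers-run instead of a bare datalad run has the advantage that the container image itself becomes part of the recorded provenance.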

@jsheunis

Ah okay, that makes sense now, thanks for clarifying. I don't see a problem with this approach. In either case, the superdataset would just be a user-facing tree with nested subdatasets that are maintained individually, so then it wouldn't really matter how exactly the superdataset decides to structure its tree.

tsalo commented May 3, 2024

@mattcieslak I'd like to minimize the storage on CUBIC for this. What I'd really love to do is:

  1. datalad get the data
  2. Make any modifications necessary to prepare the dataset for fMRIPrep and AFNI.
  3. datalad drop any unmodified files, but keep the modified files on CUBIC.

Then, when running fMRIPrep or AFNI:

  1. Loop over subjects (or put them in an array job)
  2. datalad get only the selected subject's data
  3. Run the preprocessing pipeline on that subject
    • I ultimately want to push the preprocessing derivatives to OpenNeuro once I have the OpenNeuro authentication tool working on CUBIC.
  4. datalad drop any unmodified files from the raw dataset for the selected subject

Do you think that's feasible?
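The per-subject get/run/drop cycle above might be sketched like this (assumes a BIDS raw dataset at inputs/data and a hypothetical code/run_fmriprep.sh wrapper; paths are placeholders):

```shell
# Hypothetical per-subject loop implementing the get/run/drop cycle above.
# The raw-dataset path and the wrapper script are assumptions.
RAW=${RAW:-inputs/data}

list_subjects() {
    # Enumerate BIDS subject directories (sub-*) under a dataset root.
    for d in "$1"/sub-*/; do
        [ -d "$d" ] && basename "$d"
    done
}

preprocess_all() {
    for subj in $(list_subjects "$RAW"); do
        datalad get "$RAW/$subj"                  # fetch this subject's data only
        datalad run -m "Preprocess $subj" \
            --input "$RAW/$subj" \
            --output "outputs/fmriprep/$subj" \
            "bash code/run_fmriprep.sh $subj"     # hypothetical wrapper script
        datalad drop "$RAW/$subj"                 # free local space again
    done
}
```

Since datalad run fetches declared --input content itself, the explicit get could even be dropped; the drop afterwards is what keeps the CUBIC footprint small, and each iteration maps naturally onto one task of an array job.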
