DataLad for multi-echo data access #13

Open
jsheunis opened this issue Mar 12, 2022 · 17 comments

@jsheunis

What do you think about using DataLad to streamline data access for publicly available ME datasets? It looks like all of the datasets used in the book that don't require a data use agreement are on OpenNeuro, i.e. they are already DataLad datasets. It will be easy to include those as subdatasets into a multi-echo "super dataset" that people can clone and then download individual subdatasets or files selectively.

Of course, we don't have to make DataLad a requirement for people working with the book's tutorials, so this could also just be an alternative for those who have datalad installed.

Additionally, if some tutorials can be run on Binder, we have this ready-made config for running datalad on binder: https://github.com/datalad/datalad-binder

@tsalo
Member

tsalo commented Mar 14, 2022

The problem with the existing OpenNeuro datasets is that most don't have the echo-wise preprocessed data we need for our examples. We thought of just fMRIPrepping the open datasets ourselves and uploading the derivatives to OpenNeuro in separate "datasets" linking to the original ones, but OpenNeuro doesn't currently support uploading derivatives-only datasets (see OpenNeuroOrg/openneuro#2436), so I don't know if we can directly use OpenNeuro for most of our planned examples. Currently, we're looking at uploading fMRIPrep derivatives to the OSF and using a fetcher to grab them from there. Is there a storage alternative that would be more compatible with DataLad?

@tsalo
Member

tsalo commented Mar 14, 2022

Chris actually mentioned G-Node in that issue, which I had forgotten. Would that be a good alternative?

I think we looked at it but decided against it for tedana's datasets module (see ME-ICA/tedana#684) because it would require a new dependency and no one was familiar with it.

@jsheunis
Author

Yup, GIN is a good option for public and free hosting of data (up to a number of terabytes per account/repo, IIRC). And it works well with standard DataLad functionality. See here for a walkthrough of how to publish/connect a DataLad dataset to GIN: https://handbook.datalad.org/en/latest/basics/101-139-gin.html
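
For reference, the basic flow from that walkthrough looks roughly like this (assuming an empty repo was already created on gin.g-node.org via the web UI; the user/repo names are placeholders):

$ datalad create my-dataset && cd my-dataset   # or start from an existing dataset
$ datalad save -m "add data"                   # commit files (annexed by default)
$ datalad siblings add --name gin --url git@gin.g-node.org:/me/my-dataset.git
$ datalad push --to gin                        # pushes git history and annexed content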

DataLad also has an extension for integrating with OSF, http://docs.datalad.org/projects/osf/en/latest/, so that's also a possibility.
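
A minimal sketch of the datalad-osf route, with a placeholder project ID:

$ pip install datalad-osf
$ datalad clone osf://abc12   # abc12 is a placeholder OSF project ID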

I guess it depends on which dependencies are fine to include (if any at all) for which packages (tedana as a whole vs. only the Jupyter Book). Looking at ME-ICA/tedana#684, DataLad can do all of that quite well, although I can understand the hesitation to include new dependencies (for DataLad: mainly datalad, git, and git-annex) versus building a lightweight module that does something specific with well-defined boundary conditions.

Either way, if DataLad is an alternative for getting data used in the book, I can see the superdataset having a structure like this:

public-multi-echo-data
├── raw
│   ├── ds1
│   ├── ds2
│   ...
│   └── dsN
├── derivatives
│   ├── ds1_deriv
│   ├── ds2_deriv
│   ...
│   └── dsN_deriv
 ...
└── README

where all raw or derivative datasets would essentially be git submodules that reference the respective source datasets, which are in turn hosted either on OpenNeuro (i.e. the raw datasets) or, for example, on GIN (i.e. the derivative datasets). Having all of these structured as a hierarchy of nested DataLad datasets makes it very easy for datalad to give users access to any specific (sub)datasets and/or files.
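
A rough sketch of assembling such a superdataset (the accession number is a placeholder):

$ datalad create public-multi-echo-data
$ cd public-multi-echo-data
# clone an OpenNeuro dataset into the hierarchy; -d . registers it
# as a subdataset (git submodule) of the superdataset
$ datalad clone -d . https://github.com/OpenNeuroDatasets/dsXXXXXX.git raw/dsXXXXXX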

@jsheunis
Author

Here's v1 of the super-dataset, currently containing only raw subdatasets that are hosted on OpenNeuro: https://github.com/jsheunis/multi-echo-super

@jsheunis
Author

jsheunis commented Mar 16, 2022

The multi-echo-super dataset now has all open multi-echo datasets from OpenNeuro included (as far as I'm aware), as well as the fMRIPrep-processed data of the Multi-echo Cambridge dataset that's on OSF (see this comment)

@notZaki, did you use the OSF API to get file paths and URLs in order to build the manifest.json file? If so, do you still have a script lying around? The manifest file was very useful for creating a datalad dataset linking to the file storage on OSF. I want to do the same for the masking test dataset on OSF, which doesn't currently have a manifest.

@notZaki

notZaki commented Mar 16, 2022

@jsheunis Here's a link to the manifest file for the masking test dataset: manifest.json (might not last forever)

I made this Julia package to create the JSON file. There is an example in the README showing how to produce such files. Alternatively, the osfclient package for Python might also be able to do something similar, but I haven't used it.
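
For what it's worth, an untested sketch with osfclient's Python API (the project ID is a placeholder) would be something like:

from osfclient import OSF

osf = OSF()  # anonymous access is enough for public projects
project = osf.project('abc12')  # placeholder OSF project ID
storage = project.storage('osfstorage')
for f in storage.files:
    print(f.path)  # collect these paths into a manifest-like structure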

@jsheunis
Author

Oh, that's perfect, thanks @notZaki!

@jsheunis
Author

And thanks for the pointers to your Julia package and osfclient 👍

@notZaki

notZaki commented Mar 16, 2022

@emdupre has also made CSV files for fetching data, but I don't remember how that was done.

@emdupre
Member

emdupre commented Mar 16, 2022

I had just grabbed them with Python requests; here's a short gist demonstrating the idea.

That really works best for flat directory structures, but for more nested ones you'll have to add another loop! At some point I tried osfclient, but that might have been between OSF API versions, so IIRC it wasn't yet updated. I haven't tried more recently, though!
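
Roughly, the requests approach looks like this against the OSF v2 API (the node ID is a placeholder, and per the caveat above this only covers one directory level; responses are also paginated via links['next']):

import requests

node = 'abc12'  # placeholder OSF node ID
resp = requests.get(f'https://api.osf.io/v2/nodes/{node}/files/osfstorage/')
resp.raise_for_status()
for item in resp.json()['data']:
    if item['attributes']['kind'] == 'file':
        # each file entry carries a direct download URL
        print(item['attributes']['name'], item['links']['download'])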

@jsheunis
Author

Thanks! I'll update here in case I try the recent osfclient.

@tsalo
Member

tsalo commented Nov 19, 2022

Is there a good way to use the datalad Python tool or repo2data to grab only a single folder from a G-Node GIN or datalad dataset? I think installing the whole dataset would take too long in some cases (e.g., with the Cambridge and Le Petit Prince fMRIPrep derivatives).

@jsheunis
Author

jsheunis commented Nov 20, 2022

@tsalo Just to be sure we're talking about the same things, with "grab only a single folder" do you refer to retrieving actual file content, or just getting the file tree (from git)? And with "installing a whole dataset" do you mean install in the datalad sense (where the git repo is cloned, but file content is not (yet) retrieved), or do you mean retrieving all data locally?

With datalad you can clone (a.k.a. install) the whole dataset easily, e.g.:

$ datalad clone https://github.com/jsheunis/multi-echo-cambridge-fmriprep.git

This clones the dataset's git repo and some datalad config files, but no file content. It takes a few seconds. And then you can get (and drop) specific file content on demand, e.g. all files within a directory at a specified relative path:

$ cd multi-echo-cambridge-fmriprep
$ datalad get sub-20847/figures/*

get(ok): sub-20847/figures/sub-20847_task-rest_desc-rois_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-carpetplot_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-summary_T1w.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_space-MNI152NLin2009cAsym_T1w.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-summary_bold.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-confoundcorr_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-conform_T1w.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-validation_bold.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-compcorvar_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-about_T1w.html (file) [from web...]
  [2 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  get (ok: 12)
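
Since the question mentioned the datalad Python tool: the equivalent via the Python API would be roughly:

import datalad.api as dl

# clone fetches the git repo and datalad config, but no annexed file content
ds = dl.clone(source='https://github.com/jsheunis/multi-echo-cambridge-fmriprep.git')
# retrieve content for one directory on demand
ds.get('sub-20847/figures/')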

@tsalo
Member

tsalo commented Nov 23, 2022

Sorry for the confusion.

Just to be sure we're talking about the same things, with "grab only a single folder" do you refer to retrieving actual file content, or just getting the file tree (from git)?

I'm referring to just getting the file tree.

And with "installing a whole dataset" do you mean install in the datalad sense (where the git repo is cloned, but file content is not (yet) retrieved), or do you mean retrieving all data locally?

I'm referring to installing in the datalad sense.

With datalad you can clone (a.k.a. install) the whole dataset easily

My concern is that datalad clone https://gin.g-node.org/ME-ICA/ds003643-fmriprep-derivatives took several hours to clone the Le Petit Prince fMRIPrep derivatives on my laptop, so I'm worried that running that on each build of the Jupyter Book would be an issue. I was hoping there might be a way to limit it to just a single subject's data.

Maybe more is indexed with git (vs. git-annex) on G-Node GIN by default, but it seemed like most non-NIfTI files were downloaded during the clone step.
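
One quick way to check how much lives in git versus the annex (standard git/git-annex commands):

$ git count-objects -vH   # size of the git object store, which every clone has to fetch
$ git annex info --fast   # summary of annexed files and local annex size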

@jsheunis
Author

Thanks for clarifying, and for the link to the repo. It looks like the dataset has too many files in git vs. git-annex. If you used datalad to create the dataset, you can control this via configurations: https://handbook.datalad.org/en/latest/basics/101-122-config.html
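
For example, to send everything (even small text files) to the annex rather than git, the dataset's .gitattributes could contain:

* annex.largefiles=anything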

A way to amend the dataset so that files are moved from git to git-annex (and removed from the git history) is described here: http://handbook.datalad.org/en/latest/beyond_basics/101-162-springcleaning.html#getting-contents-out-of-git. It involves (a rough sketch follows the list):

  • cloning the dataset locally and getting all the file contents
  • using git-filter-repo to remove unwanted files from git
  • removing stale file content that is no longer referenced from the annex
  • git garbage collection
  • saving and pushing the dataset to the GIN sibling
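
Very roughly, and untested here, that procedure would look something like the following (the glob is a placeholder for whichever files are bloating git; keep a backup copy of the dataset first):

$ datalad clone https://gin.g-node.org/ME-ICA/ds003643-fmriprep-derivatives cleanup && cd cleanup
$ datalad get .                                  # make sure all annexed content is present locally
$ git filter-repo --path-glob '*.svg' --invert-paths --force   # placeholder glob: purge these files from git history
$ git annex unused && git annex dropunused all   # drop annex objects that are no longer referenced
$ git gc --aggressive
# restore the purged files from the backup and `datalad save` them so they are annexed this time,
# then force-push the rewritten history:
$ datalad push --to origin -f gitpush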

This handbook chapter also describes other ways to keep dataset size small, e.g. using subdatasets per subject: http://handbook.datalad.org/en/latest/beyond_basics/101-161-biganalyses.html#calculate-in-greater-numbers

@tsalo
Member

tsalo commented Nov 28, 2022

Ohhhh thanks! I'll try modifying the dataset. That will make using it way easier!

Do you have a recommendation for downloading the data for this book? Should we use datalad to clone the dataset and install one subject's data in a separate script (e.g., the download_data chapter), or can we use repo2data for this?

@jsheunis
Author

Do you mean when downloading data for the book during the build process? I would say datalad is a good option, yes, if we have all datasets available as datalad datasets (that was what I intended when creating this issue), and if the infrastructure we run the build process or the notebooks on has datalad's requirements installed. I see there's a GitHub Actions workflow using Ubuntu to build the book, so it will be easy to add steps for installing git-annex and datalad.
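
On an Ubuntu runner, the extra steps would be something like:

$ sudo apt-get update && sudo apt-get install -y git-annex
$ pip install datalad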

It looks like all the publicly available datasets listed in the book are already included in the multi-echo-super dataset here: https://github.com/jsheunis/multi-echo-super/tree/main/raw, and the derivatives are added as they are made available, so I think datalad should work.

The way to access individual subjects' files of specific datasets would then be:

datalad clone https://github.com/jsheunis/multi-echo-super # clones the superdataset, which is aware of its linked subdatasets, but these aren't cloned yet

# let's say we're interested in EuskalIBUR
cd multi-echo-super
datalad get --no-data raw/EuskalIBUR # this clones the subdataset at the provided path relative to the superdataset, but doesn't retrieve data content

# let's say we're interested in all data of "sub-001/ses-01"
cd raw/EuskalIBUR
datalad get sub-001/ses-01/*

# or if we want a very specific file
datalad get sub-001/ses-01/anat/sub-001_ses-01_T2w.nii.gz
