DataLad for multi-echo data access #13

Open
jsheunis opened this issue Mar 12, 2022 · 17 comments

@jsheunis

What do you think about using DataLad to streamline data access for publicly available ME datasets? It looks like all of the datasets used in the book that don't require a data use agreement are on OpenNeuro, i.e. they are already DataLad datasets. It will be easy to include those as subdatasets into a multi-echo "super dataset" that people can clone and then download individual subdatasets or files selectively.

Of course, we don't have to make DataLad a requirement for people working with the book's tutorials, so this could also just be an alternative for those who have datalad installed.

Additionally, if some tutorials can be run on Binder, we have this ready-made config for running datalad on binder: https://github.com/datalad/datalad-binder

@tsalo
Member

tsalo commented Mar 14, 2022

The problem with the existing OpenNeuro datasets is that most don't have the echo-wise preprocessed data we need for our examples. We thought of just fMRIPrepping the open datasets ourselves and uploading the derivatives to OpenNeuro in separate "datasets" linking to the original ones, but OpenNeuro doesn't currently support uploading derivatives-only datasets (see OpenNeuroOrg/openneuro#2436), so I don't know if we can directly use OpenNeuro for most of our planned examples. Currently, we're looking at uploading fMRIPrep derivatives to the OSF and using a fetcher to grab them from there. Is there a storage alternative that would be more compatible with DataLad?

@tsalo
Member

tsalo commented Mar 14, 2022

Chris actually mentioned G-Node in that issue, which I had forgotten. Would that be a good alternative?

I think we looked at it but decided against it for tedana's datasets module (see ME-ICA/tedana#684) because it would require a new dependency and no one was familiar with it.

@jsheunis
Author

Yup, GIN is a good option for public and free hosting of data (up to a number of terabytes per account/repo, IIRC). And it works well with standard DataLad functionality. See here for a walkthrough of how to publish/connect a DataLad dataset to GIN: https://handbook.datalad.org/en/latest/basics/101-139-gin.html
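
For reference, the basic flow from that walkthrough looks roughly like this (assuming an empty repo was already created on gin.g-node.org via the web UI; the user/repo names are placeholders):

$ datalad create my-dataset && cd my-dataset   # or start from an existing dataset
$ datalad save -m "add data"                   # commit files (annexed by default)
$ datalad siblings add --name gin --url git@gin.g-node.org:/me/my-dataset.git
$ datalad push --to gin                        # pushes git history and annexed content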

DataLad also has an extension for integrating with OSF, http://docs.datalad.org/projects/osf/en/latest/, so that's also a possibility.
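
A minimal sketch of the datalad-osf route, with a placeholder project ID:

$ pip install datalad-osf
$ datalad clone osf://abc12   # abc12 is a placeholder OSF project ID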

I guess it depends on which dependencies are fine to include (if any at all) for which packages (tedana as a whole vs. only the Jupyter Book). Looking at ME-ICA/tedana#684, DataLad can do all of that quite well, although I can understand the hesitation to include new dependencies (for DataLad: mainly datalad, git, and git-annex) versus building a lightweight module that does something specific with well-defined boundary conditions.

Either way, if DataLad is an alternative for getting data used in the book, I can see the superdataset having a structure like this:

public-multi-echo-data
├── raw
│   ├── ds1
│   ├── ds2
│   ...
│   └── dsN
├── derivatives
│   ├── ds1_deriv
│   ├── ds2_deriv
│   ...
│   └── dsN_deriv
 ...
└── README

where all raw or derivative datasets would essentially be git submodules that reference the respective source datasets, which are in turn hosted either on OpenNeuro (i.e. the raw datasets) or, for example, on GIN (i.e. the derivative datasets). Having all of these structured as a hierarchy of nested DataLad datasets makes it very easy for datalad to give users access to any specific (sub)datasets and/or files.
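
A rough sketch of assembling such a superdataset (the accession number is a placeholder):

$ datalad create public-multi-echo-data
$ cd public-multi-echo-data
# clone an OpenNeuro dataset into the hierarchy; -d . registers it
# as a subdataset (git submodule) of the superdataset
$ datalad clone -d . https://github.com/OpenNeuroDatasets/dsXXXXXX.git raw/dsXXXXXX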

@jsheunis
Author

Here's v1 of the super-dataset, currently containing only raw subdatasets that are hosted on OpenNeuro: https://github.com/jsheunis/multi-echo-super

@jsheunis
Author

jsheunis commented Mar 16, 2022

The multi-echo-super dataset now has all open multi-echo datasets from OpenNeuro included (as far as I'm aware), as well as the fMRIPrep-processed data of the Multi-echo Cambridge dataset that's on OSF (see this comment)

@notZaki, did you use the OSF API to get file paths and URLs in order to build the manifest.json file? If so, do you still have a script lying around? The manifest file was very useful for creating a datalad dataset linking to the file storage on OSF. I want to do the same for the masking test dataset on OSF, which doesn't currently have a manifest.

@notZaki

notZaki commented Mar 16, 2022

@jsheunis Here's a link to the manifest file for the masking test dataset: manifest.json (might not last forever)

I made this Julia package to create the JSON file. There is an example in the README showing how to produce such files. Alternatively, the osfclient package for Python might also be able to do something similar, but I haven't used it.
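
For what it's worth, an untested sketch with osfclient's Python API (the project ID is a placeholder) would be something like:

from osfclient import OSF

osf = OSF()  # anonymous access is enough for public projects
project = osf.project('abc12')  # placeholder OSF project ID
storage = project.storage('osfstorage')
for f in storage.files:
    print(f.path)  # collect these paths into a manifest-like structure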

@jsheunis
Author

Oh, that's perfect, thanks @notZaki!

@jsheunis
Author

And thanks for the pointers to your Julia package and osfclient 👍

@notZaki

notZaki commented Mar 16, 2022

@emdupre has also made CSV files for fetching data, but I don't remember how that was done.

@emdupre
Member

emdupre commented Mar 16, 2022

I had just grabbed them with Python requests; here's a short gist demonstrating the idea.

That really works best for flat directory structures, but for more nested ones you'll have to add another loop! At some point I tried osfclient, but that might have been between OSF API versions, so IIRC it wasn't yet updated. I haven't tried more recently, though!
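
Roughly, the requests approach looks like this against the OSF v2 API (the node ID is a placeholder, and per the caveat above this only covers one directory level; responses are also paginated via links['next']):

import requests

node = 'abc12'  # placeholder OSF node ID
resp = requests.get(f'https://api.osf.io/v2/nodes/{node}/files/osfstorage/')
resp.raise_for_status()
for item in resp.json()['data']:
    if item['attributes']['kind'] == 'file':
        # each file entry carries a direct download URL
        print(item['attributes']['name'], item['links']['download'])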

@jsheunis
Author

Thanks! I'll update here in case I try the recent osfclient.

@tsalo
Member

tsalo commented Nov 19, 2022

Is there a good way to use the datalad Python tool or repo2data to grab only a single folder from a G-Node GIN or datalad dataset? I think installing the whole dataset would take too long in some cases (e.g., with the Cambridge and Le Petit Prince fMRIPrep derivatives).

@jsheunis
Author

jsheunis commented Nov 20, 2022

@tsalo Just to be sure we're talking about the same things, with "grab only a single folder" do you refer to retrieving actual file content, or just getting the file tree (from git)? And with "installing a whole dataset" do you mean install in the datalad sense (where the git repo is cloned, but file content is not (yet) retrieved), or do you mean retrieving all data locally?

With datalad you can clone (a.k.a. install) the whole dataset easily, e.g.:

$ datalad clone https://github.com/jsheunis/multi-echo-cambridge-fmriprep.git

This clones the dataset's git repo and some datalad config files, but no file content. It takes a few seconds. And then you can get (and drop) specific file content on demand, e.g. all files within a directory at a specified relative path:

$ cd multi-echo-cambridge-fmriprep
$ datalad get sub-20847/figures/*

get(ok): sub-20847/figures/sub-20847_task-rest_desc-rois_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-carpetplot_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-summary_T1w.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_space-MNI152NLin2009cAsym_T1w.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-summary_bold.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-confoundcorr_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-conform_T1w.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-validation_bold.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-compcorvar_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-about_T1w.html (file) [from web...]
  [2 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  get (ok: 12)
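
Since the question mentioned the datalad Python tool: the equivalent via the Python API would be roughly:

import datalad.api as dl

# clone fetches the git repo and datalad config, but no annexed file content
ds = dl.clone(source='https://github.com/jsheunis/multi-echo-cambridge-fmriprep.git')
# retrieve content for one directory on demand
ds.get('sub-20847/figures/')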

@tsalo
Member

tsalo commented Nov 23, 2022

Sorry for the confusion.

Just to be sure we're talking about the same things, with "grab only a single folder" do you refer to retrieving actual file content, or just getting the file tree (from git)?

I'm referring to just getting the file tree.

And with "installing a whole dataset" do you mean install in the datalad sense (where the git repo is cloned, but file content is not (yet) retrieved), or do you mean retrieving all data locally?

I'm referring to installing in the datalad sense.

With datalad you can clone (a.k.a. install) the whole dataset easily

My concern is that datalad clone https://gin.g-node.org/ME-ICA/ds003643-fmriprep-derivatives took several hours to clone the Le Petit Prince fMRIPrep derivatives on my laptop, so I'm worried that running that on each build of the Jupyter Book would be an issue. I was hoping there might be a way to limit it to just a single subject's data.

Maybe more is indexed with git (vs. git-annex) on G-Node GIN by default, but it seemed like most non-NIfTI files were downloaded during the clone step.
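
One quick way to check how much lives in git versus the annex (standard git/git-annex commands):

$ git count-objects -vH   # size of the git object store, which every clone has to fetch
$ git annex info --fast   # summary of annexed files and local annex size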

@jsheunis
Author

Thanks for clarifying, and for the link to the repo. It looks like the dataset has too many files in git vs. git-annex. If you used datalad to create the dataset, you can control this via configurations: https://handbook.datalad.org/en/latest/basics/101-122-config.html
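
For example, to send everything (even small text files) to the annex rather than git, the dataset's .gitattributes could contain:

* annex.largefiles=anything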

A way to amend the dataset so that files are moved from git to git-annex (and removed from the git history) is described here: http://handbook.datalad.org/en/latest/beyond_basics/101-162-springcleaning.html#getting-contents-out-of-git. It involves (a rough sketch follows the list):

  • cloning the dataset locally and getting all the file contents
  • using git-filter-repo to remove unwanted files from git
  • removing stale file content that is no longer referenced from the annex
  • git garbage collection
  • saving and pushing the dataset to the GIN sibling
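
Very roughly, and untested here, that procedure would look something like the following (the glob is a placeholder for whichever files are bloating git; keep a backup copy of the dataset first):

$ datalad clone https://gin.g-node.org/ME-ICA/ds003643-fmriprep-derivatives cleanup && cd cleanup
$ datalad get .                                  # make sure all annexed content is present locally
$ git filter-repo --path-glob '*.svg' --invert-paths --force   # placeholder glob: purge these files from git history
$ git annex unused && git annex dropunused all   # drop annex objects that are no longer referenced
$ git gc --aggressive
# restore the purged files from the backup and `datalad save` them so they are annexed this time,
# then force-push the rewritten history:
$ datalad push --to origin -f gitpush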

This handbook chapter also describes other ways to keep dataset size small, e.g. using subdatasets per subject: http://handbook.datalad.org/en/latest/beyond_basics/101-161-biganalyses.html#calculate-in-greater-numbers

@tsalo
Member

tsalo commented Nov 28, 2022

Ohhhh thanks! I'll try modifying the dataset. That will make using it way easier!

Do you have a recommendation for downloading the data for this book? Should we use datalad to clone the dataset and install one subject's data in a separate script (e.g., the download_data chapter), or can we use repo2data for this?

@jsheunis
Author

Do you mean when downloading data for the book during the build process? I would say datalad is a good option, yes, if we have all datasets available as datalad datasets (that was what I intended when creating this issue), and if the infrastructure we run the build process or the notebooks on has datalad's requirements installed. I see there's a GitHub Actions workflow using Ubuntu to build the book, so it will be easy to add steps for installing git-annex and datalad.
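
On an Ubuntu runner, the extra steps would be something like:

$ sudo apt-get update && sudo apt-get install -y git-annex
$ pip install datalad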

It looks like all the publicly available datasets listed in the book are already included in the multi-echo-super dataset here: https://github.com/jsheunis/multi-echo-super/tree/main/raw, and the derivatives are added as they are made available, so I think datalad should work.

The way to access individual subjects' files of specific datasets would then be:

datalad clone https://github.com/jsheunis/multi-echo-super # clones the superdataset, which is aware of its linked subdatasets, but these aren't cloned yet

# let's say we're interested in EuskalIBUR
cd multi-echo-super
datalad get --no-data raw/EuskalIBUR # this clones the subdataset at the provided path relative to the superdataset, but doesn't retrieve data content

# let's say we're interested in all data of "sub-001/ses-01"
cd raw/EuskalIBUR
datalad get sub-001/ses-01/*

# or if we want a very specific file
datalad get sub-001/ses-01/anat/sub-001_ses-01_T2w.nii.gz
