
Directory structure organization and file naming convention #4

SylvainTakerkart opened this issue Sep 10, 2020 · 23 comments

@SylvainTakerkart commented Sep 10, 2020

This issue is where we'll discuss the directory structure that will contain the data and metadata: the organization of the directories and sub-directories, the number of levels in the hierarchy, the naming of the directories and sub-directories, and the naming of the files.

The different elements that might be included in this hierarchy are: the experiment itself, the subject, the recording session, etc.

@SylvainTakerkart commented Sep 10, 2020

We can of course (and should) take inspiration from how BIDS does things for human neuroimaging. Here is an example from the BIDS main webpage (on the right-hand side): https://bids.neuroimaging.io/assets/img/dicom-reorganization-transparent-white_1000x477.png . But we obviously might deviate from this!

@chrisvdt commented:

I would like to suggest the following hierarchy:
Project_name/Dataset_name/subject_name/date_time_session/raw (versus /derived)
Within such a folder, all files associated with the same session could have the same ID, which makes it easy to automate retrieval of files.
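
For illustration, a minimal sketch (with hypothetical project, dataset, and subject names) of how a shared session ID makes file retrieval scriptable under such a hierarchy:

```python
from pathlib import Path

# Hypothetical layout following the suggested hierarchy:
#   Project_name/Dataset_name/subject_name/date_time_session/{raw,derived}
# All files of one session carry the same session ID in their name,
# so collecting them needs nothing more than a glob.
root = Path("ProjectA/Dataset1/mouse42/2020-09-10_1030")
session_id = "2020-09-10_1030"

def session_files(stage):
    """Collect every file of one processing stage ('raw' or 'derived')
    that carries the shared session ID in its name."""
    stage_dir = root / stage
    if not stage_dir.is_dir():  # nothing recorded for this stage yet
        return []
    return sorted(p for p in stage_dir.rglob(f"*{session_id}*") if p.is_file())

raw_files = session_files("raw")
derived_files = session_files("derived")
```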

@jbpoline commented:

Learning from the BIDS issues so as not to repeat them would be good as well. There is a project to have a JSON schema to describe BIDS that I think should be used as a model, to avoid the inconsistencies that we have in BIDS. This PR points to the JSON schema code in BIDS.
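
As a purely illustrative sketch (not the actual BIDS schema project or its rule format), a naming rule expressed as a small JSON-schema-like fragment and checked in Python:

```python
import re

# Hypothetical rule: a session directory must look like ses-<YYYYMMDD>_<label>.
SESSION_DIR_RULE = {
    "type": "string",
    "pattern": r"^ses-\d{8}_[A-Za-z0-9]+$",
    "description": "Session directory name: ses-<YYYYMMDD>_<label>",
}

def satisfies(name, rule):
    """Return True if `name` matches the rule's regular expression."""
    return re.match(rule["pattern"], name) is not None

print(satisfies("ses-20200910_001", SESSION_DIR_RULE))  # True
print(satisfies("session_2020", SESSION_DIR_RULE))      # False
```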

@SylvainTakerkart commented:

> I would like to suggest the following hierarchy:
> Project_name/Dataset_name/subject_name/date_time_session/raw (versus /derived)
> Within such a folder, all files associated with the same session could have the same ID, which makes it easy to automate retrieval of files.

thanks for this suggestion!

an interesting (and necessary) topic we'll have to discuss will be to come to an agreement on what is a "project", what is a "dataset", etc.; from all our discussions with the experimenters (mostly electrophysiologists working on non-human primates and rodents) within our institute, this was not as simple as one could expect ;)

@chrisvdt commented:

> I would like to suggest the following hierarchy:
> Project_name/Dataset_name/subject_name/date_time_session/raw (versus /derived)
> Within such a folder, all files associated with the same session could have the same ID, which makes it easy to automate retrieval of files.
>
> thanks for this suggestion!
>
> an interesting (and necessary) topic we'll have to discuss will be to come to an agreement on what is a "project", what is a "dataset", etc.; from all our discussions with the experimenters (mostly electrophysiologists working on non-human primates and rodents) within our institute, this was not as simple as one could expect ;)

Yes, I can understand this, we've also had our discussions about this. So a bit of explanation. For us, Project is like the introduction to an article: it covers the background of your research, affiliations, authors, hypotheses, and related work.
Within a project you might run different experiments, using different methods, with different sets of animals. So here we think you should define different datasets. Datasets prescribe how data should be preprocessed, which is determined by which subjects were included and what methods were used. This should be consistent within a dataset.
Below this level there could be more levels, but at least I think there should be a Subject level and a Date_time_session level.

I've included a distinction between raw and derived, because when you share data you could decide to share only raw or only derived data, and having this separation at the base would make this easier. (I think our aim should be to make data more accessible and shareable.)

@SylvainTakerkart commented:

thanks for your explanations!

> Project_name/Dataset_name/subject_name/date_time_session/raw (versus /derived)

We're actually very close to what you're suggesting in what we're locally trying to implement... Here is what we're suggesting for now:
exp-NAME/sub-GUID/ses-YYYYMMDD_XXX_BBBB/rawdata
exp-NAME/sub-GUID/ses-YYYYMMDD_XXX_BBBB/derivatives
exp-NAME/sub-GUID/ses-YYYYMMDD_XXX_BBBB/metadata

Interestingly, we did not use the concepts of Project and Dataset, but we used the concept of Experiment, with one less level of hierarchy compared to what you describe... Our specs are here: https://int-nit.github.io/AnDOChecker/ . And we have the equivalent of the BIDS validator running for these specs: https://andocheck.int.univ-amu.fr/ . But we still consider this to be under development (we only have a few beta testers internally for now...) so we're totally up for merging / fusing / adapting / tweaking, which is what we hope can happen with this discussion!!!

Some details:

  • we're pushing towards using a GUID as the subject name (which is a bit difficult for investigators working with one to three monkeys, but is more easily acceptable for rodent-based research);
  • to characterize the session, we also use the date, but XXX is a user-defined session number (where you can use the time), and we left BBBB as a free-form string so that users can add what they want (a small parsing sketch follows below).
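
For illustration, a rough sketch of parsing and checking the session directory pattern described above; the regular expression is an assumption for this example and is not taken from the AnDOChecker code:

```python
import re

# ses-YYYYMMDD_XXX_BBBB: date, user-defined session number, free-form label.
SES_PATTERN = re.compile(r"^ses-(?P<date>\d{8})_(?P<number>[^_]+)_(?P<label>.+)$")

for name in ["ses-20200910_001_anesthetized", "ses-2020-09-10_001"]:
    m = SES_PATTERN.match(name)
    print(name, "->", m.groupdict() if m else "invalid")
```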

@SylvainTakerkart commented:

pinging @satra @yarikoptic @bendichter: you guys made the choice to go flatter (instead of using several levels of hierarchy as in BIDS and as in the two aforementioned suggestions) with dandi, right? could you quickly summarize why?

@SylvainTakerkart commented:

pinging @tgbugs; could you also tell us what you came up with for SPARC?

@SylvainTakerkart commented:

pinging @lepmik; also, how do you think such a directory structure would interact with your exdir?

@bendichter commented Sep 28, 2020

@SylvainTakerkart NWB files have internal structure that accommodates raw data, derived data, and metadata all in one file. The original DANDI layout deferred to NWB files for within-session data organization, and the file organization only handles the super-session info: dandiset and subject. That said, we have been talking on the DANDI team about separating raw from derived data as you have here, and we can discuss changing our format to match yours in that aspect.

@satra commented Sep 28, 2020

@SylvainTakerkart - thanks for the ping. Here are a few thoughts on the issue.

In BIDS I made a suggestion to consider individual-level data vs data that is generated from aggregation of information across individuals (bids-standard/bids-2-devel#43). It makes for a cleaner and more objective separation of the information that is contained in the folder. With the growing hardware ecosystem in neurophys for freely behaving experiments (openephys + bonsai), we are also likely to see data that involves multiple interacting participants. We would still be able to organize information from the point of view of a participant, but that may include data about other participants.

The words rawdata, derivatives, and metadata are often linked quite directly to acquisition instruments, experimental techniques, pipelines, and implementation strategies. What is a derivative today could become raw tomorrow, as instruments integrate more processing into them. I would suggest avoiding that nomenclature if possible. This was also another reason for the participant-level organization.

Re: flattening in DANDI:
The primary reason is the use of NWB, which itself is an organized store of information and metadata and contains most of the details within one construct. We did not want to replicate it outside, and we limit any information replication to the kinds of information that a neurophys researcher may typically want from perusing the filename.

Another consideration in DANDI is a world of objects where the data never hits the filesystem, or where the filesystem is an object store. This is increasingly true for larger data, with APIs providing access to information. In such a mode, organization at a typical folder level may be a short-term consideration for people working on laptops/desktops with local storage and in more traditional HPC settings. Given some of the data coming into DANDI, data transfer in general would be too expensive (from a time and perhaps from a cost perspective). Therefore, accessing pieces of information as necessary for computation is where we are likely headed. This doesn't mean you cannot store information in a filesystem, but we are going to be moving a little closer to our API for our data search clients and use datalad as the filesystem model. If you consider pybids, the first thing it does is generate a database index. But unlike BIDS, where most datasets are a few GB at most, neurophys data is being generated at much larger sizes, easily hitting TBs in many situations.

Our current thinking in DANDI is driven to support a changing landscape in neurophysiology over the next 5-10 years a bit more than what people have been doing traditionally. We have been asked multiple times to store hundreds of TBs of data for single datasets. That's a scale at which most people will not be looking at filesystem organization. We can always provide a view of the underlying data through a metadata remapping model. I personally consider BIDS to be a practical and efficient view of a more complex information model (one that we are capturing in a more structured manner in NIDM, for example).

@yarikoptic commented:

FWIW on the aspect of

> sub-GUID/ses-YYYYMMDD_XXX_BBBB/rawdata

which exemplifies the suggestion to have the filename itself not reflect the metadata encoded in the upper directory names.

I originally also argued for BIDS not to duplicate sub- and ses- in the filename, since they are already present in the directory names. The main argument, IIRC, for keeping them in the filename as well was: make file names unambiguous across subjects/sessions so they could be copied/shared (with collaborators) etc. without causing confusion. Moreover, pragmatically, many tools display only the filename, not the full path to the file, in their listing of files. So when I load multiple files in some viewer, I can tell from the filename alone which subject they belong to, instead of wondering where that "rawdata" among 10 came from. In summary: I found it a bit more cumbersome but useful to have sub-, ses- in the filenames as well.
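
A small illustration of that trade-off (entity names here are hypothetical examples, not a fixed specification): repeating the sub-/ses- entities in the file name keeps basenames self-describing even when a tool drops the path:

```python
from pathlib import Path

def session_file(root, subject, session, suffix):
    # The file name repeats the subject and session entities already present
    # in the directory names, so the basename alone is unambiguous.
    fname = f"sub-{subject}_ses-{session}_{suffix}"
    return Path(root) / f"sub-{subject}" / f"ses-{session}" / fname

p = session_file("exp-vision", "monkeyA", "20200910", "ephys.nwb")
print(p.name)  # sub-monkeyA_ses-20200910_ephys.nwb -- readable on its own
```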

@tgbugs commented Sep 28, 2020

Some examples of the hierarchy we came up with are in this folder https://github.com/SciCrunch/sparc-curation/tree/master/resources/DatasetTemplate/primary.

At the top level we have 3 folders for different stages of data processing: source, primary, and derivative.

  • Source is for raw, unprocessed data.
  • Primary is where the metadata about subjects and samples is expected to be mapped; it is the data that is expected to be the most useful for reuse.
  • Derivative is where things like figures, visualizations, or analysis results can be deposited.

Within primary, the folder-level entities that we identified are pool, subject, sample, and performance.

  • Pools are collective entities where there are multiple subjects or samples that are present in a single data structure. That could be because the data is from a deidentified population, or simply because of the way that data was collected. NOTE these are NOT experimental groups (e.g. control or treatment).
  • Subjects are whole biological organisms: mouse, human, etc.
  • Samples are extracted parts of whole biological organisms.
  • Performances are a generalization of what BIDS calls a session. They are more general in that they represent the performance of any experimental protocol. In the openMINDS work they are called protocol executions; however, I strongly suggest not using that term given the rather morbid alternate meaning that "execution" can have in the context of animal research. This also makes it easier to distinguish the performance of experimental protocols from the execution of computer programs or simulations.

There is a nasty issue with splitting subjects and samples, which is that they really occupy the same location in the data structure, and they are overly narrow in that they are not sufficient to capture higher levels of organization such as populations or subject groups. Therefore I suggest using specimen as a top level entity which can capture populations, subjects, samples, etc. This leads to sparse tables, but it vastly simplifies the data model and the implementation of the validator code (the split samples/subjects is a pain to implement and maintain). I also suggest this because sample type is nearly always a required field, and thus can be used to distinguish between not only sample types, but also individual subjects, populations, etc. without further complicating the model. Happy to discuss this at length.

We don't have any requirement for the folder naming conventions aside from the fact that they should not have spaces in them. We are also suggesting not doing unfriendly things like giving a subject an identifier with a name that evokes a sample, e.g. heart-2 is a very bad identifier for a mouse. We map the type of entity the folder represents based on the metadata sheets, not by the prefix, with an exception for performance, where perf-X is a required prefix (pool- might become required in the future; we don't have many examples of datasets that have used pools yet). We will likely continue to revisit these design decisions in the future. One note is that many groups have simply followed the convention of using sub- and sam- as prefixes. One other issue that we have encountered is that researchers often reuse sample ids across subjects (which is impossible, and if it is pooling then pool- should be used, not subject), so we have to construct a primary key from subject id + sample id, and sometimes investigators do that themselves, so you wind up with keys that look like sub-x_sub-x-sam-y. There isn't an easy way around this unless you can get a validator deployed locally in labs prior to data submission.
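
For illustration, a minimal sketch (not the SPARC curation code) of constructing such a composite primary key from subject id + sample id:

```python
def sample_primary_key(subject_id, sample_id):
    """Build a dataset-wide unique key for a sample, since sample ids are
    often reused across subjects. Prefixes are added only if missing."""
    sub = subject_id if subject_id.startswith("sub-") else f"sub-{subject_id}"
    sam = sample_id if sample_id.startswith("sam-") else f"sam-{sample_id}"
    return f"{sub}_{sam}"

print(sample_primary_key("x", "1"))  # sub-x_sam-1
print(sample_primary_key("y", "1"))  # sub-y_sam-1 -- same sample id, distinct key
```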

@jcolomb commented Nov 25, 2020

About project/dataset.

With some colleagues, we are working on a folder template for projects inside the gin-tonic project. We are working on our first report. One interesting thing here is that we got similar feedback: one project is made of different experiments. We plan on creating a different data folder for each experiment/dataset. We did not go into the details of defining one experiment (yet) or the sub-structure of the data folder. But how it will be defined remains open (example cases: same method but different animals, same animal used but different method, same method and same animal but re-tested at a different age, different animals and different methods but made for the same purpose, ...).

Raw versus derived data question.

I like the Source, Primary, Derivative distinction. I have heard that this is a problem with BIDS: researchers are supposed to save and archive the source data, but the BIDS-formatted data is what would go in the Primary folder here (and BIDS has a raw/derived data distinction...). Derivative would be everything that can be trashed, because one could reproduce it, right? But this would mean that this distinction only makes sense at the higher-order folder level, and that this project would define the structure of the primary folder. This would also mean that one can use the source folder to feed data in whatever way the researcher sees fit, and have a data curation process (manual or automatic) change the source data into the form we want in the primary folder. In terms of data management, it would mean source is archived, Primary is the FAIR (open) version of the data, and Derivative is trashed upon project completion.

(Note that what I mean with derivative here would still be data files, like pivot tables, summaries or similar files. Figures and analyses should not be saved with the data.)

Subject/sample versus session

I still think that for animal studies, having the subject as a high-level folder makes little sense. As noted above by @tgbugs, one file often has data from multiple subjects; animals are often tested in groups (especially in non-rodent research), or data from different subjects are recorded in the same spreadsheet/video. I do not see any data management reason to split data by animal/subject. I would be curious to know why BIDS went for subject at the highest level; I would guess it is because people were used to emphasizing the human subject in early studies? (Which would argue against this structure for animal research.)

Misc

A basic tip in file naming is to use unique file names: the place where the file is located should not be necessary to derive information about the file. An exception can be made for readme files.

If GUID means this: https://de.wikipedia.org/wiki/Globally_Unique_Identifier, then it would make the name of the file too long. An internally used, shorter ID (RFID chip number, animal ID used in the animal facility, ...) would make more sense. As long as there is metadata describing the ID used, we should be safe, while keeping some human readability.

The same survey about gin-tonic made us think that researchers tend to prefer flat structures.

In another project, I also used a strategy where the metadata would indicate the file path, so that the structure is irrelevant for the data analysis (the code reads the metadata to access the data anyway). It is a pretty simple way to analyse data from different sources without having to move or rename files manually.
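
For illustration, a minimal sketch of that strategy (file and column names are hypothetical): the metadata table carries the path to each data file, so analysis code never depends on the directory layout itself:

```python
import csv
from pathlib import Path

def iter_recordings(metadata_tsv):
    """Yield (subject_id, session_id, data_path) triples from a metadata table
    whose 'file_path' column points at the data files, wherever they live."""
    meta = Path(metadata_tsv)
    with meta.open(newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            yield row["subject_id"], row["session_id"], meta.parent / row["file_path"]

# Usage (hypothetical):
# for subject, session, path in iter_recordings("recordings.tsv"):
#     analyse(path)
```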

@tgbugs commented Nov 25, 2020

I have been implementing validation of the SPARC Data Structure directory structure and have some further notes as a result of the process.

  1. If you do not include sub-/sam- prefixes in identifiers, then it is not possible to statically verify that the structure of the hierarchy is putatively correct without also having the specimen metadata files on hand with the types for those identifiers (see the sketch after this list).
  2. This leads to a question of whether the type prefix should or needs to be included in the identifier in the specimen metadata. On the one hand, telling users that they just need to match the identifiers is simpler than telling them to remove the type prefixes. On the other hand, users almost always have their own unprefixed identifiers, and forcing the type prefix into specimen identifiers creates a massive waste of space.
  3. We have allowed sample identifiers to be non-unique and local to their subject. This was a big mistake. Investigators must provide a unique identifier for samples across all subjects, otherwise there are countless edge cases that complicate the implementation and make it hard to provide consistent behavior (consider trying to model a transplantation study where subject-1 sample-1 is transplanted into subject-2 and then dissected out again later as subject-2 sample-1).
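
To make point 1 concrete, a sketch of a purely static check that only works because the type prefixes are in the directory names (prefixes and nesting rules here are illustrative assumptions, not the SPARC validator):

```python
import re

ALLOWED_CHILD = {"sub": {"sam", "perf"}, "sam": {"perf"}, "perf": set(), "pool": {"perf"}}
ENTITY = re.compile(r"^(sub|sam|perf|pool)-[A-Za-z0-9-]+$")

def check_path(parts):
    """Validate a relative path such as ['sub-1', 'sam-2', 'perf-3'] by prefix alone,
    without consulting any specimen metadata sheet."""
    previous = None
    for part in parts:
        m = ENTITY.match(part)
        if not m:
            return False, f"unrecognized entity: {part}"
        if previous and m.group(1) not in ALLOWED_CHILD[previous]:
            return False, f"{m.group(1)} cannot be nested under {previous}"
        previous = m.group(1)
    return True, "ok"

print(check_path(["sub-1", "sam-2", "perf-3"]))  # (True, 'ok')
print(check_path(["sam-2", "sub-1"]))            # nesting error
```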

@satra commented Nov 25, 2020

@SylvainTakerkart - thanks for pushing this along. In addition to @tgbugs's comments, we (in DANDI) have been working on the data/information model representing objects of interest (datasets, individual objects), with serialization to disk as a transform of that model. Part of this consideration is driven by the sizes of individual data files (in the TB range and growing) that we are expecting over the next year and our changing needs to support on-the-fly access to data or pieces of data over a network call. We are still tweaking the model, which would be serializable to disk, and will be releasing an updated API server by the end of the year together with datalad datasets. So any discussion of serialization is indeed a good thing to continue, but I wanted to put the object + metadata access consideration in play as well.

@robertoostenveld commented:

Are there example shared datasets to learn from? For example, if you look at the already shared dataset: what is good about it, what would you do differently?

Or imagine an already shared dataset: how would it look if you were to reorganize it in the newly proposed structure? Would all data and metadata find a logical place in the new structure, is there (meta)data that does not fit, or is there (meta)data missing from the already shared dataset that you consider crucial?

@jmbudd commented Dec 9, 2020

Hi, a few thoughts on the discussion:

  1. On a practical note maybe we should decouple discussion of directory organization from discussion about file naming: decide on directory organization first and then file naming next? There may be overlap but it may help simplify the decision-making process. Later we might try to enumerate the options under each to make any voting easier, e.g. subject vs session first etc.

  2. Completely agree with the principle of separation into raw data and derived/processed data for reasons of curation as well as sharing.

  3. The discussion, though informed by a great deal of collective experience, may be becoming too abstract at this stage. There is a risk that if we don't capture a wide enough range of the most frequent neuroscience use-cases at the outset, then converging on a common format for the range of different experimental designs and modalities may become overly time-consuming, with many iterations.

  4. To make the discussion more concrete, following on from @robertoostenveld's suggestion, we might collect together a range of existing datasets representing the most common use-cases we expect the format should cope with, to see how they might fit with any scheme we come up with. The Buzsaki lab, for example, has for many years shared their rodent hippocampal recording datasets, many of which are available at crcns.org. In the process they have created a custom (session-based) data structure for their increasingly complicated combined ephys and behavioural experiments (see https://buzsakilab.com/wp/data-structure-and-format/). So perhaps we could consider using one or more of their datasets as a use-case for rodent in vivo behavioural recording datasets, and add a few others from other domains such as VSD to encompass more of the general problem we are attempting to solve?

@robertoostenveld commented:

Regarding 3 and 4: you may want to consider the https://en.wikipedia.org/wiki/Pareto_principle which states (or claims) that about 80% of the value comes from 20% of the cases, or 80% of the revenue comes from 20% of the customers, or 20% of goods in a shop make up 80% of the sales, etc.

So rather than making a very broad overview of all possible cases, which will subsequently be very difficult to work with, you could try to identify those 20% of the cases that represent the most value. I.e., rather than investing in dealing with (relatively) exceptional cases, first invest in the bulk.

@jmbudd commented Dec 9, 2020

When it comes to the number of use-cases, I suggested one or more from rat ephys in vivo and a few from other domains. So, to keep things manageable, a total of around five or six use-cases might be enough to start with, and each could be selected as the most representative use-case by those with the most knowledge of each domain.

@satra commented Dec 9, 2020

> Are there example shared datasets to learn from? For example, if you look at the already shared dataset: what is good about it, what would you do differently?

@robertoostenveld - didn't see this earlier. You can look at the datasets on https://dandiarchive.org (there are about 38 dandisets covering 4 species, intra/extracellular and optical recordings).

@SylvainTakerkart commented:

> Are there example shared datasets to learn from? For example, if you look at the already shared dataset: what is good about it, what would you do differently?

Yes, this is an important question!!

This dataset (utah array ephys, several sessions, several animals, exhaustive metadata) could also be useful:
https://doi.org/10.1038/sdata.2018.55
https://doi.gin.g-node.org/10.12751/g-node.f83565/

@SylvainTakerkart commented:

> Are there example shared datasets to learn from? For example, if you look at the already shared dataset: what is good about it, what would you do differently?

actually, I've opened a new thread to centralize a list of potentially useful datasets: #7
