Representing composite datasets #213

Open
simleo opened this issue Oct 26, 2022 · 5 comments

@simleo
Contributor

simleo commented Oct 26, 2022

This issue is the outcome of a discussion with @ilveroluca after last Tuesday's Workflow Run RO-Crate meeting, where we started wondering how to represent secondary files (as in CWL's secondaryFiles) in a Workflow RO-Crate.

The actual use case that gave rise to the discussion was the representation of a Mirax image, which consists of:

  • a main file whose name ends with .mrxs;
  • a directory in the same location as the main file, with the same name minus the extension.

The directory contains data files, an index file etc. In the CRS4 tissue/tumor prediction workflow, in order to have CWL pick up all these files, we're using secondaryFiles. However, those files are not really secondary, especially the data files, which contain the actual image data. Rather, all files together contribute to the same multi-file input dataset. An example of a format with a similar layout is Zarr.
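
To make the layout concrete, here is a minimal sketch of how the files that make up one Mirax image could be gathered starting from the main file. The helper name `mirax_members` is hypothetical, and the code assumes only the layout described above (a `.mrxs` file plus a sibling directory with the same name minus the extension); it does not parse the format itself.

```python
from pathlib import Path


def mirax_members(main_file: Path) -> list[Path]:
    """Return the main .mrxs file followed by every file in its companion directory."""
    if main_file.suffix != ".mrxs":
        raise ValueError(f"not a Mirax main file: {main_file}")
    # The companion directory sits next to the main file and shares its name
    # minus the extension (assumption taken from the layout described above).
    companion = main_file.with_suffix("")
    if not companion.is_dir():
        raise FileNotFoundError(f"missing companion directory: {companion}")
    # Collect regular files only, in a stable order.
    data_files = sorted(p for p in companion.rglob("*") if p.is_file())
    return [main_file, *data_files]
```

Every path returned here belongs to the same logical input, which is why "secondary" is a poor description of the files in the directory.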

The real question, then, is how to represent such a dataset in RO-Crate. In RO-Crate, a Dataset maps to a directory, while single files are represented by File (alias for MediaObject). What about mixes of files and directories? One of the solutions we discussed is to recommend using hasPart on the main file:

{
    "@id": "Mirax2-Fluorescence-2.mrxs",
    "@type": "File",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
    ]
}

However, in RO-Crate File represents a single file, and the files listed in hasPart are not actually parts (byte chunks) of Mirax2-Fluorescence-2.mrxs, but rather of the same dataset that includes Mirax2-Fluorescence-2.mrxs. Moreover, not all formats clearly identify a "main" file: In Zarr, for instance, .zattrs and .zarray are both metadata files at the same level.

Another option could be to change the RO-Crate spec so that Dataset would map to a mix of files and directories, rather than a single directory. This is encompassed by the schema.org definition, which is very general. However, making such a change now, when several profiles and software packages already exist, would be very disruptive, especially for tools.

Though I've used an imaging example, the problem of representing a mix of files and directories as a single entity is quite general, so I think RO-Crate should have an explicit recommendation for this. Using a nested crate seems overkill, and depending on the format there might not be a single containing directory for the metadata file.

In principle, one could use CreativeWork, but it's probably too general: it would be hard for tools to identify a multi-file dataset as such. Should we add a custom type, e.g. CompositeDataset? Is there an existing type that could be a good fit instead, such as Collection? Does anyone know of existing attempts to represent such datasets in RO-Crate?

Another problem is what to put under @id, especially when there is no clearly identified "master" file, since all actual files would have to go under hasPart. Can internal references be used for data entities? E.g.:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "#Mirax2Fluorescence2"}
    ]
},
{
    "@id": "#Mirax2Fluorescence2",
    "@type": "CompositeDataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
    ]
}
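
As an illustration of the internal-reference idea, the following sketch builds such a grouping entity from a list of data entity ids. The function name `composite_entity` is hypothetical, and `CompositeDataset` is only the custom type floated in this issue, not an existing RO-Crate or schema.org type; whether a `#` id is acceptable here is exactly the open question.

```python
import json


def composite_entity(name: str, member_ids: list[str],
                     entity_type: str = "CompositeDataset") -> dict:
    """Build a JSON-LD entity that groups existing data entities under a local '#' id."""
    return {
        "@id": f"#{name}",  # internal reference: no file or directory has this id
        "@type": entity_type,  # hypothetical type proposed in this issue
        "hasPart": [{"@id": member_id} for member_id in member_ids],
    }


entity = composite_entity(
    "Mirax2Fluorescence2",
    ["Mirax2-Fluorescence-2.mrxs", "Mirax2-Fluorescence-2/Index.dat"],
)
print(json.dumps(entity, indent=4))
```

Passing `entity_type="Collection"` would produce the same structure with an existing schema.org type instead.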

@ptsefton
Contributor

ptsefton commented Oct 27, 2022

How about making the directory the Dataset, with a part that lies outside the directory? The @type could also be Collection, in which case the @id would have to be #Mirax2-Fluorescence-2 (or could it be a URI?):

{
    "@id": "Mirax2-Fluorescence-2",
    "@type": "Dataset",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        ...
    ]
}

@stain
Contributor

stain commented Oct 27, 2022

How about using https://schema.org/Collection as a contextual entity (linked via mentions from the root) to group data entities?

@ptsefton
Contributor

Collection can have hasPart properties.

@simleo
Contributor Author

simleo commented Nov 8, 2022

Summing up the discussion we had at the latest meeting and the suggestions here, the representation of

https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip

would be:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"}
    ],
    "mentions": [
        {"@id": "https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip"}
    ]
},
{
    "@id": "https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip",
    "@type": "Collection",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"}
    ]
},
{
    "@id": "Mirax2-Fluorescence-2.mrxs",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/",
    "@type": "Dataset"
}

Or rather, that is one of the possible representations. One might instead use a local id for the collection (this dataset is on the web, but that might not always be the case) and/or choose to list every single file:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
        {"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
    ],
    "mentions": [
        {"@id": "#Mirax2-Fluorescence-2"}
    ]
},
{
    "@id": "#Mirax2-Fluorescence-2",
    "@type": "Collection",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
        {"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
    ]
},
{
    "@id": "Mirax2-Fluorescence-2.mrxs",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/Index.dat",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/Slidedat.ini",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/Data0000.dat",
    "@type": "File"
},
...
{
    "@id": "Mirax2-Fluorescence-2/Data0023.dat",
    "@type": "File"
}

Yet another possibility is to list every single file and the dataset, linking to the auxiliary files from hasPart in the latter:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"},
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
        {"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
    ],
    "mentions": [
        {"@id": "#Mirax2-Fluorescence-2"}
    ]
},
{
    "@id": "#Mirax2-Fluorescence-2",
    "@type": "Collection",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"}
    ]
},
{
    "@id": "Mirax2-Fluorescence-2.mrxs",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
        {"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
    ]
},
{
    "@id": "Mirax2-Fluorescence-2/Index.dat",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/Slidedat.ini",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/Data0000.dat",
    "@type": "File"
},
...
{
    "@id": "Mirax2-Fluorescence-2/Data0023.dat",
    "@type": "File"
}
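
Whichever variant is chosen, a consuming tool would need to resolve a Collection to the files it groups. Below is a rough sketch of how that might look, assuming the crate's `@graph` has already been loaded as a list of dicts; the function name `collection_files` is hypothetical and not part of any existing RO-Crate library. It follows `hasPart` links through nested Dataset entities, so it copes both with the "list every file" and the "link to the directory" styles.

```python
def collection_files(graph: list[dict], collection_id: str) -> list[str]:
    """Return the @id of every File reachable from a Collection via hasPart."""
    by_id = {entity["@id"]: entity for entity in graph}
    files: list[str] = []
    seen: set[str] = set()
    stack = [collection_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated references
        seen.add(current)
        entity = by_id.get(current)
        if entity is None:
            continue  # referenced but not described in the graph
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "File" in types:
            files.append(current)
        parts = entity.get("hasPart", [])
        if isinstance(parts, dict):
            parts = [parts]  # JSON-LD allows a single value instead of a list
        stack.extend(part["@id"] for part in parts)
    return sorted(files)
```

Note that a directory Dataset without its own `hasPart` (as in the first representation) yields no files here; a tool would then have to fall back to walking the directory on disk.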

@simleo
Contributor Author

simleo commented Dec 20, 2022

From @pauldg: some collections may not have a mainEntity, e.g. in Galaxy.
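
For tools, this means mainEntity would have to be treated as optional. A minimal sketch of a tolerant lookup, assuming the Collection entity is available as a dict (the helper name `main_entity_id` is hypothetical):

```python
from typing import Optional


def main_entity_id(collection: dict) -> Optional[str]:
    """Return the @id of the collection's mainEntity, or None when it has none."""
    main = collection.get("mainEntity")
    if main is None:
        return None
    # JSON-LD references are usually {"@id": ...}, but tolerate a bare string.
    return main["@id"] if isinstance(main, dict) else str(main)
```

Callers can then decide what to do when the result is None, for example treating all parts of the collection as peers.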
