Improve metadata handling #89

Open
jlaehne opened this issue Feb 16, 2023 · 5 comments

Comments

@jlaehne
Contributor

jlaehne commented Feb 16, 2023

Describe the functionality you would like to see.

As brought up by @francisco-dlp in LumiSpy/lumispy#53 (comment), it would be desirable to have more universal metadata handling. Currently, every file_reader independently maps metadata from original_metadata following the HyperSpy conventions. This is inconvenient for other packages that want to build on RosettaSciIO, and it leads to a lot of redundant code. Instead, we could, for example, define the mapping in something like YAML files: each folder could include a hyperspy.yaml, and potentially further mapping files for other applications.
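
To make the file-based idea concrete, here is a minimal sketch of a purely 1:1 mapping. It assumes the mapping has already been loaded from a file such as hyperspy.yaml into a plain dictionary of dotted paths. The node names in `MAPPING` and the helpers `get_path`, `set_path` and `map_metadata` are hypothetical and not part of RosettaSciIO.

```python
# Hypothetical mapping, as it might look after loading a "hyperspy.yaml" file.
# Keys are dotted paths in the target metadata tree, values are dotted paths
# in original_metadata. All node names are made up for illustration.
MAPPING = {
    "General.title": "Header.SampleName",
    "Acquisition_instrument.Detector.exposure_time": "Header.Exposure",
}


def get_path(tree, path):
    """Return the value at a dotted path, or None if any node is missing."""
    node = tree
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def set_path(tree, path, value):
    """Set a value at a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = path.split(".")
    node = tree
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def map_metadata(original_metadata, mapping):
    """Build a new metadata dict by copying every mapped node that exists."""
    metadata = {}
    for target, source in mapping.items():
        value = get_path(original_metadata, source)
        if value is not None:
            set_path(metadata, target, value)
    return metadata


original = {"Header": {"SampleName": "GaN nanowire", "Exposure": 0.5}}
mapped = map_metadata(original, MAPPING)
```

With such a layout, a reader would only ship the mapping file, and the traversal code would live in one shared place instead of being repeated in every file_reader.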

Of course, metadata mapping is not always 1:1, where a node from one tree is mapped directly to a position in the other metadata tree; that case can be handled with a basic dictionary. The mapping definition would need to cover several additional situations:

  • if/elif/else-like statements, where a certain field in original_metadata decides which other field is mapped or what string/value is set in a certain node of metadata
  • processing the content of a field with Python (e.g. one-line code segments), such as unit conversion or calculating an overall exposure time from multiple acquisitions (number of frames x time per frame)
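
The two situations above could be sketched as a list of rules, where each rule either picks a value through if/elif/else-like cases or computes it from one or more source fields. This is only an illustration under stated assumptions: the rule layout, the node names and the helper functions are hypothetical, and the transforms are written as Python callables, whereas a YAML file would have to store them as strings that are evaluated with appropriate care.

```python
def get_path(tree, path):
    """Return the value at a dotted path, or None if any node is missing."""
    node = tree
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def set_path(tree, path, value):
    """Set a value at a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = path.split(".")
    node = tree
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Hypothetical rules; all node names are invented for illustration.
RULES = [
    {
        # Computed value with unit conversion:
        # total exposure [s] = number of frames x time per frame [ms] / 1000
        "target": "Detector.exposure_time",
        "sources": ["Acq.Frames", "Acq.TimePerFrame_ms"],
        "transform": lambda frames, ms: frames * ms / 1000.0,
    },
    {
        # if/elif/else-like rule: the first matching case wins,
        # otherwise the default is used.
        "target": "Signal.signal_type",
        "sources": ["Acq.Mode"],
        "cases": [("cathodo", "CL"), ("photo", "PL")],
        "default": "Luminescence",
    },
]


def apply_rules(original_metadata, rules):
    """Apply every rule whose source fields are all present."""
    metadata = {}
    for rule in rules:
        values = [get_path(original_metadata, p) for p in rule["sources"]]
        if any(v is None for v in values):
            continue  # skip rules with missing inputs
        if "cases" in rule:
            result = rule.get("default")
            for expected, outcome in rule["cases"]:
                if values[0] == expected:
                    result = outcome
                    break
        else:
            result = rule["transform"](*values)
        set_path(metadata, rule["target"], result)
    return metadata


original = {"Acq": {"Frames": 10, "TimePerFrame_ms": 200, "Mode": "cathodo"}}
mapped = apply_rules(original, RULES)
```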

The developers of the Nomad repository/ELN (https://github.com/nomad-coe/nomad) have implemented similar functionality based on what they call "schemas". Maybe we can team up with them (@markus1978, @haltugyildirim) to implement such a mapping in RosettaSciIO. In turn, RosettaSciIO's ability to read a number of (partly binary) data formats should be valuable to Nomad for supporting a broader range of experiments and for integrating processing via e.g. HyperSpy.

Additional information

Should not hold back an initial release, but should be on the roadmap.

@francisco-dlp
Member

Thanks @jlaehne for bringing back this important topic.

Indeed RosettaSciIO does map all metadata to HyperSpy's metadata specification. This comes with the advantage that it can translate all mapped metadata across formats (hence the link with the Rosetta Stone), but it is an overhead when this is not required. Therefore, it should be an optional feature (task 1).

As you rightly point out, the mapping to HyperSpy's metadata specification is not done very smartly. Ideally, one should be able to specify the mapping in an easy-to-maintain mapping specification file, e.g. in YAML (task 2). The task is far from trivial and is of interest beyond RosettaSciIO, so ideally it should be performed by an independent tool. Nomad's schemas seem like a good candidate.

Finally, HyperSpy's metadata specification is defined in the User Guide. It would be better to define the metadata using e.g. Nexus' specifications, or simply to switch to Nexus' electron microscopy format (task 3).

@ericpre
Member

ericpre commented Feb 17, 2023

Now that there is a Nexus definition for electron microscopy, it would be great to use it and provide feedback on its usability.

@jat255
Member

jat255 commented Apr 13, 2023

I wanted to share a few links to maybe push this discussion along (I think this is a great idea and would be interested in helping work on it, as interoperability is a critical part of a mature data ecosystem):

  • EM Glossary: https://codebase.helmholtz.cloud/em_glossary/em_glossary
    • The EM glossary group is working on a community standardized controlled vocabulary for electron microscopy terms. The NexusEM implementation is related to this effort (and the terms defined in the glossary I believe are informing what goes into NexusEM)
    • In some initial conversations with members of that group, they're very interested to see software-level implementations of the glossary to see how it works in "the real world", and would be happy to have HyperSpy/RScIO involved
  • Scythe EM metadata schema: https://github.com/materials-data-facility/scythe/blob/master/scythe/schemas/electron_microscopy.json
    • The Scythe project's goal is to provide a shared resource of metadata 'extractors' (not EM specific) that are each controlled by a schema
    • I worked briefly on implementing a prototype EM JSONSchema (linked above) and extractor, but the project as a whole has stalled a bit due to lack of person-power; regardless of technology (i.e. JSONSchema or something else), I think formally specifying our metadata schema (or using another's) would be very powerful as it allows for real-time and automated validation of metadata structures
    • My example makes heavy use of HyperSpy for the mechanics of reading metadata, but the values to map in/out are manually written at this point and very simplistic. I like the idea of some sort of standardized .yaml file or something else to define mappings.
  • MaRDA metadata extraction working group: https://github.com/marda-alliance/metadata_extractors
    • This is not an actual implementation of anything, but is a recently launched effort from the Materials Research Data Alliance to attempt to coordinate efforts on metadata extraction (just wanted to make people aware of it)
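
As a rough illustration of what automated validation against a formally specified metadata schema could look like, here is a minimal sketch. It is a hand-rolled check in plain Python, not JSONSchema and not the Scythe schema. The `SCHEMA` layout and the `validate` helper are hypothetical, and the node names are only examples.

```python
# Hypothetical schema: dotted path -> expected type and whether it is required.
SCHEMA = {
    "General.title": {"type": str, "required": True},
    "Acquisition_instrument.TEM.beam_energy": {
        "type": (int, float),
        "required": False,
    },
}


def get_path(tree, path):
    """Return the value at a dotted path, or None if any node is missing."""
    node = tree
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def validate(metadata, schema):
    """Return a list of problems; an empty list means the metadata passed."""
    errors = []
    for path, spec in schema.items():
        value = get_path(metadata, path)
        if value is None:
            if spec["required"]:
                errors.append(f"missing required node: {path}")
        elif not isinstance(value, spec["type"]):
            errors.append(f"wrong type at {path}: {type(value).__name__}")
    return errors


good = {"General": {"title": "Sample A"}}
bad = {"Acquisition_instrument": {"TEM": {"beam_energy": "200"}}}

good_errors = validate(good, SCHEMA)
bad_errors = validate(bad, SCHEMA)
```

A real implementation would more likely delegate this to an established schema language and validator, which is where a shared schema across packages would pay off.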

@CSSFrancis
Member

@jat255 These are all great resources. It does seem like there is a fair bit of duplication of effort occurring in the community, and it would be good to get ahead of that. Is there any way we can bring more people into the fold or integrate packages?

Developer time in the microscopy community seems to be very limited so anything we can do to reduce duplication is very valuable!

Maybe a meeting with all interested parties would help to get the ball rolling.

@CSSFrancis
Member

@jat255 It might also be worthwhile to send someone to a MaRDA meeting. I can attend, but I don't know if I am the most qualified person to represent RosettaSciIO.
