Adding open source EBSD datasets #411

Open
argerlt opened this issue Dec 1, 2022 · 4 comments
Labels
documentation Relates to the documentation

Comments


argerlt commented Dec 1, 2022

This is a continuation from #406, but with a slightly expanded scope.

I would like to add an open source datasets page to ORIX, similar to Kikuchipy or Pyxem. In particular, I'm thinking of three useful datasets:

  1. the US Air Force Research Lab AF96 datasets: six 2100 by 1000 EBSD scans of a martensitic steel, often split up into a set of 90 overlapping 512 by 512 scans. Available through Globus under a CC-BY 4.0 license.

  2. the Dream3D IN100 dataset of serial-sectioned 3D EBSD scans, 189x189x117 pixels in size, stored as 117 .ang files. Available through the BlueQuartz website under a BSD open source license.

  3. the MTEX EBSD files, used in all the MTEX examples. Available through GitHub under a GPL license.

The EASY thing would be to just add an open_databases.rst page that looks something like this, but better, ideally with a few pictures (heavily copied from pyxem):

===========================
Open datasets and workflows
===========================

Here are some open datasets that are helpful for testing functions on real data:

#. `AF96 Martensitic Steel <https://doi.org/10.18126/iv89-3293>`_: A collection of 6 EBSD scans of AF96 martensitic steel, each about 2 million pixels in size. Details on the exact composition, preparation, and collection of this data can be found in the following two publications:
        https://doi.org/10.1016/j.dib.2019.104471
        https://doi.org/10.1016/j.matchar.2019.109835

#. `Dream 3D's Inconel 100 serial section scans <http://dream3d.bluequartz.net/Data/>`_: A set of 117 2D .ang files representing a 3D EBSD dataset 189x189x117 pixels in size.

#. `MTEX's EBSD example files <https://github.com/mtex-toolbox/mtex/tree/develop/data/EBSD>`_

However, I think the problem here is that new users want something they can learn with immediately, as opposed to learning orix's IO and fiddling with different import methods until they find the right way to import files into CrystalMap objects. Additionally, Globus is a massive pain to download from (this is why I made the original PR; it was far too inconvenient for new users to get the AF96 datasets).

In this respect, as a new user, I loved how in MTEX I could just type `mtexdata ferrite` and instantly have an MTEX EBSD object. Not an .ang or .oim file I then had to correctly import, but an actual pre-imported object.

To that end, for at least the first two examples, I think it would be useful to host orix .h5 versions of the files on Zenodo, then include a snippet of code in the download example that can download and import those files. Bonus if it imports _fetcher from orix.data so that duplicates of files aren't downloaded and can instead be quickly loaded from the local cache.
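The cache-first behavior described above can be sketched in a few lines. The helper below is hypothetical, using only the standard library; orix's real machinery is pooch (via orix.data._fetcher), which additionally verifies file hashes:

```python
import os
import urllib.request


def fetch_cached(url, cache_dir, filename):
    """Return a local path to ``filename``, downloading it from ``url``
    only if it is not already present in ``cache_dir``.

    Hypothetical helper for illustration; orix delegates this to pooch.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, filename)
    if not os.path.exists(local_path):
        # Only hit the network on a cache miss
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

With something like this, a docs snippet could simply call `io.load(fetch_cached(...))` and repeated runs would load straight from the local cache.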

Thoughts? @hakonanes @pc494

@hakonanes

Thanks for raising this issue @argerlt. As mentioned elsewhere, I agree that we should add this open datasets page to the docs.

I wasn't aware of the openly available 3D dataset from Dream3D. It would be ideal for testing potential 3D functionality down the line.

I think the problem here is new users want something they can learn with immediately, as opposed to learning ORIX's IO and having to fiddle with different import methods until they find the right way to import files into CrystalMap objects.

This is a valid point. I opened #412 to address this.

In this respect, as a new user, I loved how in MTEX I could just type `mtexdata ferrite` and I instantly had an MTEX ebsd object. Not an ang, or a .oim I had to then correctly import, but an actual pre-imported object.

I assume you are aware of the two small test datasets orix is packaged with? If not, see e.g. orix.data.sdss_ferrite_austenite(). We could add more datasets to the data module. But, I strongly believe we should only add datasets used in a doc example or tutorial to show off some functionality (or in tests). I believe all datasets in mtexdata are used in MTEX's docs.

To that end, for at least the first two examples, it think it would be useful to have ORIX .h5 versions as files hosted on Zenodo, then include a snippet of code in the download example that can download and then import those files.

I agree that easy access to such established open test datasets is important. Instead of converting these datasets to orix's HDF5 file format though, we should add readers for the formats they are already stored in.

I could for example easily import the first slice of the Dream3D nickel dataset

>>> from orix import io, plot
>>> xmap = io.load("Slice_001.ang")
/home/hakon/kode/orix/orix/io/plugins/ang.py:268: UserWarning: Number of columns, 10, in the file is not equal to the expected number of columns, 14, for the 
assumed vendor 'tsl'. Will therefore assume the following columns: euler1, euler2, euler3, x, y, unknown1, unknown2, phase_id, unknown3, unknown4, etc.
  warnings.warn(
>>> xmap
Phase    Orientations    Name  Space group  Point group  Proper point group     Color
    0  37989 (100.0%)  Nickel         None          432                 432  tab:blue
Properties: unknown1, unknown2, unknown3, unknown4
Scan unit: nm
>>> xmap.scan_unit = "um"
>>> ipfkey = plot.IPFColorKeyTSL(xmap.phases[0].point_group)
>>> rgb_z = ipfkey.orientation2color(xmap.orientations)
>>> xmap.plot(rgb_z, overlay="unknown1", remove_padding=True)  # IQ overlay

[Image: maps_ipfz_iq — IPF-Z orientation map with image quality (IQ) overlay]

And the first of the raw AF96 datasets (the raw FOV 1)

>>> from orix import io, plot
>>> xmap = io.load("Field of view 1_EBSD data_Raw.ang")
/home/hakon/kode/orix/orix/io/plugins/ang.py:268: UserWarning: Number of columns, 10, in the file is not equal to the expected number of columns, 14, for the 
assumed vendor 'tsl'. Will therefore assume the following columns: euler1, euler2, euler3, x, y, unknown1, unknown2, phase_id, unknown3, unknown4, etc.
  warnings.warn(
>>> xmap
Phase      Orientations       Name  Space group  Point group  Proper point group       Color
    0         29 (0.0%)  Austenite         None          432                 432  tab:orange
    1      14900 (0.7%)    Ferrite         None          432                 432    tab:blue
    2   2202639 (99.3%)       None         None         None                None   tab:green
Properties: unknown1, unknown2, unknown3, unknown4
Scan unit: nm
>>> xmap.scan_unit = "um"
>>> ipfkey = plot.IPFColorKeyTSL(xmap.phases[1].point_group)
>>> rgb_z = ipfkey.orientation2color(xmap.rotations)  # Hack, should be xmap["Ferrite"].orientations
>>> xmap.plot(rgb_z, overlay="unknown1", remove_padding=True)

[Image: maps_ipfz_iq — IPF-Z orientation map with image quality (IQ) overlay]

(orix does not read either of these correctly ("nm" instead of "um", and the phase IDs are incorrect for the AF96 dataset). I opened #413 to track these bugs.)

The AF96 dataset is a nice test dataset because it is large (> 2 million points, 205 MB). We could use this to test how well our algorithms perform in terms of memory, CPU load and time. As for the Dream3D dataset, if we use this dataset in the docs, we can make it available via orix.data.

I anticipate that the orix HDF5 format will change in the future as more people use it and suggest improvements. I therefore do not want to upload files in this format to any permanent source, like Zenodo.

Finally, regarding the MTEX datasets, I suggest we only link to these in the docs. Based on our discussion in #389, I think we should restrict the use of other GPL code to a minimum, ideally none, if we ever hope to make the license of orix or parts of orix more permissive (say BSD3). I think the best way forward is to work towards better interoperability between orix and MTEX (and other similar software, like Dream3D).

@argerlt

argerlt commented Dec 5, 2022

I assume you are aware of the two small test datasets orix is packaged with?

Yes, functions like these were exactly what I was trying to mimic in #409. Your point that "orix data is for examples only" is a valid one though, so I'm thinking about how best to mimic this as a 10-line code snippet in open_datasets.rst.

I anticipate that the orix HDF5 format will change in the future as more people use it and suggest improvements. I therefore do not want to upload files in this format to any permanent source, like Zenodo.

This is actually close to what MTEX does now. It creates a cache (similar to pooch), downloads the original files if they aren't already in the cache, then converts them into an MTEX EBSD object. It then saves the EBSD object as a .mat file with a note about the version of MTEX used, and if the same MTEX version tries to use that data again, it just loads the .mat file instead.
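That version-checked cache could be mimicked in Python with just the standard library. A minimal sketch (the `load_cached` helper and `CACHE_VERSION` tag are hypothetical; a real implementation would cache a CrystalMap rather than a plain object):

```python
import os
import pickle

# Version tag written alongside the cached data, e.g. the orix version
# used to build the cache (hypothetical value for illustration)
CACHE_VERSION = "0.10.0"


def load_cached(cache_path, build):
    """Return the cached object if it was written by the current version,
    otherwise rebuild it with ``build()`` and overwrite the cache.

    ``build`` would e.g. download raw files and convert them to a CrystalMap.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            payload = pickle.load(f)
        if payload.get("version") == CACHE_VERSION:
            return payload["data"]
    data = build()
    with open(cache_path, "wb") as f:
        pickle.dump({"version": CACHE_VERSION, "data": data}, f)
    return data
```

On a version bump the stale cache is silently rebuilt, which matches the MTEX behavior of re-importing when the .mat file was written by a different version.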

Finally, regarding the MTEX datasets, I suggest we only link to these in the docs. Based on our discussion in #389, I think we should restrict the use of other GPL code to a minimum, ideally none, if we ever hope to make the license of orix or parts of orix more permissive (say BSD3).

Yup, I agree.

I think the best way forward is to work towards better interoperability between orix and MTEX (and other similar softwares, like Dream3D).

On this note, it's worth mentioning that Dream3D is produced by BlueQuartz, and the small_in100 datasets were collected by Mike Groeber (related paper here) while he was working there. Groeber is a co-author on everyone's favorite 2015 Rowenhorst et al. paper, and Dream3D uses the exact same rotation representation conventions as orix for all its internal calculations (code here). Dream3D is also working on improving its Python API, and is now installable via Conda, so cross-compatibility might be very realistic in the near future. It also has a pretty excellent EbsdLib for reading various EBSD formats, but it's in C++ and GUI-centric, so maybe not useful for the orix team.

Alright, I think I have some ways of doing this that will make everyone happy. I will write an example "Open_Datasets.rst" file and post it here, and we can go from there.


@argerlt
Copy link
Contributor Author

argerlt commented Dec 12, 2022

Below is a draft of how I would suggest an open_datasets.rst file be done. I have yet to make the Dream3D Zenodo file page as I am still trying to figure out the license information, but the AF96 example is complete, so people can try out the download function and give feedback. Also, I'm not sure how best to add pictures to .rst files, but it would be nice to add the ones above as previews.

orix/doc/open_datasets.rst:


================
Open Datasets
================

Orix includes several small datasets intended specifically for testing and tutorial
purposes in the :mod:`orix.data` module.

Additionally, this page lists openly available datasets hosted on Zenodo.
These datasets can be downloaded using the following code:

.. code-block:: python

    import os, copy, zipfile, glob, pooch
    from orix.data import _fetcher


    def download_from_Zenodo(zenodo_DOI, filename, md5=None):
        """Download the requested Zenodo dataset into the local orix cache
        if not previously downloaded, and unzip sets of data if necessary."""
        cache_path = os.path.join(str(_fetcher.path), filename)
        url = "https://zenodo.org/record/{}/files/{}".format(zenodo_DOI, filename)
        # Add local path and url to a deep copy of orix's default pooch fetcher
        zenodo_fetcher = copy.deepcopy(_fetcher)
        zenodo_fetcher.urls[filename] = url
        zenodo_fetcher.registry[filename] = md5
        # Download if not already in the cache
        download = pooch.HTTPDownloader(progressbar=True)
        path = zenodo_fetcher.fetch(filename, downloader=download)
        if filename.endswith(".zip"):
            with zipfile.ZipFile(cache_path, "r") as archive:
                archive.extractall(zenodo_fetcher.path)
            return glob.glob(os.path.join(cache_path[:-4], "*"))
        else:
            return path

Note that these works are not part of orix itself, and the original sources
should be properly acknowledged in any derivative work.

Users wishing to add their own datasets to this list are encouraged to open
a related issue on the `orix GitHub page <https://github.com/pyxem/orix/issues>`_. Please
include the Zenodo DOI, copyright information, and preferred citation if available.


AF96 Martensitic Steel
========================


A collection of five 2116x1048 pixel EBSD scans of AF96, originally released as
part of the following Data in Brief article:

    `Datasets acquired with correlative microscopy method for delineation of prior austenite grain boundaries and characterization of prior austenite grain size in a low-alloy high-performance steel <https://doi.org/10.1016/j.dib.2019.104471>`_

This data is under a Creative Commons license, CC BY 4.0. Therefore, any work
using these datasets must credit the original authors, preferably by citing
the paper listed above. Further details on the preparation of the samples
can be found in the following publication:

    `Correlative microscopy for quantification of prior austenite grain size in AF9628 steel <https://doi.org/10.1016/j.matchar.2019.109835>`_

Copies of these datasets, as well as 40 smaller 512x512 scans taken from the larger
ones, can be found on `Zenodo <zenoodo.link>`_. These can be automatically downloaded
and converted to orix CrystalMaps as follows:

.. code-block:: python

    from orix import io

    big_map_paths = download_from_Zenodo(7430395, "AF96_Large.zip", "md5:60c6eefd316e2747c721cd334ed3abaf")
    small_map_paths = download_from_Zenodo(7430395, "AF96_Small.zip", "md5:01890e210bcbc18516c571585452ed26")

    # load a single small map
    small_xmap = io.load(small_map_paths[0])
    # load a single large map
    large_xmap = io.load(big_map_paths[0])
    # load a list of all 5 large maps (can take several minutes)
    large_xmaps = [io.load(path) for path in big_map_paths]

.. Include picture?

Inconel100 3D-EBSD
========================


A collection of 117 EBSD scans, each 189x189 pixels in size. The scans were taken
from successive layers of a serially sectioned piece of Inconel 100, and used for
validation purposes in BlueQuartz's open source software package `Dream3D <http://dream3d.bluequartz.net/>`_.
This dataset was first reported on in the following publication:

    `3D reconstruction and characterization of polycrystalline microstructures using a FIB–SEM system <https://doi.org/10.1016/j.matchar.2006.01.019>`_

Additionally, Dream3D contains several tutorials for visualizing and processing this
dataset `found here <http://www.dream3d.io/2_Tutorials/EBSDReconstruction/>`_.

A copy of this dataset can be found on `Zenodo <I havent made a link yet>`_. These can be
automatically downloaded and converted to orix CrystalMaps as follows:

.. code-block:: python

    in100_scan_paths = download_from_Zenodo(12345678910, "Small_IN100.zip", hash)

    # load a single map
    in100_xmap = io.load(in100_scan_paths[0])
    # load a list of all 117 maps (can take several minutes)
    in100_xmaps = [io.load(path) for path in in100_scan_paths]

.. Include picture?
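Once the 117 slices are loaded, they could be stacked into a 3D volume for the kind of 3D functionality discussed above. A minimal numpy sketch, assuming each slice has been reduced to a 189x189 per-pixel scalar array (e.g. a confidence index or phase ID); the names here are illustrative only, not orix API:

```python
import numpy as np

# Stand-in for per-slice scalar maps extracted from the 117 loaded
# CrystalMaps (random data here, purely for illustration)
slice_maps = [np.random.default_rng(i).random((189, 189)) for i in range(117)]

# Stack along a new leading z axis to get a (117, 189, 189) volume
volume = np.stack(slice_maps, axis=0)
print(volume.shape)  # (117, 189, 189)
```

A volume like this could then be sliced or visualized layer by layer while waiting for native 3D support.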

