
Add .seq format for DE 16 and Celeritas Camera #11

Open
wants to merge 114 commits into main
Conversation

@CSSFrancis (Member) commented Aug 15, 2022

Description of the change

This adds support for reading data from the DE 16 and Celeritas cameras.

Some notes about the file format:
DE 16:

  • The DE 16 camera writes out multiple files: a metadata file, a dark reference, a gain reference, and a .seq data file. These all share the same naming scheme, so the reader picks up every matching file in the folder, or the files can be passed in directly.
  • The data is in a binary format, with each frame at a fixed offset and a time stamp following the frame.
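As a sketch of that frame-plus-timestamp layout (the field names, frame size, and timestamp widths below are illustrative assumptions, not the actual DE 16 header definition), the records can be memory-mapped with a numpy structured dtype:

```python
import numpy as np

# Illustrative layout only: field names, image size, and timestamp
# widths are assumptions, not the real DE 16 record definition.
frame_dtype = np.dtype([
    ("Array", np.uint16, (256, 256)),  # one detector frame
    ("sec", np.uint32),                # timestamp written after the frame
    ("ms", np.uint16),
    ("us", np.uint16),
])

def read_frames(filename, offset, n_frames):
    """Memory-map n_frames frame records starting at `offset` bytes."""
    return np.memmap(filename, mode="r", dtype=frame_dtype,
                     offset=offset, shape=n_frames)
```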

Celeritas:

  • Due to the speed at which this camera reads out, the sensor is split in two: a "top" and a "bottom" frame are read concurrently.
  • These frames are also saved in buffers, with multiple images packed into one long image per buffer.
    • This makes memory mapping the dataset a little harder, as the data is not one evenly spaced stream. I would like to add support for using the distributed scheduler, but that might have to wait.
    • The buffer size is saved in the XML file alongside the data. There may be a way to guess it given the XML file and the FPS of the camera.
    • The time stamp is only recorded once per buffer.
  • etc.
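To make the buffered layout concrete, here is one segment buffer sketched as a numpy structured dtype (the field names, frame shape, and frames-per-buffer value are illustrative assumptions; the real values come from the accompanying XML metadata):

```python
import numpy as np

# Assumed values for illustration; in practice these come from the XML
# metadata (e.g. the segment prebuffer count and image dimensions).
frames_per_buffer = 4
frame_shape = (64, 256)  # half-height: each file holds only top or bottom

# One buffer record: several frames packed together, followed by a
# single timestamp block for the whole buffer.
buffer_dtype = np.dtype([
    ("Array", np.uint16, (frames_per_buffer, *frame_shape)),
    ("sec", np.uint32),  # one timestamp per buffer, not per frame
    ("ms", np.uint16),
    ("us", np.uint16),
])
```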

Progress of the PR

  • Added DE 16 support for loading
    - [ ] Add DE 16 support for saving (potentially?)
  • Added Celeritas support
  • Add support for DE 16 using the distributed scheduler
  • Add support for Celeritas using the distributed scheduler
  • Update docstrings (if appropriate)
  • Update the user guide (if appropriate)
  • Add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst)
  • Check the formatting of the changelog entry in the readthedocs build of this PR (link in the GitHub checks)
  • Add tests for basic loading
  • Ready for review

Minimal example of the bug fix or the new feature

from rsciio.de import api
api.file_reader("test.seq") # read regular .seq

api.file_reader("test_Top_.seq", celeritas=True) # read celeritas .seq

@codecov bot commented Aug 15, 2022

Codecov Report

Patch coverage: 90.16% and project coverage change: +0.20% 🎉

Comparison is base (b045157) 84.95% compared to head (a340725) 85.15%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #11      +/-   ##
==========================================
+ Coverage   84.95%   85.15%   +0.20%     
==========================================
  Files          73       75       +2     
  Lines        8894     9250     +356     
  Branches     1955     2022      +67     
==========================================
+ Hits         7556     7877     +321     
- Misses        876      895      +19     
- Partials      462      478      +16     
Impacted Files Coverage Δ
rsciio/de/_api.py 89.77% <89.77%> (ø)
rsciio/utils/tools.py 80.26% <90.90%> (+6.93%) ⬆️
rsciio/de/__init__.py 100.00% <100.00%> (ø)


☔ View full report in Codecov by Sentry.

@CSSFrancis (Member, Author) commented:

@sk1p I know we have talked about adding support for the DE Celeritas camera to LiberTEM and HyperSpy. If you have the chance, can you look over this PR? The hardest part is dealing with the segment prebuffer for the Celeritas camera.

I wanted to add support for distributed scheduling using the scheme proposed by @uellue here, but due to the nature of the prebuffer the data isn't evenly spaced in the binary file, which makes implementing this in a general way fairly difficult.

@sk1p (Contributor) commented Aug 15, 2022

I know we have talked about adding support for the DE Celeritas camera to LiberTEM and HyperSpy. If you have the chance, can you look over this PR? The hardest part is dealing with the segment prebuffer for the Celeritas camera.

I can have a look - I'd also like to try this with real data, did you manage to upload some to the drop link I gave you some time ago?

In general, what is this project's stance on testing with real input data? It could be possible to publish a set of (small-ish) reference data sets on e.g. Zenodo and download those in CI runs.

I wanted to add support for distributed scheduling using the scheme proposed by @uellue here, but due to the nature of the prebuffer the data isn't evenly spaced in the binary file, which makes implementing this in a general way fairly difficult.

Yeah - in case of uneven spacing, it's probably required to do a sparse search pass over the data, for example by reading the image headers at N positions in the whole data set, and mapping out where it can be split - if I understood you correctly. Or is the coarse structure evenly spaced, i.e. it's possible to calculate offsets to images just from their index?

Anyway, instead of just a straight mmap, there would need to be a function that decodes whatever is in the file to a numpy array. That's also something needed for quite a few other formats, e.g. FRMS6, binary MIB, ...

@CSSFrancis (Member, Author) commented Aug 15, 2022

I can have a look - I'd also like to try this with real data, did you manage to upload some to the drop link I gave you some time ago?

Right now the data is all hosted in the tests/de_data/celeritas_data folder. There are smallish (1-20 MB) datasets collected using a couple of different camera modes. These are probably the best data to use for testing.

In general, what is this project's stance on testing with real input data? It could be possible to publish a set of (small-ish) reference data sets on e.g. Zenodo and download those in CI runs.

We try to test with real input data as often as we can. That being said, the data is currently included with the package, and it might be better to host it somewhere else eventually. I was meaning to create an issue regarding this.

Yeah - in case of uneven spacing, it's probably required to do a sparse search pass over the data, for example by reading the image headers at N positions in the whole data set, and mapping out where it can be split - if I understood you correctly. Or is the coarse structure evenly spaced, i.e. it's possible to calculate offsets to images just from their index?

So the data is structured like this:
[figure: "Seq Scheme" diagram of the buffered .seq file layout]
So it's not quite uneven: the images are saved in chunks, and you can calculate an image's offset if you know the number of images in a buffer.
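That offset calculation can be sketched as follows (the parameter names and the header handling are illustrative; the real sizes come from the file metadata):

```python
def frame_offset(i, frames_per_buffer, frame_bytes, timestamp_bytes,
                 header_bytes=0):
    """Byte offset of frame `i` when frames are packed into buffers
    that each end with a single timestamp block."""
    buffer_index, index_in_buffer = divmod(i, frames_per_buffer)
    # total size of one buffer record on disk
    buffer_bytes = frames_per_buffer * frame_bytes + timestamp_bytes
    return (header_bytes
            + buffer_index * buffer_bytes
            + index_in_buffer * frame_bytes)
```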

Anyway, instead of just a straight mmap, there would need to be a function that decodes whatever is in the file to a numpy array. That's also something needed for quite a few other formats, e.g. FRMS6, binary MIB, ...

Any examples of how you do this? Can you just create a function that maps a frame to an offset in the data and then just apply it?

@jlaehne (Contributor) commented Aug 16, 2022

Right now the data is all hosted in the tests/de_data/celeritas_data folder. There are smallish (1-20 MB) datasets collected using a couple of different camera modes. These are probably the best data to use for testing.

The long-term idea is to host the files in the repo but exclude them from the installation, so that they are just downloaded on demand. I don't remember the name of the package that can do this @ericpre. But it would be a good idea to create an issue to put it on the todo list.

rsciio/de/api.py (outdated, resolved)
Comment on lines 69 to 111
def parse_xml(file):
    try:
        tree = ET.parse(file)
        xml_dict = {}
        for i in tree.iter():
            xml_dict[i.tag] = i.attrib
        # clean_xml
        for k1 in xml_dict:
            for k2 in xml_dict[k1]:
                if k2 == "Value":
                    try:
                        xml_dict[k1] = float(xml_dict[k1][k2])
                    except ValueError:
                        xml_dict[k1] = xml_dict[k1][k2]
    except FileNotFoundError:
        _logger.warning(
            "File %s not found. Please move it to the same "
            "directory to read the metadata.",
            file,
        )
        return None
    return xml_dict
Contributor

Does it make sense to have this as a generic utility function? The flattening, cleaning and conversion performed here seems to be specific to the DE metadata XML format. Any reason not to re-use convert_xml_to_dict instead?

Contributor

I see that convert_xml_to_dict doesn't cope well with human-readable XML, which can have both a .text and child nodes, so that may need fixes before it can be used.

rsciio/de/api.py Outdated
ImageBitDepth: int
The bit depth of the image. This should be 16 in most cases
TrueImageSize: int
The size of each frame buffersin bytes. This includes the time stamp and
Contributor

Suggested change
The size of each frame buffersin bytes. This includes the time stamp and
The size of each frame buffers in bytes. This includes the time stamp and

rsciio/de/api.py Outdated
Comment on lines 549 to 772
top_mapped = np.memmap(top, offset=offset, dtype=dtypes, shape=total_buffer_frames)
bottom_mapped = np.memmap(
bottom, offset=offset, dtype=dtypes, shape=total_buffer_frames
)

if lazy:
top_mapped = da.from_array(top_mapped)
bottom_mapped = da.from_array(bottom_mapped)

array = np.concatenate(
[
np.flip(
top_mapped["Array"].reshape(-1, *top_mapped["Array"].shape[2:]), axis=1
),
bottom_mapped["Array"].reshape(-1, *bottom_mapped["Array"].shape[2:]),
],
1,
)
Contributor

Any examples of how you do this? Can you just create a function that maps a frame to an offset in the data and then just apply it?

I think this was the main point you were asking about. It's more of a mapping from a chunk slice to an array. Each array chunk is created from both the top and bottom memory map, which are only created inside of the delayed function. To structure this according to the dask docs on memory mapping, it could look like this (sketch, untested):

def mmap_load_chunk(top, bottom, shape, dtype, offset, sl):
    top_map = np.memmap(top, mode='r', shape=shape, dtype=dtype, offset=offset)["Array"]
    top_flat = top_map.reshape(-1, *top_map.shape[2:])
    top_sliced = top_flat[sl]
    bottom_map = np.memmap(bottom, mode='r', shape=shape, dtype=dtype, offset=offset)["Array"]
    bottom_flat = bottom_map.reshape(-1, *bottom_map.shape[2:])
    bottom_sliced = bottom_flat[sl]
    return np.concatenate([
        np.flip(top_sliced, axis=1),
        bottom_sliced,
    ], 1)

(with mmap_dask_array adjusted accordingly)

Does this make sense?
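For reference, the adjusted mmap_dask_array could look something like this sketch, following the memory-mapping pattern from the dask docs (the loader is passed in explicitly so it can be the mmap_load_chunk above, and the chunking, frame shape, and output dtype parameters are assumptions):

```python
import numpy as np
import dask
import dask.array as da

def mmap_dask_array(load_chunk, top, bottom, shape, dtype, offset,
                    n_frames, frame_shape, chunk_frames,
                    image_dtype=np.uint16):
    """Sketch: build a dask array whose chunks each call `load_chunk`
    with their own slice, so the memmaps are opened lazily inside the
    workers rather than up front in the client process.

    frame_shape is the shape of one assembled (top + bottom) frame and
    image_dtype its pixel dtype; both are assumptions here.
    """
    load = dask.delayed(load_chunk)
    chunks = []
    for start in range(0, n_frames, chunk_frames):
        stop = min(start + chunk_frames, n_frames)
        chunks.append(
            da.from_delayed(
                load(top, bottom, shape, dtype, offset,
                     slice(start, stop)),
                shape=(stop - start, *frame_shape),
                dtype=image_dtype,
            )
        )
    return da.concatenate(chunks, axis=0)
```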

Contributor

Since the np.concatenate + np.flip operation touches a sizable chunk of data, it may be more efficient to replace it with a numba function that also inlines the application of dark_img/gain_img alongside the flipping/concatenation, for cache efficiency.

…irectly point to files instead of using glob.
# Conflicts:
#	docs/supported_formats/de.rst
#	rsciio/de/specifications.yaml
#	rsciio/tests/test_de.py
#	rsciio/utils/tools.py
3 participants