Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integration with xarray? #79

Open
swamidass opened this issue Oct 21, 2023 · 11 comments
Open

integration with xarray? #79

swamidass opened this issue Oct 21, 2023 · 11 comments
Labels
enhancement New feature or request

Comments

@swamidass
Copy link

Would an official integration with xarray be useful? I published an independent package that is fairly small, but should also add most the key functionality. It could stay as an independent package, but I am open to adding it to this library.

https://github.com/swamidasslab/tiffslide-xarray

@ap--
Copy link
Collaborator

ap-- commented Oct 24, 2023

Hey @swamidass,

Thanks for opening the issue!

I think it would useful to add native support for xarray to tiffslide. I wonder what approach would be best here:

When comparing the datasets produced by tiffslide-xarray and a simple approach via xarray's zarr engine, I see there are mainly differences in the added metadata and dimensions:

import tiffslide
import xarray
import tiffslide_xarray

fn = "/path/to/file/CMU-1.svs"

dataset1 = xarray.open_zarr(
    tiffslide.open_slide(fn).zarr_group.store,
    consolidated=False,
)
dataset2 = tiffslide_xarray.open_all_levels(fn)
>>> dataset1
<xarray.Dataset>
Dimensions:  (Y: 32914, X: 46000, S: 3, Y1: 8228, X1: 11500, Y2: 2057, X2: 2875)
Dimensions without coordinates: Y, X, S, Y1, X1, Y2, X2
Data variables:
    0        (Y, X, S) float32 ...
    1        (Y1, X1, S) float32 ...
    2        (Y2, X2, S) float32 ...
Attributes:
    multiscales:  [{'datasets': [{'path': '0'}, {'path': '1'}, {'path': '2'}]...

>>> dataset2
DataTree('None', parent=None)
│   Dimensions:  (y: 32914, x: 46000, c: 3)
│   Coordinates:
│     * y        (y) int64 0 1 2 3 4 5 6 ... 32908 32909 32910 32911 32912 32913
│     * x        (x) int64 0 1 2 3 4 5 6 ... 45994 45995 45996 45997 45998 45999
│     * c        (c) <U1 'r' 'g' 'b'
│   Data variables:
│       image    (y, x, c) <class 'numpy.uint8'> ...
│   Attributes: (12/30)
│       tiffslide.comment:          Aperio Image Library v10.0.51
│   \n46920x33014 [...
│       tiffslide.vendor:           aperio
│       tiffslide.objective-power:  20
│       tiffslide.mpp-x:            0.499
│       tiffslide.mpp-y:            0.499
│       aperio.AppMag:              20
│       ...                         ...
│       aperio.Top:                 23.449873
│       aperio.User:                b414003d-95c6-48b0-9369-8010ed517ba7
│       tiff.ImageDescription:      Aperio Image Library v10.0.51
│   \n46920x33014 [...
│       tiffslide.series-index:     0
│       tiffslide.series-axes:      YXS
│       source_file:                /path/to/file...
├── DataTree('level1')
│       Dimensions:  (y: 8228, x: 11500, c: 3)
│       Coordinates:
│         * y        (y) float64 1.5 5.5 9.5 13.5 ... 3.29e+04 3.291e+04 3.291e+04
│         * x        (x) float64 1.5 5.5 9.5 13.5 ... 4.599e+04 4.599e+04 4.6e+04
│         * c        (c) <U1 'r' 'g' 'b'
│       Data variables:
│           image    (y, x, c) <class 'numpy.uint8'> ...
│       Attributes: (12/30)
│           tiffslide.comment:          Aperio Image Library v10.0.51
│       \n46920x33014 [...
│           tiffslide.vendor:           aperio
│           tiffslide.objective-power:  20
│           tiffslide.mpp-x:            0.499
│           tiffslide.mpp-y:            0.499
│           aperio.AppMag:              20
│           ...                         ...
│           aperio.Top:                 23.449873
│           aperio.User:                b414003d-95c6-48b0-9369-8010ed517ba7
│           tiff.ImageDescription:      Aperio Image Library v10.0.51
│       \n46920x33014 [...
│           tiffslide.series-index:     0
│           tiffslide.series-axes:      YXS
│           source_file:                /path/to/file...
└── DataTree('level2')
        Dimensions:  (y: 2057, x: 2875, c: 3)
        Coordinates:
          * y        (y) float64 7.5 23.5 39.5 55.5 ... 3.287e+04 3.289e+04 3.29e+04
          * x        (x) float64 7.5 23.5 39.5 55.5 ... 4.596e+04 4.598e+04 4.599e+04
          * c        (c) <U1 'r' 'g' 'b'
        Data variables:
            image    (y, x, c) <class 'numpy.uint8'> ...
        Attributes: (12/30)
            tiffslide.comment:          Aperio Image Library v10.0.51
        \n46920x33014 [...
            tiffslide.vendor:           aperio
            tiffslide.objective-power:  20
            tiffslide.mpp-x:            0.499
            tiffslide.mpp-y:            0.499
            aperio.AppMag:              20
            ...                         ...
            aperio.Top:                 23.449873
            aperio.User:                b414003d-95c6-48b0-9369-8010ed517ba7
            tiff.ImageDescription:      Aperio Image Library v10.0.51
        \n46920x33014 [...
            tiffslide.series-index:     0
            tiffslide.series-axes:      YXS
            source_file:                /path/to/file...

I wonder if we can just add all of this metadata by default to the zarrstore we get from tifffile and provide this natively to xarray.

These docs are relevant for defining dimensions in the zarr spec https://docs.xarray.dev/en/latest/internals/zarr-encoding-spec.html

  • Should all metadata be available through xarray?
  • Should we make a list with relevant information we'd like to store in the xarray Dataset?

Cheers,
Andreas 😃

@ap-- ap-- added the enhancement New feature or request label Oct 24, 2023
@swamidass
Copy link
Author

swamidass commented Oct 25, 2023

That's a neat trick with the Zarr store! That aspect is certainly a better way to do this than I have set up currently, and I'll move over to that (unless there are some reasons to keep using the openslide API?). Other than this, let me enumerate the differences:

tiffslide-xarray:

  1. Adds all the metadata as attributes, so it is accessible from the outset in xarray.
  2. Puts all of the levels into different "groups" with the same dimension names, using a DataTree:
  3. Uses 'xyc' for the dimension names.
  4. Adds coordinates that indicate the level0 pixel coordinates of pixel.

In contrast, the approach you used,

  1. Does not add the metadata as attributes.
  2. Puts all the levels in the "same" group with different dimension names.
  3. Numbers dimensions that appear to be have their base name drawn from TIFF metadata?
  4. Does not add coordinates.

Let's address each these differences in turn, and perhaps you can give me some advice.

1. Metadata

I think that the metadata is critical to pass through, for a whole host of reasons. We also add "source_file" to automatically track provenance. Ideally I'd like to think through some metadata conventions that might provide a breadcrumb trail of several things we might do with the data.

My question though is whether or not there is a performance penalty for reading it from the file, or if it is just read in no matter what. If there is a performance penalty (is there?) perhaps we should make adding the metadata optional.

2 and 4. Coordinates, and Same or Different Group/Dimension Names

There are major advantages to including coordinates and placing each level into a different group in a single DataTree. We can do things then like aligned slicing of multiple arrays at once:

region_of_interest = all_slide_levels.sel(x =slice(1000, 1500), y = slice(3400, 4500) )

Critically (and this is the major benefit of xarray), this sort of slicing pulls exact same region of interest (aligned with one other) of all the levels. That's only possible because (1) the dimensions are all named the same and (2) coordinates specify precisely how each image aligns with level 0.

Additional datasets can be added too. For example, a target map and heat map at different sampling level can be added in, and they would be sliced in aligned ways too.

The one concern I have is that inferring coordinates for slides after 0 does make the assumption that the left and bottom most parts of the slide are trimmed when the slide is downsampled. That is usually how it is done, but not always. If that assumption does not hold in the tiff image, there can be a few pixel error in the levels. In most applications that probably isn't too important, but its worth knowing and fixing if possible using meta data, if it exists (does it?).

3. Naming dimensions

I'm inclined to move over to the approach you are using, which seems like it might be more robust. Do all tiff files include an axes argument? If not, what do the names you are using default to?

@swamidass swamidass reopened this Oct 25, 2023
@swamidass
Copy link
Author

Okay, there is some more information that it worth noting, that turns out to be quite important.

I did a test reading the data into memory. Using your code, that would be:

dataset1.load()

Versus

dataset1.load()

Turns out that the zarr based approach is more than 10x slower, at least on my computer, than the read_region based code that I used.

That is substantial difference in performance.

@ap--
Copy link
Collaborator

ap-- commented Oct 26, 2023

Setting mask_and_scale=False prevents a dtype conversion and eliminates the time difference:

import contextlib
import time
import tiffslide
import xarray
import tiffslide_xarray


@contextlib.contextmanager
def measure(label):
    t0 = time.time()
    yield
    print(label, time.time() - t0)


base_path = "/Users/andreaspoehlmann/Development/zarr-benchmark/data/TifffileSVSDataset"
fn = f"{base_path}/CMU-1.svs"

# via zarr_engine
dataset1 = xarray.open_zarr(
    tiffslide.open_slide(fn).zarr_group.store,
    consolidated=False,
    mask_and_scale=False,
)
# via tiffslide_engine
dataset2 = tiffslide_xarray.open_all_levels(fn)

with measure("zarr_engine"):
    dataset1["0"].load()
with measure("tiffslide_engine"):
    dataset2["image"].load()

(1) Metadata

Tiffslide can just add the metadata to the zarr store attrs. That should be easy, and the source filename (if available) can be added to those too. There shouldn't be a large penalty since it's read anyways when you instantiate a TiffSlide object and access properties or read a region.

(2) Coordinates and Dimension Names

I wonder if additional info in the zarr store can be provided to tell xarray to load the data as a DataTree with the relevant info. We could ask that in the datatree repo.

(3) Dimension names

tifffile provides these dimension names. In the general case those can be different dependent on the type of file you load: https://github.com/cgohlke/tifffile/blob/35c19b7c6c8b85aa09e490520677b5a2a31c14e4/tifffile/tifffile.py#L18009C12-L18056C9 tiffslide was written with a focus on tiff-like images used in a digital pathology context. So most of them are 2d images with 3 channels...

@swamidass
Copy link
Author

This is really helpful. And the solution is fairly obscure, so I'm glad we looked into it. I wouldn't have stumbled into that easily.

I think I have a way to do this now. I'll be back in touch soon/later.

@swamidass
Copy link
Author

So I just released a beta version: 0.1b0. You can see the key code here: https://github.com/swamidasslab/tiffslide-xarray/blob/main/tiffslide_xarray/data.py

Turns out the zarr backed approach is the right way to go. By tweaking the chunking, it can be much faster than read_region. Moreover, it is far easier to implement and doesn't rely on any new code. So we just have to trust that tifffile is making a valid Zarr store, and that xarray is using it correctly. Both those dependencies are well tested. So, that is how the library works.

@swamidass
Copy link
Author

As for attributes and coords, you can see in that file how I add them. It is fairly straightforward.

They don't have to be in the zarr store. Instead, I use the metadata from tiffslide to construct coordinates (with units and downsampling), and add attributes to the resulting xarray object. That works just fine.

I do wonder though:

  • Are other sorts of metadata we would want to add that I missed. Can you think of anything? Or anything that I missed?
  • Is there a comprehensive list of fileextensions we expect to signal files tiffslide can read?
  • Is there a way to access the extra images sometimes in files (e.g. the thumbnail, etc)?

@swamidass
Copy link
Author

As for a comparison with approach you initially offered:

# via zarr_engine
dataset1 = xarray.open_zarr(
    tiffslide.open_slide(fn).zarr_group.store,
    consolidated=False,
    mask_and_scale=False,
)

This is what we gain from the tiffslide wrapping around this (which is only a fairly short amount of code):

  1. Documentation of the feature,
  2. obviates need to know two obscure options need to be false
  3. obviates need to knowtiffslide exposes a zarr group
  4. carries over meta data from the slide
  5. computes and downsamples coordinates (which are a key need for xarray)
  6. The resulting xarray can be pickled, and the underlying Tiffslide object is cached for use by multiple objects.
  7. A hook into xarray so xr.open_dataset automatically knows how to open tiffslide compatible files.

I'm noticing too that it is working pretty well as an independent library, and I'm working on some additional extensions to xarray in this package, such as micron from/to pixel management of coordinates. For that reason I don't think it makes most sense to directly fold it into this repo.

I do wonder if a lighter and more helpful solution would be to add a method to the TIffSlide class:

class TiffSlide:
  def to_xarray(...): # 
    try:
      import xarray
      return xarray.open_dataset(self, ...)
   except Exception as e:
       # reraise with user instructions to install optional dependencies

This way, it doesn't change the require dependencies of tiffslide at all, and is a fairly minimal amount of work to add. Though, obviously, testing and documentation always take time. What do you think?

@ap--
Copy link
Collaborator

ap-- commented Nov 7, 2023

Hi @swamidass,

Thanks for implementing the changes in your library already! I'd be happy to integrate xarray support directly into tiffslide, and if you have time, it would be great to discuss the specifics in more detail in a PR. I'll provide a few more details, on how I would imagine that integration to look like, and what issues and edge cases we might face further down the line.

Turns out the zarr backed approach is the right way to go. By tweaking the chunking, it can be much faster than read_region. Moreover, it is far easier to implement and doesn't rely on any new code. So we just have to trust that tifffile is making a valid Zarr store, and that xarray is using it correctly.

Awesome. Tifffile is one of the best maintained libraries out there. And xarray's zarr support is well tested too, so I have zero concerns regarding both.

They don't have to be in the zarr store. Instead, I use the metadata from tiffslide to construct coordinates

I still wonder if the code could be simplified a lot if we'd manually add them to the zarr store .zattrs. We would just need to add the zattrs={'key': 'value', ...} kwarg to this call: https://github.com/Bayer-Group/tiffslide/blob/679ea5a226fd1fa512932f85c8de060b2c8336f8/tiffslide/_zarr.py#L115C43-L115C43

  • Are other sorts of metadata we would want to add that I missed.
  • Can you think of anything?
  • Or anything that I missed?

This depends a bit on the specific use case. For the pathology file formats that tiffslide currently supports, the metadata is usually parsed out of a tag stored in the tiff image and then available in the properties. Some formats store extra images with barcodes, that usually contain information that stores information about the physical sample on the slide. (But that would be something to consider for the future)

  • Is there a comprehensive list of fileextensions we expect to signal files tiffslide can read?

It's less about the file extension, and more about what formats are supported (and detected) via TiffSlide.detect_format()

  • Is there a way to access the extra images sometimes in files (e.g. the thumbnail, etc)?

tifffile can map all of these to zarr stores (or at least I'm not aware of any limitations there). tiffslide tries to provide a openslide compatible interface to those via TiffSlide.associated_images which is a Mapping[str, PIL.Image.Image] but only really tested for svs images, and unsuited for this (but probably a good starting point).

I'm noticing too that it is working pretty well as an independent library, and I'm working on some additional extensions to xarray in this package, such as micron from/to pixel management of coordinates. For that reason I don't think it makes most sense to directly fold it into this repo.

I think one reason to integrate it would be the following:

There are some tiff based (and also non-tiffbased) digital pathology formats that can be represented as zarr arrays, but they require composition of individual arrays into a larger array to represent the final scan. (one supported example is Leica SCN, where individual arrays in the tiff file come from different physical regions of the slide.) Xarray could then be used to build this larger virtual array when all zarr arrays have x and y coordinates corresponding to their real location. Another format example (currently unsupported) would be Mirax (see: #44)

The integration I'd imagine would be a relatively slim tiffslide.xarray submodule that offers a public function:

def to_xarray(slide: TiffSlide, **other_options_that_might_be_needed) -> DataTree:

This submodule could then intially check that only slides that provide "standard" (non-composed) zarrstores can be converted, and gradually add more support for more formats. At some stage it would then become the default backend to support formats like LeicaSCN and Mirax. (assuming it's fast enough)

Let me know if this sounds interesting,
Andreas 😃

@swamidass
Copy link
Author

Thanks for implementing the changes in your library already! I'd be happy to integrate xarray support directly into tiffslide, and if you have time, it would be great to discuss the specifics in more detail in a PR.

Sure, I will make a PR when I have some time.

(one supported example is Leica SCN, where individual arrays in the tiff file come from different physical regions of the slide.) Xarray could then be used to build this larger virtual array when all zarr arrays have x and y coordinates corresponding to their real location.

This is a really good use case, one that xarray would excel on. In fact, it's the sort of thing that could both make supporting Mirax far easier.

Regarding LeicaSCN do you have some test files that exhibit the key features I could look at? Along with explanation of things that wouldn't be obvious from inspecting the TiffSlide attributes and standard functions? If I had those exmamples, I'd have a better idea of what could/should be done.

I think one reason to integrate ...

Another key reason to integrate is to add testing. Right now Zarr is not a tested format, and you haven't made any public "contracts" about how it's interface/attributes will work in the future. So it could accidentally shift at any time. If you add some key test cases there, the whole integration and utility would be far stronger and reliable.

I still wonder if the code could be simplified a lot if we'd manually add them to the zarr store .zattrs. We would just need to add the zattrs={'key': 'value', ...} kwarg to this call:

That could be a good idea. It would simplify the code. More importantly, it would split concerns. You wouldn't have to worry about keeping something stable in the openslide API to better serve users of the zarr store. That decoupling is good practice, and will make this more fun. I think we can be fairly certain that:

  1. Not many (any?) one is currently using the zarr interface.
  2. It would be strange for code to rely on attributes not being present in the zarr file.

For those two reasons, I think we can iterate pretty easily on this if you are willing to release some new versions of tiffslide (or create a new branch). I'd suggest that in this case a "kitchen sink" approach makes a lot of sense. I'd just add all the metadata you can put in that would not be misleading. I'd suggest these rules:

  1. Being sure to place attributes on the "right thing," i.e. make sure that attributes that apply to level 1 are on the level 1 group, not on the level 2 group, etc.

  2. Consistently prefix computed/normalized attributes with "tiffslide." as you've already done in the usual attributes!

  3. Keep all attributes currently in use in the public interface entirely unchanged, and ensure all of them appear in the .zarr attributes.

  4. Pass through all raw annotations from the slide to the zarr store entirely unmodified, except where there are datatype errors that need to be managed, or the attribute was "None" (in which case you can just drop it). It seems you already follow this principle, but there is a lot of metadata you haven't put into attributes. So maybe there is some filtering going on?

  5. Add associated images as groups too, as long as there isn't a performance hit. If there is a performance hit, we should try to list out the names of them in the attributes, and provide a different interface for loading them (most likely an option).

If you can do these things (particularly #1 and #3, #4 and #5) there wouldn't be any reason to use the openslide mimicing interface. We'd be able to rely exclusively on the zarr interface, which would indeed greatly simplify the code.

Once we start exposing attributes this way, we really should document them in the tiffslide documentation, at least with a sentence description.

IMHO, I think this would be a good direction to move forward on. I think we could really increase the utility of the library beyond primarily being a migration library from openslide. And it might even make supporting Mirax easier to do too.

What do you think?

For Down the Road...

  1. II hope those PIL associated images are lazy constructed, as usually they aren't needed.

@ap--
Copy link
Collaborator

ap-- commented Nov 9, 2023

Regarding LeicaSCN do you have some test files that exhibit the key features I could look at? Along with explanation of things that wouldn't be obvious from inspecting the TiffSlide attributes and standard functions? If I had those exmamples, I'd have a better idea of what could/should be done.

you can get test images here: https://openslide.cs.cmu.edu/download/openslide-testdata/Leica/

TiffSlide parses the metadata information in those files and extracts a SeriesCompositionInfo dictionary, which contains the shapes of the composited levels as well as the locations of the seriess in each of those composited levels (basically the offsets). It's stored under the "tiffslide.series-composition" key: https://github.com/ap--/tiffslide/blob/3b6b1fc490c3acf3bd136eaeba77b67954c1dc01/tiffslide/tiffslide.py#L869-L1002

This information is then used here: https://github.com/ap--/tiffslide/blob/3b6b1fc490c3acf3bd136eaeba77b67954c1dc01/tiffslide/_zarr.py#L195-L236 to retrieve offset array data from the composited image. The current implementation checks every tiff series for potential overlap, and is only fast-enough, because Leica SCN images have (at least the ones I've seen) just a few series. For formats like Mirax this will be on the order of ~100s or ~1000s and will need a spatial index to be fast.

If you can do these things (particularly 1 and 3, 4 and 5) there wouldn't be any reason to use the openslide mimicing interface.

Tiffslide was built to be used as a cloud-native drop in replacement for openslide, so the openslide interface is important for its main usecase. Users could always use tifffile directly if they need more direct access. The TiffSlide.zarr_group property is a convenience attribute specifically for that.

That being said:

  • work on (3) is important, because it is also useful for serializing all extra metadata of the wholeslide image to a kerchunk compatible dict for direct cloud access. This is interesting for larger image datasets.
  • Regarding (1): The metadata is usually specific to the image file. Composition data would likely be stored for now as described above on the lowest level.
  • I'm not sure I completely follow (2 and 4). All of them are already implemented in tiffslide.
  • And I'm happy to accept PRs for (5). I don't think it's required for this current iteration.

And it might even make supporting Mirax easier to do too.

That would be a big win.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants