Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

image load failures due to bad abort support in geotiff #604

Open
acthp opened this issue May 22, 2022 · 30 comments
Open

image load failures due to bad abort support in geotiff #604

acthp opened this issue May 22, 2022 · 30 comments
Labels
bug Something isn't working

Comments

@acthp
Copy link

acthp commented May 22, 2022

Describe the bug
Image tile loading fails after some use of the UI. Some tiles fail to load, and zoom/pan does not resolve the issue.

I can avoid the bug from viv by removing the abort signal in the calls to geotiff. This might be a good workaround for viv until geotiff is fixed. However, the blockedsource.js file in geotiff looks extremely broken. I suspect there are several more race conditions due to incorrect use of async/await, but the abort code is the easiest one to hit from avivator.

To Reproduce
Load ome tiff in avivator. Load a new channel. While channel is loading, zoom and pan the image. Eventually geotiff will throw an AggregateError, and some tiles will never load.

Expected behavior
Tile loading continues to work

Screenshots

Environment:

  • Release or git hash: 0.12.8
  • Browser: chrome
  • Browser version:
@acthp acthp added the bug Something isn't working label May 22, 2022
@ilan-gold
Copy link
Collaborator

ilan-gold commented May 23, 2022

I am unable to reproduce. Are you using the demo site? This was a known issue (to me) but was able to fix it by updating geotiff (we contribute to that repo). Could you provide a screen video or something?

@ilan-gold
Copy link
Collaborator

Specifically, if you're not using the demo site, make sure to set up cacheSize: 0 as documented - I've seen this error before when not setting this.

@acthp
Copy link
Author

acthp commented May 23, 2022

Here's a screencast of getting the AggregateError on the demo site.

avivator-error

Image link https://avivator.gehlenborglab.org/?image_url=https://xenapublicimages.s3.amazonaws.com/MEL08-1-1.ome.tif

@ilan-gold
Copy link
Collaborator

I will look into this then. I suspect it has to do with a recent PR we made there. This used to be a problem but I was nearly 100% certain I had fixed it in a local copy of geotiff we had - we then merged that copy into the real thing but it's definitely possible something was lost in translation. Yes, that section of the code is a mess - any help there would be appreciated. It's not completely unapproachable, just takes some debugging.

@ilan-gold
Copy link
Collaborator

ilan-gold commented May 23, 2022

As an aside, I think there may be a CORS issue with Firefox and your bucket.

EDIT: I am actually having trouble even with localhost i.e the local dev version of Viv....

@acthp
Copy link
Author

acthp commented May 23, 2022

Regarding blockedsource.js, the code looks very confused, but maybe I don't understand the intent. So, my analysis could be wrong.

The first major issue is that it inspects the cache, then releases the cpu (by use of await), then pulls from the cache. This can't work reliably because the cache is dynamic, it changes over time. The values cached before await may not be there after await. Every use of await introduces another race condition. The code should instead pull what it can from the cache at the top of fetch(), and hold on to those values. That should be the only use of cache in fetch(). It should hold on to the requests it makes, rather than expecting them to appear in cache later, because, again, the cache is dynamic: every addition to cache can remove something from cache, so you can't wait on multiple things and then expect them to all be there.

The second issue is the abort code. I suspect the intent of the API is that the same tile will not be fetched twice if requested more than once in a short time interval. Without reference counting the number of requests on the same tile, I don't believe there's a reliable way to support abort, because with no count there's no way to know if the tile is still needed. The current code attempts to work around this by retrying aborted tiles, but this introduces races on this.abortedBlockIds.

I suspect the desired functionality requires many fewer state variables than the current implementation. I could try reworking it, but I would probably do a clean rewrite rather than trying to fix what's there.

@ilan-gold
Copy link
Collaborator

ilan-gold commented May 23, 2022

I would be open to that @acthp - I held some similar suspicions of the code but could not have wrote it clearly like you did. Thanks so much - if you don't implement it, I will probably take a stab at it but community help is always appreciated!

As for the abort code, this comes from deck.gl (although I helped, it was not something I requested). The intent is less fetching the same tile twice than cancelling tiles that have been "passed through" as you zoom in. I don't think the same tile is requested twice in general use cases (and if it has been loaded, is cached by deck.gl) although I suppose if you zoom in and out on the same tile, this could happen where the first request is cancelled - this wasn't really the intention for the feature I imagine (since this isn't too common an operation).

@acthp
Copy link
Author

acthp commented May 23, 2022

Ok, I'll see what I can put together in the next few days.

Sorry about the CORS issue on that image. If you want to test it I believe localhost:8080 is allowed. I can open it up more if you need it.

@ilan-gold
Copy link
Collaborator

Ah, we use port 3000 for development. I'll circle back to this on Friday to check in. Thanks!

@ilan-gold
Copy link
Collaborator

ilan-gold commented Jun 3, 2022

Have you had a chance to look at this @acthp? Otherwise I'll look at it early next week.

@acthp
Copy link
Author

acthp commented Jun 3, 2022

Yes, though I'm mostly focused on writing tests atm. I think there are three groups of bugs: races on the cache, races on the abort, and a suspected off-by-one that leads to an exception when the request range matches the block size. Trying to shore up the tests to make sure there's not more.

@ilan-gold
Copy link
Collaborator

Amazing, thank you! The last one sounds horrifying, been in a similar place with this library!

@Dev-Lan
Copy link

Dev-Lan commented Feb 15, 2023

Hi all, I'm curious what the current state of this issue is? I am working on using parts of VIV code (specifically the loader/ImageLayer) and I've run into a similar issue that I currently suspect has the same root cause.

If it is helpful I can share/explain what I am running into.

@acthp
Copy link
Author

acthp commented Feb 15, 2023

@Dev-Lan I rewrote the fetch code in geotiff, which resolved the various races we were experiencing. Code is here:

https://github.com/ucscXena/geotiff.js/commits/ucsc-beta

I haven't kept it up to date. I may come back to it in the future, but currently it's not clear if we will continue with viv, due to performance concerns.

@ilan-gold
Copy link
Collaborator

@Dev-Lan Could you expand on the issue? @acthp Could you also expand on your performance concerns, even if you are using a fork? Is this about tiff or the library? We do support OME-NGFF.

I am wondering if it is time to discuss a plan for "adopting" the geotiff.js project. I know there are other folks out there who are interested in this project but also find it problematic because the lead developer responds extremely slowly to things.

@ilan-gold
Copy link
Collaborator

There was also geotiffjs/geotiff.js#336 which may help alleviate your issues. Have you tried this? It purports to solve some issues.

@ilan-gold
Copy link
Collaborator

This shouldn't be a fix, because we should be setting the cache size to be infinite, but maybe it has a side-effect?

@acthp
Copy link
Author

acthp commented Feb 16, 2023

@ilan-gold Re: performance, I haven't been in this code for awhile so my recollection is probably wrong in places. But as I remember it, the main issue we've had is lag in rendering tiles. We've also had problems with the main thread being blocked, freezing the whole UI.

When I looked into it, it appeared from back-of-the-envelope calculations that it was fetching a lot more data than it should need to render a given view. It canceled requests much less frequently than I would expect. For example, when zooming and panning rapidly I would expect to see cancellations for views that were merely passed through, but I rarely saw cancelled requests. So, most of the time was spent waiting for tiles that weren't used. To bring the data sizes down we tried some compressed formats, but that would lock up the UI for seconds at a time. It appears that the decompress is happening on the main thread. Also, the javascript decompression performance was poor, and pretty much negated any potential gain from the compression.

While troubleshooting these issues I felt that the code complexity was quite a bit beyond what was necessary for the problem, which made performance analysis difficult. Some of that is inherited from the deckgl and geotiff APIs. E.g. if I recall correctly the tile requests are driven from deckgl, so analysis of tile fetch and cancel require digging into that code. Geotiff implements BlockedSource, but the underlying image is already tiled, so there are two different chunking mechanisms working on top of each other. I'm not sure this is necessary.

We think it might be easier for us to build a system that has just a few abstractions, and fetches tiles that are individual, pre-computed, compressed image files so the browser can render them natively in the C++. If we do try to continue rendering directly from the pyramid-style image files, at minimum we would put the decompress on a separate thread, and would likely move the code to webassembly.

Again, sorry for any inaccuracies. I hope to get back to this in a few weeks.

@Dev-Lan
Copy link

Dev-Lan commented Feb 16, 2023

I'm experiencing two of these, though potentially exposed in a slightly different way. I have a tiff stack of low resolution time-varying data. Panning/zooming does not trigger any fetches, but scrubbing through time does.

We've also had problems with the main thread being blocked, freezing the whole UI.

Right now, this is my biggest issue. This is somewhat random, but can be triggered by 2 quick updates causing 2 fetches for images. The whole page becomes unresponsive.

I would expect to see cancellations for views that were merely passed through

This is a smaller issue since I can work around it with the UI, but was surprising. Again, if I scrub through say 5 frames quickly, if deckgl or fetching can't keep up I would expect subsequent changes to abort/override previous changes, and get a single visual update with the 5th frame. Instead, when I'm done scrubbing all 5 frames show in sequence.

Like I said, this is fixable with a debounce or throttle, but I did try adding imageLayer.value?.state?.abortController?.abort(); before I make a new deckgl.setProps call. This did result in the UI behavior I was expecting, though it triggered new console errors and did not fix the crash issue.

It is also definitely possible I'm using something wrong or at least unexpectedly. I'm using vue, and have had the most success so far using plain js deckgl, and using the loaders and I'm layer classes from viv.

My code is at https://github.com/visdesignlab/aardvark. 90% of the relevant code is at https://github.com/visdesignlab/aardvark/blob/main/src/components/ImageViewer.vue, and a little bit is in https://github.com/visdesignlab/aardvark/blob/main/src/stores/imageViewerStore.ts

I can also provide videos or meet virtually if that would be useful in any way.

@Dev-Lan
Copy link

Dev-Lan commented Sep 12, 2023

Hi again! I'm getting back into my project using this code. I was wondering if there have been any developments on this issue? As is I can still easily cause the whole page to freeze from loading subsequent frames too quickly.

I have tried swapping out the geotiff code from the fork in the previous comment, but haven't been successful in getting it to build.

  1. I've tried using "overrides" in my package.json — this doesn't work for monorepos.
  2. I've made a fork of viv with the updates and tried pointing to that github repo directly, I couldn't get this to build either.
  3. I've tried just referencing the subpackage of my fork for the loaders package using (https://gitpkg.vercel.app/), this is failing to import the geotiff from the github fork.

@ilan-gold
Copy link
Collaborator

ilan-gold commented Sep 14, 2023

As I see it there are two issues:

  1. There is a problem where doing things "too quickly" causes outright errors. This can be solved by using (from what I remember) https://github.com/ucscXena/geotiff.js/commits/ucsc-beta which should probably be merged with the main branch of geotiff. However the maintainer of that repo can be a bit spotty. I would not be opposed to moving on to a fork again but we would need to collectively test this out.
  2. Interactions that are too quick cause lags/delays/incorrect loading, but not errors. This likely comes from deck.gl. So if you want that to improve, I think that is the place to look into things.

Something you can do is use geotiff's Pool to do decoding on a worker thread if you are building your own application and can figure out the bundling (if I recall that is why it is disabled in Avivator). It should be documented in our API. As to how to do that.

What @acthp suggests in his last comment (C++ with precomputed tiles) sort of goes against what this repo does so if you really need that performance, then that might be the route to go. But if you're pre-computing compressed tiles, I suspect that our repo's performance will actually be fine since the computation cost of decompression will go down substantially as to not break anything.

@ilan-gold
Copy link
Collaborator

P.S @Dev-Lan if you're issue is just bundling a geotiff fork, we can try to help but will need a small reproducible example.

@Dev-Lan
Copy link

Dev-Lan commented Sep 14, 2023

Right now, I'm just trying to fix the first issue. I certainly would appreciate any advice if you have it regarding the bundling.

Of my three approaches described above, I think 1 is a dead-end. For both option 2 and 3 I am using a fork I created from VIV that references the ucsc geotiff fork.

Option 2: load viv monorepo from my fork
Steps:
1.git clone -b test-geotiff-fork-full https://github.com/visdesignlab/aardvark.git
2. yarn install
3. yarn run dev
4. ERROR:
✘ [ERROR] Failed to resolve entry for package "@hms-dbmi/viv". The package may have incorrect main/module/exports specified in its package.json. [plugin vite:dep-scan]

If I look at the viv package in node_modules it is not being built the same way as if I get it from npm. It is clearly not correct as is, but I'm not sure what is needed to get it to build correctly from github.

Option 3: get subpackage for loaders only.
Steps:

  1. git clone -b test-geotiff-fork https://github.com/visdesignlab/aardvark.git
  2. yarn install
  3. yarn run dev
  4. ERROR:
    ✘ [ERROR] Failed to resolve entry for package "geotiff". The package may have incorrect main/module/exports specified in its package.json. [plugin vite:dep-pre-bundle]

Let me know if any additional info would be helpful.

@acthp
Copy link
Author

acthp commented Sep 14, 2023

I don't currently have plans to continue with this. Regarding #2, in addition to deck.gl over-fetching, we also saw over-fetching due to inappropriate tile sizes. All of the images we've tested so far have required rebuilding due to various encoding errors, e.g. mixing up dimensions, botching the tiles, etc. This is pretty typical for research data, and it undermines the utility of being able to directly render from these formats. So, we're building conventional pyramids of png files during our image data ingest, and rendering with deck.gl. I will update this thread if I'm able to reduce the deck.gl over-fetching.

re: build, I had to fork viv to inject the patched geotiff. Diffs are here: master...ucscXena:viv:ucscxena

There's probably a better way, but I don't know what it is.

@Dev-Lan
Copy link

Dev-Lan commented Sep 14, 2023

re: build, I had to fork viv to inject the patched geotiff. Diffs are here: master...ucscXena:viv:ucscxena

Thanks for the link, this is a helpful reference! I'll give this another shot.

Edit: So far no luck :'(

@Dev-Lan
Copy link

Dev-Lan commented Sep 19, 2023

I ended up just making a local copy of just the loaders package from viv and updating it to use ucsc-xena-geotiff instead of geotiff.

This does help, but I can still trigger a "hard crash" of the webpage, where the page becomes unresponsive, and I need to close the tab.

@Dev-Lan
Copy link

Dev-Lan commented Sep 19, 2023

Alright, "hopefully," last update. I found something that appears to have resolved my issue. I think @ilan-gold's suggestion for using a pool finally clicked. Specifying a pool in the options when I call loadMultiTiff seems to fix the issue.

import { Pool } from 'geotiff';

...

const loader = await loadMultiTiff(
    ...,
    { pool: new Pool() }
);

@ilan-gold
Copy link
Collaborator

Yay!

@keller-mark
Copy link
Member

@ilan-gold is passing a pool value recommended? I notice we are not doing this in Vitessce - should we be?

@ilan-gold
Copy link
Collaborator

ilan-gold commented Sep 28, 2023

I think perhaps the reason for this was the workers on the heatmap but that is something of a moot point now since it almost certainly will not need to use workers anymore for spatial data (heatmaps for spatial data tend to be smaller).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants