Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add option to ProjectImageEntry.readImageData() to avoid accessing the file image #1488

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

carlocastoldi
Copy link
Contributor

@carlocastoldi carlocastoldi commented Mar 26, 2024

WHAT

This PR improves performance when running a script that does not need to access the image files on multiple images.
Additionally, allowed to modify ObjectClassifierCommand too so that it can read all detections' measurements in the training set without uselessly reading the image files.
This last change alone allowed, on my projects, to improve the time when creating an object classifier from ~10/15minutes to ~5seconds.
I feel like this change is useful when a laboratory works with large projects with image locations being on a remote server. This is also possible thanks to QuPath's amazing design choice to never directly modify images.
If the scripts being run wants to access the images' pixels, it gracefully halts the execution of the all the following project entries too.

example:

import qupath.imagej.tools.IJTools
import qupath.lib.images.PathImage
import ij.ImagePlus

var server = getCurrentServer()
var downsample = server.getDownsampleForResolution(Math.min(server.nResolutions()-1, 4))
PathImage<ImagePlus> pathImage = IJTools.convertToImagePlus(server, RegionRequest.createInstance(server, downsample))

"Run for project (without saving and opening)":

INFO: Starting script at Tue Mar 26 15:20:37 CET 2024
ERROR: The script tried to read pixels off an image while also requiring to run the script without accessing the image files.
WARN: Script cancelled with 53 image(s) remaining
INFO: Processed 54 images
INFO: Total processing time: 280 milliseconds

HOW

Essentially this works by creating a ImageServerStub that extends AbstractImageServer. It retrieves metadata from the ProjectImageEntry itself (which in turn, i think, it gets them from the .qpproj file) and fails when readRegion() is being called. Additionally, it does not provide a server builder. This way, if the resulting image data are to be saved, the original ImageServer won't be overwritten/lost.
You can now pass a openImage boolean to ProjectImageEntry.readImageData() that, when false, just avoids getting the default image server, but just uses an instance of ImageServerStub.

When running a ProjectTask, it will catch whether the script tried to access the image file. If it did, it stops the execution for the current image and all the following in the queue.

Minor proposal

Finally, i'd like to discuss whether we could initially run all scripts with ImageServerStub by default and, only if they fail because they need to read the image files, run them with the correct ImageServer.
This point, however, is not really important and can be addressed in a future PR as well.

@petebankhead
Copy link
Member

Thanks, this is an interesting idea and certainly seems to help in the scenario you describe.

I'm apprehensive about merging quickly because it would be quite a significant change, and add complexity when we're trying to reduce it. So it will take some thought and I'd like to understand the problem better.

This last change alone allowed, on my projects, to improve the time when creating an object classifier from ~10/15minutes to ~5seconds.

Can you explain why it takes so long? Huge numbers of images? Slow file format, or is it where the images are stored?

Additionally, allowed to modify ObjectClassifierCommand too so that it can read all detections' measurements in the training set without uselessly reading the image files.

The ObjectClassifier takes an ImageData by design because an object classifier could require pixel access... and this is very likely to be important in the future.

This is because, when I rewrote object classifiers some years ago, I was thinking of future classifiers that will use deep learning models to classify based upon image patches - and not only measurements. That's why there is also a general FeatureExtractor class.

This all basically works, we just haven't yet had time to wrap it up for wider use.

You can now pass a openImage boolean to ProjectImageEntry.readImageData() that, when false, just avoids getting the default image server, but just uses an instance of ImageServerStub.

While not identical, the current ProjectImageEntry.readHierarchy() is intended for when you need objects but not everything else. This already lets you access all measurements etc. without touching the image.

You can then create a new ImageData with a dummy ImageServer if you need to.

So an alternative approach might be to try to script creating a classifier without needing to go through the UI, in a way that doesn't involve any big API changes - and which can be used when you can know in advance that the image doesn't need to be accessed.

@carlocastoldi
Copy link
Contributor Author

Thank you for taking your time into understanding our problem and coming up with possible alternative solutions!

I absolutely get your notes on the classifier. I assumed that the ObjectClassifier would never need to read pixels because it currently only uses the detections' measurements. Surely the code must be changed in preparation of the advent of the FeatureExtractor. And for this, using ProjectImageEntry.readHierarchy() is probably the better option.

[...] creating a classifier without needing to go through the UI [...]

I am not sure about this, though. You often want to leverage the live-update feature when creating a classifier. That is one of the most handy feature when tweaking a classifier. If that option was removed, it would be unfortunate.


However i think you missed a point. The major issue this PR wants to address is the ability to a script in batch as fast as possible (and when it is possible)

an alternative approach [...] that doesn't involve any big API changes - and which can be used when you can know in advance that the image doesn't need to be accessed.

Just to be clear, this PR's only API change is adding in ProjectImageEntry.java:L195, where it adds a new public method readImageData(boolean) asking whether to read or not the image file. It also provides a default implementation readImageData() that always reads it, so that all previous code relied on this assumption don't break.
Furthermore, as you suggested the current approach can be used when you know in advance that the image doesn't need to be accessed:
image

Now, I agree this interface may not be the best one as it could easily just be a checkbox option in the ScriptEditor. For now, though, i think it is enough to enjoy the benefits that this PR brings.

This last change alone allowed, on my projects, to improve the time when creating an object classifier from ~10/15minutes to ~5seconds.

Can you explain why it takes so long? Huge numbers of images? Slow file format, or is it where the images are stored?

Basically All you the above. It accesses ~50 .czi images, weighting ~10GiB each. Leaving the the big size aside, it's also known that BioFormats themselves (required by .czi) are slow to work with.
Adding up to this scenario, images are often stored on a remote server (be it OMERO or just a sftp/samba server). This is due to the fact that it is hard for every member of a laboratory to have them stored on their own computer, as they all need to collaborate and, even if they wanted, they would soon fill up their local storage.

in this scenario, the fact that QuPath offer an highly-scriptable interface is many laboratories' luck. It means that a pipeline of scripts can be applied one after the other, resulting in a full analysis of the whole project. Sometimes, you may even want to tweak some parameters, and for this you have to re-run the whole pipeline in batch again.

In the end, I really get you are wary of quickly merging this. I'm sure reducing code is a top priority in order to offer the best bug-free experience to QuPath's wide user-base. A base experience that can be extended by plug-ins and scripts outside of QuPath's code/responsibility. However, I feel like the ability to run scripts in QuPath is at the core of its streghts. It allows to offer an extensible basis on which downstream developers can then build upon.
For this reason I ask you to please also consider future scenarios that this small PR opens up in batch script execution, and not only its implications with the object classifier.

Thank you!

@carlocastoldi
Copy link
Contributor Author

With the latest commit I also added an option in the CLI interface to run the script for the whole project without accessing the image files.

./gradlew run --args="script -p '/home/castoldi/426FC/project.qpproj' -n -c 'import qupath.imagej.tools.IJTools; println IJTools.convertToImagePlus(getCurrentServer(), RegionRequest.createInstance(getCurrentServer(), 16))'"

output:

> Task :qupath-app:run
14:08:12.506 [main] [INFO ] qupath.lib.gui.prefs.PathPrefs - Setting default Locale to en_US
14:08:12.507 [main] [INFO ] qupath.lib.gui.prefs.PathPrefs - Setting Locale for FORMAT to en_US
14:08:12.507 [main] [INFO ] qupath.lib.gui.prefs.PathPrefs - Setting Locale for DISPLAY to en_US
14:08:12.513 [main] [INFO ] qupath.lib.common.ThreadTools - Setting parallelism to 31
14:08:12.513 [main] [INFO ] qupath.ScriptCommand - Setting tile cache size to 8000.00 MB (25.0% max memory)
14:08:12.525 [main] [INFO ] qupath.lib.scripting.QP - Initializing type adapters
Warning: Versions of org.bytedeco:javacpp:1.5.9 and org.bytedeco:opencv:4.6.0-1.5.8 do not match.
Warning: Versions of org.bytedeco:javacpp:1.5.9 and org.bytedeco:openblas:0.3.21-1.5.8 do not match.
14:08:12.860 [main] [INFO ] qupath.ScriptCommand - Running script for SILVA_426.1 FC.czi - Scene #1 (0/48)
14:08:13.205 [main] [ERROR] qupath.ScriptCommand - The script tried to read pixels off an image while also requiring to run the script without accessing the image files.
[...]
14:08:24.832 [main] [INFO ] qupath.ScriptCommand - Running script for SILVA_426.6 FC.czi - Scene #8 (48/48)
14:08:25.124 [main] [ERROR] qupath.ScriptCommand - The script tried to read pixels off an image while also requiring to run the script without accessing the image files.

BUILD SUCCESSFUL in 18s
34 actionable tasks: 10 executed, 24 up-to-date

~/Projects/qupath light-script-runner* 19s
❯ 

petebankhead added a commit to petebankhead/qupath that referenced this pull request Mar 28, 2024
Rough draft at an alternative approach to address the issue raised in qupath#1488
@petebankhead
Copy link
Member

I think I do get the point, but want to ensure it's clear what exactly should be solved here, as I suspect there are alternative approached to consider.

For example, I quickly drafted a rough alternative at #1489

This simply delays loading images until the ImageServer is requested. It has the advantages of being simpler (no need for different 'Run' actions), doesn't introduce any new ImageServerStub class, and avoids failure if pixels are requested.

It probably has disadvantages too, as calling code needs to be more careful not to request the server at all (even for metadata), to avoid triggering the image to be loaded.

I'm not sure which is best, but we should go with the most maintainable solution that solves the main problem.

@carlocastoldi
Copy link
Contributor Author

I really like your draft. It seems the most beautiful, design-wise. The ImageServerStub solution in the end may just work as a helper for writing "good" fast scripts, exposing exactly where the image files are absolutely needed.

I made a small comment on your draft PR about the metadata, as i feel that is an important info to access offline. But in the end, if that was sorted out, that solution would be a drop-in replacement to mine. As you said, at last it will come down to which one is the most maintainable. I see pros and cons in both: ImageServerStub offers a solution that is segregated in one file, but then requires to punch multiple small holes in QP interface in order to use it; lazy-server distributes the code responsibility to multiple classes and requires to be careful in future development of QuPath so that it does not end up requesting for the server when it is not really useful. In the latter case it is due to the solution having a silent behaviour. However, since everything is managed internally in the lazy approach, in the future it may create less problems surging from punching holes in QuPath's interface.

Ultimately, I think your solution is better maintainable-wise, granted that a few things are managed:

  • have the retrieval of the image server be loud in logs. Perhaps even with some traceback to what portion of code triggered it?
  • expose a getCurrentMetadata() function to avoid having to do getServer().getMetadata()
  • check qupath code that requested for the server but may not need it.

…e file image

This improves performance when running a script on a project-wide as well as when creating an object classifier
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants