
Real-Time deskew visualization #11

Open
VolkerH opened this issue Feb 10, 2019 · 15 comments

@VolkerH
Owner

VolkerH commented Feb 10, 2019

Opening this issue to continue the discussion with @dmilkie from here:
hammerlab/flowdec#12

I don't think we'll get the deconv running in real time; even with flowdec it still takes a few seconds per stack on our HPC node with a Tesla K80.

I have only recently gotten the Janelia cudaDeconv binary to run on our HPC cluster, but haven't used it much. I never got it to work with LLSpy. Not having access to the source code made me look into alternatives for the deconv, and I found flowdec. I haven't benchmarked it, but it doesn't feel like it is orders of magnitude faster than cudaDeconv.

Performing the deconvolution directly on the raw, skewed data (with a correspondingly skewed PSF) seems to bring a better speed improvement for me, as no deconvolution is performed on empty fill voxels.

I discussed some ideas for near-live deskewed volume visualization with David, who runs our LLS.
One idea was to help users orient themselves and navigate around a larger sample by stitching together any volumes already captured and visualizing them alongside the current position. This would require having the stage position in the metadata, though, and as far as I could see it is not currently recorded (at least in the version of the Labview software that we are using). It was on my TODO list to look at the Labview code to see whether I can add this bit (I'm not very good with Labview though and need to get a license first - currently we are using the runtime only).
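As a rough illustration of the stitching idea (paste downsampled copies of already-captured stacks into a common overview at offsets derived from their stage positions), here is a sketch with placeholder names and pixel sizes, since the real stage metadata isn't available yet:

import numpy as np


def build_overview(stacks, stage_positions_um, px_um=0.104, downsample=4):
    # stacks: list of 3D arrays (z, y, x); stage_positions_um: list of (y, x) offsets in µm,
    # already shifted so that all offsets are non-negative.
    small = [s[::downsample, ::downsample, ::downsample].max(axis=0) for s in stacks]
    step = px_um * downsample
    offsets = [(int(round(y / step)), int(round(x / step))) for y, x in stage_positions_um]
    height = max(oy + s.shape[0] for (oy, _), s in zip(offsets, small))
    width = max(ox + s.shape[1] for (_, ox), s in zip(offsets, small))
    canvas = np.zeros((height, width), dtype=np.float32)
    for (oy, ox), s in zip(offsets, small):
        view = canvas[oy:oy + s.shape[0], ox:ox + s.shape[1]]
        np.maximum(view, s, out=view)  # blend overlapping tiles with a max
    return canvas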

@dmilkie

dmilkie commented Feb 11, 2019 via email

@VolkerH
Owner Author

VolkerH commented Feb 11, 2019

Thanks for those comments.

Incidentally, we ordered a workstation with an RTX 2080Ti just before Xmas, and it was delivered before I left to travel overseas. With some luck I will be able to benchmark the code against the K80 on our HPC in three weeks. I am expecting a performance boost, but if it turns out to be as dramatic as you suggest, our users will be even happier.

Regarding the stage coordinates: it somehow hadn't occurred to me to look for tags in the tiffs; I only checked the settings file, as most of the other metadata seemed to be there.

For the live visualisation, a first step would be to directly save a multi-scale representation for each tiff file - either just a pyramid with different sampling levels or something like the Big Data Viewer format. With Big Data Viewer, each dataset can have an affine transformation (which can be computed from the stage coordinates and the other known parameters). I'm just not sure how feasible it is to do this at frame rate and with Labview. Maybe have Labview write to a RAM disk and then have a different process deal with the files on the RAM disk, e.g. creating multi-resolution representations and maybe putting them directly into the texture memory of the GPU. I haven't given this too much thought yet and I definitely lack experience with this.
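As a very rough sketch of such a per-stack affine (the deskew shear plus a translation from the stage position), with all parameter values as placeholders rather than anything read from our metadata, and assuming an isotropic output grid of the lateral pixel size for simplicity:

import numpy as np


def stack_affine(dx_um=0.104, dz_um=0.4, angle_deg=31.8, stage_offset_um=(0.0, 0.0, 0.0)):
    # Standard LLS deskew: each z-plane shifts along x by dz * cos(angle) / dx pixels.
    shear = dz_um * np.cos(np.deg2rad(angle_deg)) / dx_um
    affine = np.eye(4)                                    # homogeneous 4x4, (x, y, z) convention
    affine[0, 2] = shear                                  # couple x to z: the deskew shear
    affine[:3, 3] = np.asarray(stage_offset_um) / dx_um   # place the stack by its stage offset (pixels)
    return affine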
For volume rendering, Spimagine may be an option. One would have to add support for multi-resolution representations though, and I'm not sure how much effort that is. We will have the spimagine developer visiting next month, so I can hopefully get a better idea of what is actually feasible.

First steps will be to finish the batch processing code that implements the routines I outlined in the two notebooks. Comments and suggestions are very welcome.

@VolkerH
Owner Author

VolkerH commented Feb 27, 2019

Hi @dmilkie,

I finally got to set up my python environment on our new workstation with the RTX 2080Ti.

The deconvolution of the test volume in this Jupyter notebook

https://github.com/VolkerH/Lattice_Lightsheet_Deskew_Deconv/blob/master/Python/01_Lattice_Light_Sheet_Deconvolution.ipynb

took around 0.2 s for 10 RL iterations. So it is in the same ballpark as the speed that @eric-czech reported for the GTX 1080Ti and roughly an order of magnitude faster than on the older GPUs we have on our HPC cluster.

So real-time deconvolution does indeed seem feasible thanks to flowdec. Exciting.
The VRAM on the card is potentially large enough to deconvolve several stacks in parallel.
A naive approach to try this would be to tile several volumes into a larger volume (boundary treatment would become rather tricky). The drawback is that the PSF would have to be padded to the larger volume. @eric-czech: have you already done any work in terms of parallelization on a single GPU?

@VolkerH
Owner Author

VolkerH commented Feb 27, 2019

Just realized that my idea in the previous comment about speeding things up by putting several volumes into the VRAM doesn't make sense, as I forgot that the processing time also scales with the number of voxels. Never mind.

@eric-czech

I had not, but then again the GPU utilization I see is usually pretty close to 100% continuously during deconvolution -- though I haven't checked that on a small enough time scale to really know how much of an opportunity there is to interleave other operations from a different process using the same GPU.

Also if it helps, the way I normally process with multiple GPUs (well 2 anyways) is to pick any Python parallelization backend that creates separate processes (not threads as they don't work with TF) and then configure sessions for each worker process like this:

import tensorflow as tf  # TF 1.x API

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=1.)
config = tf.ConfigProto(log_device_placement=True, gpu_options=gpu_options)
# IIRC, it is important to set these properties AFTER creating the ConfigProto or they get overwritten
config.gpu_options.allow_growth = True
# This would be set differently based on the current process/worker
config.gpu_options.visible_device_list = "0" # (or "1" for GPU2, "2" for GPU3, etc.)
# Given an acquisition "acq" and a flowdec deconvolver "algo" defined somewhere
res = algo.run(acq, niter=10, session_config=config)

I'm always doing that with completely different images, but I suppose it could be made to work with tilings of a single volume too, if you added some margins to the tiles to help avoid the boundary artifacts.
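To make that concrete, here is an untested sketch of the one-process-per-GPU pattern, assuming flowdec's RichardsonLucyDeconvolver/Acquisition API roughly as in its README, TF 1.x, and placeholder file paths:

import multiprocessing as mp

from tifffile import imread, imwrite
import tensorflow as tf                      # TF 1.x API
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration


def worker(job):
    gpu_id, in_path, psf_path, out_path = job
    # Pin this worker to one GPU via the session config (same idea as above).
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(gpu_id)
    algo = fd_restoration.RichardsonLucyDeconvolver(3).initialize()   # 3D stacks
    acq = fd_data.Acquisition(data=imread(in_path), kernel=imread(psf_path))
    res = algo.run(acq, niter=10, session_config=config)
    imwrite(out_path, res.data)


if __name__ == "__main__":
    # Placeholder job list: (GPU index, raw stack, PSF, output path)
    jobs = [
        (0, "stack_000.tif", "psf.tif", "decon_000.tif"),
        (1, "stack_001.tif", "psf.tif", "decon_001.tif"),
    ]
    # "spawn" keeps each worker's CUDA context independent of the parent process.
    with mp.get_context("spawn").Pool(processes=2) as pool:
        pool.map(worker, jobs)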

@dmilkie

dmilkie commented Feb 27, 2019

I just used the test data (deskewed) from the LLS dropbox (2016_02_19_example_decon_deskew_data). The cudaDecon.exe in the LLS dropbox completes 15 iterations in about the same time as the flowdec code does 15 iterations on my P6000. Would you mind benchmarking that exe on your hardware?

Everything we are testing here is using cuFFT for the convolutions. Does anybody think that cuDNN convolutions would be faster? The optimized Tensor Core hardware seems to operate only at CUDNN_DATA_HALF, so we might not be able to take advantage of that (but maybe it would be useful for the first several iterations, where we just need to get close?).

There was a comparison done using 2013 hardware (i.e. a K40c) that made it seem like cuDNN only wins in a narrow regime: "for images larger than 200 x 200 and kernels smaller than 15 x 15, the cuDNN convolution library can beat an FFT-based convolution."
http://ska-sdp.org/sites/default/files/attachments/nvidia-sdp-directconvolution_0.pdf

@VolkerH
Owner Author

VolkerH commented Feb 28, 2019

@eric-czech
Thanks for that information. Currently, we have a single GPU only. I agree, GPU utilization seems to be close to 100% during deconvolution.
My batch processing doesn't utilize the GPU to 100% yet, probably mainly due to I/O. I guess I can parallelize that (read the next stack while the previous one is being deconvolved on the GPU) and get closer to 100% utilization overall for batch processing.
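Something along these lines could be one way to overlap the reading with the GPU work (just a sketch; deconvolve stands in for whatever flowdec-based call processes one stack):

import queue
import threading

from tifffile import imread


def reader(paths, q):
    for path in paths:
        q.put((path, imread(path)))   # file I/O happens here, off the GPU loop
    q.put(None)                       # sentinel: no more stacks


def run_pipeline(paths, deconvolve, queue_size=2):
    q = queue.Queue(maxsize=queue_size)
    t = threading.Thread(target=reader, args=(paths, q), daemon=True)
    t.start()
    results = []
    while True:
        item = q.get()
        if item is None:
            break
        path, img = item
        results.append((path, deconvolve(img)))   # GPU-bound step runs while the reader prefetches
    t.join()
    return results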

@dmilkie
Unfortunately, I cannot do a comparative benchmark as I cannot run cudaDecon on our new machine with the RTX 2080Ti. (Note that this is not the same workstation that Tahnee is using; I only learned a few days ago that her PI's lab also recently purchased a similarly specced workstation. She mentioned you had to compile a specific version for her, as the old binaries are not compatible with the new Nvidia architecture.) Also, our new machine at the facility is running Ubuntu 18.04 while Tahnee is running Windows.

From her comments it appears, though, that she is seeing very comparable speed from CudaDeconv to what I see with flowdec. She also mentioned that she sees the throughput go up to about 3-4 stacks processed per second if she runs several CudaDeconv processes in parallel. I guess the parallel processes are simply helping with the I/O to keep the GPU utilized for deconvolution at close to 100%. I don't know the exact stack sizes and number of RL iterations she was referring to, but I suspect they would be quite similar to the dataset for which I benchmarked the deconvolution with flowdec at 0.2 s.

While these are not accurate benchmarks, they indicate that the speed of cudaDeconv and flowdec is in a similar ballpark. This is also consistent with my impressions on the K80-equipped HPC nodes (for which I managed to get cudaDeconv running a few months back) and with your impressions for the P6000.

I can't really comment on cuFFT vs cuDNN as I was simply looking for high-level libraries that utilize the GPU when I started out on this. The primary goal for me was to have a GPU-powered solution at all as the binary distribution model for cudaDeconv with its NDA (or Research License Agreement) doesn't empower me to troubleshoot or modify anything.

@dmilkie

dmilkie commented Feb 28, 2019

I'm making the C++ source code for the CudaDeconv public:
https://github.com/dmilkie/cudaDecon

I can compile it for Windows. It had at one time compiled for Linux, but there are sure to be some errors by now.

This implementation follows the accelerated RL that Matlab uses. The C++ looks pretty efficient when I look at it with the Nvidia Visual Profiler.
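For anyone following along, here is a rough numpy sketch of one plain (non-accelerated) Richardson-Lucy scheme built on FFT-based convolutions, i.e. the cuFFT-style approach discussed above; numpy stands in for the GPU libraries, and the PSF is assumed to be already padded to the image shape, centred, and normalised to sum 1:

import numpy as np


def richardson_lucy(image, psf, niter=10):
    # psf: same shape as image, centred, sum == 1
    otf = np.fft.rfftn(np.fft.ifftshift(psf))
    estimate = np.full(image.shape, image.mean(), dtype=np.float64)
    for _ in range(niter):
        blurred = np.fft.irfftn(np.fft.rfftn(estimate) * otf, s=image.shape)
        ratio = image / np.maximum(blurred, 1e-12)
        # Correlation with the PSF (conjugate OTF) distributes the correction back.
        correction = np.fft.irfftn(np.fft.rfftn(ratio) * np.conj(otf), s=image.shape)
        estimate *= correction
    return estimate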

@VolkerH
Owner Author

VolkerH commented Feb 28, 2019

Thank you very much for making the CudaDeconv code public. This is excellent.
I probably won't get to try and compile it for Linux straight away, but it is good to have the option and to be able to look under the hood.

@VolkerH
Owner Author

VolkerH commented May 9, 2019

@dmilkie
Hi,
just wanted to pick up on this discussion. You mentioned that the Tiff Tag numbers from 40000 onwards in the files generated by the Janelia Labview code should hold the stage positions.

We are running this version of the code (if the information in the settings file is correct)

Version :       v 4.02893.0012 Built on : 3/21/2016 12:20:26 PM, rev 2893  

There do not seem to be any tags in the 40000 range. There is one private tag, 32781, that is very long:
[screenshot of tag 32781 in a TIFF tag viewer]

I cannot extract the information from there using AsTiffTagViewer (it won't let me copy and paste), and trying to get at this with the python tifffile package tells me that tag 32781 is malformed and only returns the first few bytes of the tag (I assume up until the first 0-byte). Maybe the following bytes should correspond to other tags but can't be recognised as such?

Before I go digging further with a hex editor, I just wanted to ask: should I expect to see the stage position with the software version that we're using? Maybe this was added in a more recent build?

@cgohlke

cgohlke commented May 9, 2019

the python tifffile package tells me that the tag 32781 is malformed

The value of tag 32781 is not a null terminated 7-bit ASCII string but a sequence of 8-bit BYTEs. Current versions of tifffile will detect this, coerce the tag value to bytes, and emit a warning.

@VolkerH
Owner Author

VolkerH commented May 9, 2019

@cgohlke, thanks for commenting. Indeed, I was writing the comment about tifffile from memory.
I can get the whole tag when I use tag.value; I had previously only done print(tag), which gives an abbreviated version of the value in the summary via tag.__str__().

So I can now get the whole value of tag 32781, as per this jupyter notebook (towards the end):
https://github.com/VolkerH/PythonSnippets/blob/master/metadata_tifffile/tifffile%20metatada%20experiments.ipynb

I've been looking through this, and while it seems to contain lots of metadata, I don't see anything referring to the stage position (but I might have missed it).
So I'm not seeing what @dmilkie mentioned here:

There should be a set of three private tiff tags in the image metadata. Off the top of my head I think tag numbers 40000, 40001, 40002 will have the stage coordinates,

Maybe we simply have a different version of the Labview code?
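For reference, a check for those tag codes could look something like this (a sketch; the file name is a placeholder, and it assumes a recent tifffile where tags can be looked up by numeric code):

from tifffile import TiffFile

with TiffFile("raw_stack.tif") as tif:       # placeholder file name
    tags = tif.pages[0].tags
    print(len(tags[32781].value))            # the long private tag, coerced to bytes
    for code in (40000, 40001, 40002):
        tag = tags.get(code)
        print(code, "missing" if tag is None else tag.value)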

@dmilkie

dmilkie commented May 9, 2019 via email

@dmilkie

dmilkie commented May 9, 2019

The Tiff Tag viewer should show the tag.
[screenshot of the Tiff Tag viewer showing the tag]
Are you looking at the raw data? Any processing (decon, load+save using Matlab, etc.) may inadvertently strip off that info. I remember it was a pain in the butt, and it took some ugly hacks just to get CImg to write that tag for me in my decon software.

These tags were added to the .tiff files in rev. 2733, so your version (2893) should be able to write them. Off the top of my head I can't think of a reason why those special tags wouldn't get written, unless maybe the stage was disabled.

@VolkerH
Owner Author

VolkerH commented May 9, 2019

Thanks for your reply.

These are raw images from the microscope, without any processing/resaving.
I just checked some of the most recently acquired images and I don't see those tags.

I will check with my colleague, but I am fairly sure he was running the .exe. Maybe some of the components that are bundled in the .exe don't correspond to the version it claims to be.
We should be able to run the VIs from source though, as we recently also obtained a Labview license for the acquisition machine. I am not in the lab this week, but I will consult with my colleague and see whether I can find something in the sources.
