Slowness seen using PiecewiseAffineTransform compared to scikit-image version #698

Open
JHancox opened this issue Feb 8, 2024 · 2 comments


JHancox commented Feb 8, 2024

Describe the bug
The cucim.skimage.transform.PiecewiseAffineTransform seems to be several times slower than the scikit-image equivalent

Steps/Code to reproduce bug
When running the code below, I observe an 8x slowdown for the estimate operation and a 2x slowdown for the warp operation, using the PyTorch 24.01 container with cucim 23.12.

Expected behavior
The code should execute at least as fast as the CPU version.

Environment details (please complete the following information):
Docker on Ubuntu 22.04
PyTorch 24.01 container with scikit-image and cucim 23.12 pip installed

Additional context

```python
import matplotlib.pyplot as plt
from skimage.transform import PiecewiseAffineTransform, warp
from scipy.interpolate import LinearNDInterpolator
import numpy as np
from timeit import default_timer as timer
from cucim.skimage.transform import PiecewiseAffineTransform as cu_PAT
from cucim.skimage.transform import warp as cu_warp
import cupy as cp
   
# create some offsets and coordinates
vectors = np.array([[3.0,1.0],[-5.,-1.3],[-3.5,8.3],[0,0],[0,0],[0,0], [0,0]])
coords = np.array([[20,20],[180,50],[20, 180],[0,0],[0,255],[255,0], [255,255]])

# Create grid
step_size = 20
x = np.linspace(0, 255, num=step_size)
y = np.linspace(0, 255, num=step_size)
X, Y = np.meshgrid(x, y)

interpx = LinearNDInterpolator(list(coords), vectors[:,0])
Zxi = interpx(Y, X)

interpy = LinearNDInterpolator(list(coords), vectors[:,1])
Zyi = interpy(Y, X)

# create an array of coords
src = np.column_stack((X.reshape(-1), Y.reshape(-1)))

# add the interpolated offsets
dst_rows = X + Zxi
dst_cols = Y + Zyi

dst = np.column_stack([dst_cols.reshape(-1), dst_rows.reshape(-1)])

# compute transforms
tform = PiecewiseAffineTransform()

start = timer()
tform.estimate(src, dst)
print("cpu estimate took {}s".format(timer()-start))

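# NOTE: imgrid (the input image) is not defined in this snippet; per the
# discussion below it is a 256 x 256 image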
start = timer()
out = warp(imgrid, tform, output_shape=(255, 255))
print("cpu warp took {}s".format(timer()-start))

# repeat using cupy/cucim.skimage
cu_tform = cu_PAT()
start = timer()
cu_tform.estimate(cp.array(src), cp.array(dst))
print("gpu estimate took {}s".format(timer()-start))

start = timer()
out = cu_warp(cp.array(imgrid), cu_tform, output_shape=(255, 255))
print("gpu warp took {}s".format(timer()-start))
```

JHancox added the bug label on Feb 8, 2024
JHancox changed the title from "Slowness seen using PiecewiseAffineTransform compared to scipy version" to "Slowness seen using PiecewiseAffineTransform compared to scikit-image version" on Feb 8, 2024
grlee77 (Contributor) commented Feb 14, 2024

Hi @JHancox, thanks for reporting this. Can you specify the shape and dtype of imgrid?

Unfortunately, PiecewiseAffineTransform is an outlier in cuCIM in that it does not currently have a proper GPU implementation and will be faster on the CPU. We should consider printing a warning at runtime and adding a note to this effect in the docstring, or removing it from the library. It currently has to copy to the CPU to run scipy.spatial.Delaunay, for which CuPy does not have a GPU implementation.

warp should be faster on the GPU if the image is sufficiently large, but in this case, with inverse_map being a PiecewiseAffineTransform callable rather than a cupy.ndarray, it will be slow for that reason.

In general, for warp, if you are able to supply inverse_map as a cupy.ndarray instead of a callable and the image is not too small, the GPU should be faster. A quick rule of thumb is that the CPU is expected to be faster when the image is very small, e.g. (256, 256), especially if it fits within the CPU's L1 cache. At medium sizes such as (512, 512) or (1024, 1024) the GPU should start to pull ahead, and above several MB in size it should be much faster. It is also beneficial to make the input single precision to avoid the GPU's relatively slow double-precision arithmetic.
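
For example, since the Delaunay step runs on the CPU anyway, one option is to precompute the dense coordinate map on the CPU with skimage.transform.warp_coords and pass it to cuCIM's warp as a CuPy array. A minimal sketch, assuming the tform and imgrid from the snippet above:

```python
import cupy as cp
import numpy as np
from skimage.transform import warp_coords
from cucim.skimage.transform import warp as cu_warp

# dense (2, rows, cols) map from each output pixel to its source coordinate,
# computed once on the CPU with scikit-image
coord_map = warp_coords(tform, (255, 255)).astype(np.float32)

# keep both the image and the coordinate map in single precision on the GPU
out = cu_warp(cp.asarray(imgrid, dtype=cp.float32), cp.asarray(coord_map))
```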

I don't doubt that the GPU is slower here, but I wanted to mention that using timer for the comparison has a couple of potential pitfalls to be aware of:

  • GPU times will be much slower the first time a function is called because any kernels get compiled and cached (fortunately this .cubin cache persists on disk across program runs, so this is a one-time cost).
  • GPU times can be misleadingly short in cases where synchronization has not been performed, so it is best to explicitly call cupy.cuda.Device().synchronize() before checking the final time to make sure the kernels have completed (a minimal sketch of this pattern is shown just below).
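
For example, a minimal manual-timing sketch (assuming the cu_tform and imgrid from the snippet above) that does one warm-up call and synchronizes before reading the timer:

```python
import cupy as cp
from timeit import default_timer as timer
from cucim.skimage.transform import warp as cu_warp

cu_img = cp.asarray(imgrid)

# warm-up call so that kernel compilation is not included in the measurement
cu_warp(cu_img, cu_tform, output_shape=(255, 255))
cp.cuda.Device().synchronize()

start = timer()
out = cu_warp(cu_img, cu_tform, output_shape=(255, 255))
cp.cuda.Device().synchronize()  # wait for queued kernels before stopping the clock
print("gpu warp took {}s".format(timer() - start))
```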

To handle the above issues automatically, CuPy provides a benchmark timing utility that can be used like this:

```python
from cupyx.profiler import benchmark

perf_cpu = benchmark(
    warp,
    args=(imgrid, tform),
    kwargs=dict(output_shape=(255, 255)),
    n_warmup=10,
    n_repeat=10000,
    max_duration=5)  # cap at 5 seconds duration
print(f"warp: avg CPU time = {perf_cpu.cpu_times.mean()}")


cu_imgrid = cp.array(imgrid)

perf_gpu = benchmark(
    cu_warp,
    args=(cu_imgrid, cu_tform),
    kwargs=dict(output_shape=(255, 255)),
    n_warmup=10,
    n_repeat=10000,
    max_duration=5)  # cap at 5 seconds duration
print(f"warp: avg GPU time = {perf_gpu.gpu_times.mean()}")
```

JHancox (Author) commented Feb 16, 2024

Thanks for the details @grlee77. In this case the image was 256 x 256, but I will try larger images and see what happens. Thanks for the tip on the timing - you are quite right. Often there is some implicit memory synchronization involved anyway, but I should be explicit about it.
