
[WIP/RFC]: Refactor CUDA code out of cuCIM #464

Draft · wants to merge 3 commits into base: branch-23.02
Conversation

jakirkham (Member) commented Dec 6, 2022

(Tries to) address #375

This is a very rough attempt at starting to refactor CUDA code out of Python strings and into CUDA files. It starts with just one function to see how this goes and to work through any issues that come up. If the approach proves reasonable and viable, it could be expanded upon.

@jakirkham added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels on Dec 6, 2022
result[0] = myy / mu0;
result[1] = result[2] = -mxy / mu0;
result[3] = mxx / mu0;
inertia_tensor_2x2(&mu[0], &result[0]);
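For reference, the kernel snippet above assembles the 2x2 inertia tensor from central moments. A hedged NumPy sketch of the same arithmetic follows; the moment layout (mu[i, j] as the (i, j) central moment, with myy = mu[0, 2], mxy = mu[1, 1], mxx = mu[2, 0], mu0 = mu[0, 0]) is an assumption based on scikit-image conventions, not something stated in the diff:

```python
import numpy as np

def inertia_tensor_2x2(mu):
    # Mirrors the kernel arithmetic: result = [[myy, -mxy], [-mxy, mxx]] / mu0.
    # The mu[i, j] index convention here is assumed, not taken from the PR.
    mu0, mxx, myy, mxy = mu[0, 0], mu[2, 0], mu[0, 2], mu[1, 1]
    return np.array([[myy, -mxy], [-mxy, mxx]]) / mu0

# Usage: central moments of a tiny image, computed directly from the definition.
image = np.array([[0.0, 1.0, 2.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 0.0]])
r, c = np.mgrid[0:3, 0:3].astype(float)
rbar = (image * r).sum() / image.sum()
cbar = (image * c).sum() / image.sum()
mu = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        mu[i, j] = (image * (r - rbar) ** i * (c - cbar) ** j).sum()
T = inertia_tensor_2x2(mu)
```

For a nonnegative image, the resulting tensor is symmetric and positive semidefinite (the off-diagonal bound follows from Cauchy–Schwarz with the pixel intensities as weights).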
jakirkham (Member Author) commented:
From what I can tell, CuPy actually passes CArray objects into the kernel. These can be indexed, though, which in the C-contiguous case indexes directly into the data_ pointer. We leverage this here to back out the pointer to the underlying array data, which the kernel then takes for normal operation.

Contributor commented:

As long as we make sure to check that the inputs are C-contiguous, this seems okay.

jakirkham (Member Author) commented Dec 6, 2022:

There is a fallback for the non-contiguous case.

That said, the C code effectively assumes the data is contiguous (really C-contiguous). We could require that before calling the kernel. Alternatively, we could rewrite the kernel so it is less reliant on this requirement. Though I'm not sure how often more complex cases are needed, given this seems to be an internal function. Please feel free to correct me if I'm missing something.
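A contiguity guard of the kind discussed here could look like the following. This is a hedged NumPy sketch (the actual code operates on CuPy arrays, and the helper name is invented for illustration); CuPy arrays expose the same `flags.c_contiguous` attribute and `ascontiguousarray` function:

```python
import numpy as np

def as_c_contiguous(arr):
    # Hypothetical guard: ensure the kernel only ever sees C-contiguous data,
    # copying (the "fallback") when the input is not already contiguous.
    if not arr.flags.c_contiguous:
        arr = np.ascontiguousarray(arr)
    return arr

x = np.arange(12).reshape(3, 4)
view = x[:, ::2]              # non-contiguous strided view
safe = as_c_contiguous(view)  # copied into contiguous storage
same = as_c_contiguous(x)     # already contiguous: returned unchanged
```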

@@ -519,20 +520,19 @@ def centroid(image, *, spacing=None):


def _get_inertia_tensor_2x2_kernel():
Contributor commented:
Suggested change:

-    def _get_inertia_tensor_2x2_kernel():
+    @cp.memoize(for_each_device=True)
+    def _get_inertia_tensor_2x2_kernel():

jakirkham (Member Author) commented:
Would we still want to do this if we get #include to work (thus not needing to read the file)?

jakirkham (Member Author) commented:
We are now using #include. Do we still want this or should we skip it?

Contributor commented:
I would be okay with removing it, given that having it should now save less than a microsecond. No strong opinion either way in this case.
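The point of `cp.memoize(for_each_device=True)` in this discussion is that the kernel object is built once per device and reused on later calls. A rough CPU-only sketch of that pattern (using `functools.lru_cache`, with a module-level `current_device` standing in for CuPy's device tracking; all names here are invented for illustration):

```python
import functools

current_device = 0  # stand-in for cupy.cuda.Device().id
calls = []          # records each "compile" so we can see cache behavior

def memoize_for_each_device(func):
    # Cache keyed on (device id, arguments), mimicking for_each_device=True.
    @functools.lru_cache(maxsize=None)
    def cached(device_id, args):
        return func(*args)

    @functools.wraps(func)
    def wrapper(*args):
        return cached(current_device, args)
    return wrapper

@memoize_for_each_device
def get_kernel(name):
    calls.append(name)  # an expensive kernel compile would happen here
    return f"kernel:{name}@dev{current_device}"

k1 = get_kernel("det22")
k2 = get_kernel("det22")  # cache hit: the body does not run again
```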

grlee77 (Contributor) commented Dec 6, 2022:

This seems reasonable for quite a few of the kernels we have. Would you want to use this approach for even more trivial cases like the following two?

@cp.memoize(for_each_device=True)
def _get_real_symmetric_2x2_det_kernel():
    return cp.ElementwiseKernel(
        in_params="F M00, F M01, F M11",
        out_params="F det",
        operation="det = M00 * M11 - M01 * M01;",
        name="cucim_skimage_symmetric_det22_kernel")


@cp.memoize(for_each_device=True)
def _get_real_symmetric_3x3_det_kernel():
    operation = """
    det = M00 * (M11 * M22 - M12 * M12);
    det -= M01 * (M01 * M22 - M12 * M02);
    det += M02 * (M01 * M12 - M11 * M02);
    """
    return cp.ElementwiseKernel(
        in_params="F M00, F M01, F M02, F M11, F M12, F M22",
        out_params="F det",
        operation=operation,
        name="cucim_skimage_symmetric_det33_kernel")
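As a CPU-side sanity check, the 3x3 kernel body above is just the cofactor expansion of a symmetric determinant; the same arithmetic can be verified against `np.linalg.det`:

```python
import numpy as np

def symmetric_det_3x3(M00, M01, M02, M11, M12, M22):
    # Same cofactor expansion as the kernel's operation string.
    det = M00 * (M11 * M22 - M12 * M12)
    det -= M01 * (M01 * M22 - M12 * M02)
    det += M02 * (M01 * M12 - M11 * M02)
    return det

# Usage: pass the upper triangle of a symmetric matrix.
A = np.array([[2.0, 1.0, 0.5],
              [1.0, 3.0, 1.0],
              [0.5, 1.0, 4.0]])
d = symmetric_det_3x3(A[0, 0], A[0, 1], A[0, 2], A[1, 1], A[1, 2], A[2, 2])
```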

A slightly more involved example would be the following where there are two parameters that affect the generated code. The first part of the kernel is always the same, but the final portion can sort the eigenvalues in ascending or descending value, optionally via absolute values.

@cp.memoize(for_each_device=True)
def _get_real_symmetric_3x3_eigvals_kernel(sort='ascending', abs_sort=False):
    operation = """
    double x1, x2, phi;
    double a = static_cast<double>(aa);
    double b = static_cast<double>(bb);
    double c = static_cast<double>(cc);
    double d = static_cast<double>(dd);
    double e = static_cast<double>(ee);
    double f = static_cast<double>(ff);
    double d_sq = d * d;
    double e_sq = e * e;
    double f_sq = f * f;
    double tmpa = (2*a - b - c);
    double tmpb = (2*b - a - c);
    double tmpc = (2*c - a - b);
    x2 = - tmpa * tmpb * tmpc;
    x2 += 9 * (tmpc*d_sq + tmpb*f_sq + tmpa*e_sq);
    x2 -= 54 * (d * e * f);
    x1 = a*a + b*b + c*c - a*b - a*c - b*c + 3 * (d_sq + e_sq + f_sq);
    // grlee77: added max() here for numerical stability
    // (avoid NaN values in ridge filter test cases)
    x1 = max(x1, 0.0);
    if (x2 == 0.0) {
        phi = M_PI / 2.0;
    } else {
        // grlee77: added max() here for numerical stability
        // (avoid NaN values in test_hessian_matrix_eigvals_3d)
        double arg = max(4*x1*x1*x1 - x2*x2, 0.0);
        phi = atan(sqrt(arg)/x2);
        if (x2 < 0) {
            phi += M_PI;
        }
    }
    double x1_term = (2.0 / 3.0) * sqrt(x1);
    double abc = (a + b + c) / 3.0;
    lam1 = abc - x1_term * cos(phi / 3.0);
    lam2 = abc + x1_term * cos((phi - M_PI) / 3.0);
    lam3 = abc + x1_term * cos((phi + M_PI) / 3.0);
    """
    sort_template = """
    F stmp;
    if ({prefix}{var1} > {prefix}{var2}) {{
        stmp = {var2};
        {var2} = {var1};
        {var1} = stmp;
    }} if ({prefix}{var1} > {prefix}{var3}) {{
        stmp = {var3};
        {var3} = {var1};
        {var1} = stmp;
    }} if ({prefix}{var2} > {prefix}{var3}) {{
        stmp = {var3};
        {var3} = {var2};
        {var2} = stmp;
    }}
    """
    if abs_sort:
        operation += """
        F abs_lam1 = abs(lam1);
        F abs_lam2 = abs(lam2);
        F abs_lam3 = abs(lam3);
        """
        prefix = "abs_"
    else:
        prefix = ""
    if sort == 'ascending':
        var1 = "lam1"
        var3 = "lam3"
    elif sort == 'descending':
        var1 = "lam3"
        var3 = "lam1"
    operation += sort_template.format(
        prefix=prefix, var1=var1, var2="lam2", var3=var3
    )
    return cp.ElementwiseKernel(
        in_params="F aa, F bb, F cc, F dd, F ee, F ff",
        out_params="F lam1, F lam2, F lam3",
        operation=operation,
        name="cucim_skimage_symmetric_eig33_kernel")
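The closed-form trigonometric solution in the kernel above can be checked on the CPU against `np.linalg.eigvalsh`. The sketch below is a direct transcription of the operation string; the parameter mapping (aa, bb, cc as the diagonal and dd, ee, ff as A[0,1], A[1,2], A[0,2]) is inferred from the formula rather than stated in the thread:

```python
import math
import numpy as np

def symmetric_eigvals_3x3(a, b, c, d, e, f):
    # Transcription of the kernel's operation string (before sorting).
    d_sq, e_sq, f_sq = d * d, e * e, f * f
    tmpa, tmpb, tmpc = 2*a - b - c, 2*b - a - c, 2*c - a - b
    x2 = -tmpa * tmpb * tmpc
    x2 += 9 * (tmpc * d_sq + tmpb * f_sq + tmpa * e_sq)
    x2 -= 54 * d * e * f
    x1 = a*a + b*b + c*c - a*b - a*c - b*c + 3 * (d_sq + e_sq + f_sq)
    x1 = max(x1, 0.0)  # numerical-stability clamp, as in the kernel
    if x2 == 0.0:
        phi = math.pi / 2.0
    else:
        arg = max(4 * x1**3 - x2 * x2, 0.0)  # second stability clamp
        phi = math.atan(math.sqrt(arg) / x2)
        if x2 < 0:
            phi += math.pi
    x1_term = (2.0 / 3.0) * math.sqrt(x1)
    abc = (a + b + c) / 3.0
    lam1 = abc - x1_term * math.cos(phi / 3.0)
    lam2 = abc + x1_term * math.cos((phi - math.pi) / 3.0)
    lam3 = abc + x1_term * math.cos((phi + math.pi) / 3.0)
    return lam1, lam2, lam3

# Usage: eigenvalues of a symmetric 3x3, compared with LAPACK below.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
lams = symmetric_eigvals_3x3(A[0, 0], A[1, 1], A[2, 2],
                             A[0, 1], A[1, 2], A[0, 2])
```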

The most complicated are in the skimage/_vendored folder, such as this one used by transforms like rotate, warp, resize, etc.:

def _generate_interp_custom(coord_func, ndim, large_int, yshape, mode, cval,
                            order, name='', integer_output=False, nprepad=0,
                            omit_in_coord=False):
    """
    Args:
        coord_func (function): generates code to do the coordinate
            transformation. See for example, `_get_coord_shift`.
        ndim (int): The number of dimensions.
        large_int (bool): If true use Py_ssize_t instead of int for indexing.
        yshape (tuple): Shape of the output array.
        mode (str): Signal extension mode to use at the array boundaries
        cval (float): constant value used when `mode == 'constant'`.
        name (str): base name for the interpolation kernel
        integer_output (bool): boolean indicating whether the output has an
            integer type.
        nprepad (int): integer indicating the amount of prepadding at the
            boundaries.

    Returns:
        operation (str): code body for the ElementwiseKernel
        name (str): name for the ElementwiseKernel
    """
    ops = []
    internal_dtype = 'double' if integer_output else 'Y'
    ops.append(f'{internal_dtype} out = 0.0;')

    if large_int:
        uint_t = 'size_t'
        int_t = 'ptrdiff_t'
    else:
        uint_t = 'unsigned int'
        int_t = 'int'

    # determine strides for x along each axis
    for j in range(ndim):
        ops.append(f'const {int_t} xsize_{j} = x.shape()[{j}];')
    ops.append(f'const {uint_t} sx_{ndim - 1} = 1;')
    for j in range(ndim - 1, 0, -1):
        ops.append(f'const {uint_t} sx_{j - 1} = sx_{j} * xsize_{j};')

    if not omit_in_coord:
        # create in_coords array to store the unraveled indices
        ops.append(_unravel_loop_index(yshape, uint_t))

    # compute the transformed (target) coordinates, c_j
    ops = ops + coord_func(ndim, nprepad)

    if cval is numpy.nan:
        cval = '(Y)CUDART_NAN'
    elif cval == numpy.inf:
        cval = '(Y)CUDART_INF'
    elif cval == -numpy.inf:
        cval = '(Y)(-CUDART_INF)'
    else:
        cval = f'({internal_dtype}){cval}'

    if mode == 'constant':
        # use cval if coordinate is outside the bounds of x
        _cond = ' || '.join(
            [f'(c_{j} < 0) || (c_{j} > xsize_{j} - 1)' for j in range(ndim)])
        ops.append(f'''
        if ({_cond})
        {{
            out = {cval};
        }}
        else
        {{''')

    if order == 0:
        if mode == 'wrap':
            ops.append('double dcoord;')  # mode 'wrap' requires this to work
        for j in range(ndim):
            # determine nearest neighbor
            if mode == 'wrap':
                ops.append(f'''
                dcoord = c_{j};''')
            else:
                ops.append(f'''
                {int_t} cf_{j} = ({int_t})floor((double)c_{j} + 0.5);''')

            # handle boundary
            if mode != 'constant':
                if mode == 'wrap':
                    ixvar = 'dcoord'
                    float_ix = True
                else:
                    ixvar = f'cf_{j}'
                    float_ix = False
                ops.append(
                    _util._generate_boundary_condition_ops(
                        mode, ixvar, f'xsize_{j}', int_t, float_ix))
                if mode == 'wrap':
                    ops.append(f'''
                    {int_t} cf_{j} = ({int_t})floor(dcoord + 0.5);''')

            # sum over ic_j will give the raveled coordinate in the input
            ops.append(f'''
            {int_t} ic_{j} = cf_{j} * sx_{j};''')
        _coord_idx = ' + '.join([f'ic_{j}' for j in range(ndim)])
        if mode == 'grid-constant':
            _cond = ' || '.join([f'(ic_{j} < 0)' for j in range(ndim)])
            ops.append(f'''
            if ({_cond}) {{
                out = {cval};
            }} else {{
                out = ({internal_dtype})x[{_coord_idx}];
            }}''')
        else:
            ops.append(f'''
            out = ({internal_dtype})x[{_coord_idx}];''')

    elif order == 1:
        for j in range(ndim):
            # get coordinates for linear interpolation along axis j
            ops.append(f'''
            {int_t} cf_{j} = ({int_t})floor((double)c_{j});
            {int_t} cc_{j} = cf_{j} + 1;
            {int_t} n_{j} = (c_{j} == cf_{j}) ? 1 : 2;  // points needed
            ''')

            if mode == 'wrap':
                ops.append(f'''
                double dcoordf = c_{j};
                double dcoordc = c_{j} + 1;''')
            else:
                # handle boundaries for extension modes.
                ops.append(f'''
                {int_t} cf_bounded_{j} = cf_{j};
                {int_t} cc_bounded_{j} = cc_{j};''')

            if mode != 'constant':
                if mode == 'wrap':
                    ixvar = 'dcoordf'
                    float_ix = True
                else:
                    ixvar = f'cf_bounded_{j}'
                    float_ix = False
                ops.append(
                    _util._generate_boundary_condition_ops(
                        mode, ixvar, f'xsize_{j}', int_t, float_ix))

                ixvar = 'dcoordc' if mode == 'wrap' else f'cc_bounded_{j}'
                ops.append(
                    _util._generate_boundary_condition_ops(
                        mode, ixvar, f'xsize_{j}', int_t, float_ix))
                if mode == 'wrap':
                    ops.append(
                        f'''
                        {int_t} cf_bounded_{j} = ({int_t})floor(dcoordf);
                        {int_t} cc_bounded_{j} = ({int_t})floor(dcoordf + 1);
                        '''
                    )

            ops.append(f'''
            for (int s_{j} = 0; s_{j} < n_{j}; s_{j}++)
            {{
                W w_{j};
                {int_t} ic_{j};
                if (s_{j} == 0)
                {{
                    w_{j} = (W)cc_{j} - c_{j};
                    ic_{j} = cf_bounded_{j} * sx_{j};
                }} else
                {{
                    w_{j} = c_{j} - (W)cf_{j};
                    ic_{j} = cc_bounded_{j} * sx_{j};
                }}''')

    elif order > 1:
        if mode == 'grid-constant':
            spline_mode = 'constant'
        elif mode == 'nearest':
            spline_mode = 'nearest'
        else:
            spline_mode = _spline_prefilter_core._get_spline_mode(mode)

        # wx, wy are temporary variables used during spline weight computation
        ops.append(f'''
        W wx, wy;
        {int_t} start;''')
        for j in range(ndim):
            # determine weights along the current axis
            ops.append(f'''
            W weights_{j}[{order + 1}];''')
            ops.append(spline_weights_inline[order].format(j=j, order=order))

            # get starting coordinate for spline interpolation along axis j
            if mode in ['wrap']:
                ops.append(f'double dcoord = c_{j};')
                coord_var = 'dcoord'
                ops.append(
                    _util._generate_boundary_condition_ops(
                        mode, coord_var, f'xsize_{j}', int_t, True))
            else:
                coord_var = f'(double)c_{j}'

            if order & 1:
                op_str = '''
                start = ({int_t})floor({coord_var}) - {order_2};'''
            else:
                op_str = '''
                start = ({int_t})floor({coord_var} + 0.5) - {order_2};'''
            ops.append(
                op_str.format(
                    int_t=int_t, coord_var=coord_var, order_2=order // 2
                ))

            # set of coordinate values within spline footprint along axis j
            ops.append(f'''{int_t} ci_{j}[{order + 1}];''')
            for k in range(order + 1):
                ixvar = f'ci_{j}[{k}]'
                ops.append(f'''
                {ixvar} = start + {k};''')
                ops.append(
                    _util._generate_boundary_condition_ops(
                        spline_mode, ixvar, f'xsize_{j}', int_t))

            # loop over the order + 1 values in the spline filter
            ops.append(f'''
            W w_{j};
            {int_t} ic_{j};
            for (int k_{j} = 0; k_{j} <= {order}; k_{j}++)
            {{
                w_{j} = weights_{j}[k_{j}];
                ic_{j} = ci_{j}[k_{j}] * sx_{j};
            ''')

    if order > 0:
        _weight = ' * '.join([f'w_{j}' for j in range(ndim)])
        _coord_idx = ' + '.join([f'ic_{j}' for j in range(ndim)])
        if mode == 'grid-constant' or (order > 1 and mode == 'constant'):
            _cond = ' || '.join([f'(ic_{j} < 0)' for j in range(ndim)])
            ops.append(f'''
            if ({_cond}) {{
                out += {cval} * ({internal_dtype})({_weight});
            }} else {{
                {internal_dtype} val = ({internal_dtype})x[{_coord_idx}];
                out += val * ({internal_dtype})({_weight});
            }}''')
        else:
            ops.append(f'''
            {internal_dtype} val = ({internal_dtype})x[{_coord_idx}];
            out += val * ({internal_dtype})({_weight});''')
        ops.append('}' * ndim)

    if mode == 'constant':
        ops.append('}')

    if integer_output:
        ops.append('y = (Y)rint((double)out);')
    else:
        ops.append('y = (Y)out;')
    operation = '\n'.join(ops)

    mode_str = mode.replace('-', '_')  # avoid hyphen in kernel name
    name = 'cupyx_scipy_ndimage_interpolate_{}_order{}_{}_{}d_y{}'.format(
        name, order, mode_str, ndim, '_'.join([f'{j}' for j in yshape]),
    )
    if uint_t == 'size_t':
        name += '_i64'
    return operation, name


@cupy._util.memoize(for_each_device=True)
def _get_map_kernel(ndim, large_int, yshape, mode, cval=0.0, order=1,
                    integer_output=False, nprepad=0):
    in_params = 'raw X x, raw W coords'
    out_params = 'Y y'
    operation, name = _generate_interp_custom(
        coord_func=_get_coord_map,
        ndim=ndim,
        large_int=large_int,
        yshape=yshape,
        mode=mode,
        cval=cval,
        order=order,
        name='map',
        integer_output=integer_output,
        nprepad=nprepad,
        omit_in_coord=True,  # input image coordinates are not needed
    )
    return cupy.ElementwiseKernel(in_params, out_params, operation, name,
                                  preamble=math_constants_preamble)

There, the code is generated dynamically based on many attributes, such as border mode, interpolation order, and the number of dimensions, which makes it very flexible but quite hard to follow. I don't see a way to easily adapt those cases to this approach.

jakirkham (Member Author) commented:

> This seems reasonable for quite a few of the kernels we have.

Indeed this was my hope and thinking. Glad to hear you agree :)

> Would you want to use this approach for even more trivial cases like the following two?

I'm not sure. That probably also depends on whether this kernel is useful in CUDA itself. As you hint, it is probably not worth it for the trivial cases, but I could be wrong.

For more context, I picked this particular kernel since it was an easy one to try this idea out on, but still with enough meat on it to learn a few things along the way. It turned out to not be that difficult, so I am encouraged that we may be able to explore this strategy more.

> A slightly more involved example would be the following where there are two parameters that affect the generated code.

Thanks for the suggestion! Can take a closer look.

> The first part of the kernel is always the same, but the final portion can sort the eigenvalues in ascending or descending value, optionally via absolute values.

Interesting. Maybe the final portion could be a second templated CUDA function called by the first?

> There, the code is generated dynamically based on many attributes such as border mode, interpolation order and the number of dimensions which makes it very flexible, but quite hard to follow. I don't see a way to easily adapt those cases to this approach.

Indeed. It probably can be done, though we may need to resort to #define macros. That said, as it is not low-hanging fruit, it is probably best to save it until the end or handle it separately, unless there is a compelling need to move it over to CUDA sooner.
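One hedged way the macro idea could look: keep the invariant part of the kernel in a .cu file and let Python prepend a couple of #define lines before compiling. The sketch below only assembles the source string (no GPU required), and every name in it is invented for illustration rather than taken from cuCIM:

```python
# Hypothetical: the invariant kernel body would live in a .cu file and refer
# to SORT_ASCENDING / ABS_SORT macros; Python only picks the configuration.
KERNEL_BODY = """
#if ABS_SORT
#define KEY(x) fabs(x)
#else
#define KEY(x) (x)
#endif
// ... shared eigenvalue code, then a sort using KEY() and SORT_ASCENDING ...
"""

def build_source(sort="ascending", abs_sort=False):
    # Prepend configuration macros, then the shared (file-resident) body.
    defines = [
        f"#define SORT_ASCENDING {1 if sort == 'ascending' else 0}",
        f"#define ABS_SORT {1 if abs_sort else 0}",
    ]
    return "\n".join(defines) + "\n" + KERNEL_BODY

# Usage: one body, many configurations, each a distinct compiled source.
src = build_source(sort="descending", abs_sort=True)
```

The appeal of this shape is that the CUDA text stays in one editable file while Python code shrinks to a thin configuration layer.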

jakirkham (Member Author) commented:

rerun tests

(one of the CI jobs hung)

grlee77 (Contributor) left a review comment:

Thanks, this looks good. I agree the #include approach is preferable to parsing the file in Python.

@jakirkham
Copy link
Member Author

Thanks for giving this another look Greg 🙏

Given we have settled on a working pattern, I may try to do this in a few more places.
