FBP filtering on GPU #423

tsadakane · 2023-01-08T12:10:52Z

Summary

Implemented FBP filtering on GPU using cuFFT.

Test

Condition

Software	Version
Windows	10 64bit
Python(anaconda)	3.10.4
CUDA	11.8
Visual Studio	2022
GPU	GTX1060 6GB

Software	Version
Windows	10 64bit
MATLAB	2022a
CUDA	11.8
Visual Studio	2022
GPU	GTX1060 6GB

Python

import numpy as np
import time

import tigre
import tigre.algorithms as algs
from tigre.utilities import sample_loader
from tigre.utilities.Measure_Quality import Measure_Quality
import tigre.utilities.gpu as gpu
import matplotlib.pyplot as plt

listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))

gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)

mag = 4
# Geometry
# geo1 = tigre.geometry(mode='cone', high_resolution=False, default=True)
geo = tigre.geometry(mode="cone", nVoxel=np.array([256, 256, 256]), default=True)
geo.nDetector = np.array([256, 256])*mag
geo.dDetector = np.array([1.6, 1.6])/mag   # size of each pixel            (mm)
geo.sDetector = geo.dDetector * geo.nDetector
# print(geo)

nangles = 100
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)

# Prepare projection data
head = sample_loader.load_head_phantom(geo.nVoxel)
proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
print("proj size = {}".format(proj.shape))

# Reconstruct
niter = 10
print("niter = {}".format(niter))
timer = time.perf_counter
tic = timer()
for _ in range(niter):
    fdkout = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=False)
toc = timer()
print("CPU: {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    ossart = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=True)
toc = timer()
print("GPU: {}".format(toc-tic))


# Measure Quality
# 'RMSE', 'MSSIM', 'SSD', 'UQI'
print("RMSE fdk CPU: ")
print(Measure_Quality(fdkout, head, ["nRMSE"]))
print("RMSE fdk GPU: ")
print(Measure_Quality(ossart, head, ["nRMSE"]))

# Plot
fig, axes = plt.subplots(3, 3)
axes[0, 0].set_title("FDK FFT on CPU")
axes[0, 0].imshow(fdkout[geo.nVoxel[0] // 2])
axes[1, 0].imshow(fdkout[:, geo.nVoxel[1] // 2, :])
axes[2, 0].imshow(fdkout[:, :, geo.nVoxel[2] // 2])
axes[0, 1].set_title("FDK FFT on GPU")
axes[0, 1].imshow(ossart[geo.nVoxel[0] // 2])
axes[1, 1].imshow(ossart[:, geo.nVoxel[1] // 2, :])
axes[2, 1].imshow(ossart[:, :, geo.nVoxel[2] // 2])
axes[0, 2].set_title("GPU-CPU")
axes[0, 2].imshow(ossart[geo.nVoxel[0] // 2]      -fdkout[geo.nVoxel[0] // 2])
axes[1, 2].imshow(ossart[:, geo.nVoxel[1] // 2, :]-fdkout[:, geo.nVoxel[1] // 2, :])
axes[2, 2].imshow(ossart[:, :, geo.nVoxel[2] // 2]-fdkout[:, :, geo.nVoxel[2] // 2])
plt.show()
# tigre.plotProj(proj)
# tigre.plotImg(fdkout)

Matlab

clear;
close all;
%% Geometry
geo=defaultGeometry('nVoxel',[512;512;256]);                     

%% Load data and generate projections 
% define angles
angles=linspace(0,2*pi,100);
% Load thorax phatom data
head=headPhantom(geo.nVoxel);
% generate projections
projections=Ax(head,geo,angles,'interpolated');
sprintf("size of projections = %s", mat2str(size(projections)))

niter = 10;
sprintf("niter = %d", niter)

disp("CPU")
tic
for ii=1:niter
% FDK with CPU filtering
imgFDK_CPU=FDK(projections,geo,angles,'usegpufft',false);
end
toc
disp("GPU")
tic
for ii=1:niter
% FDK with GPU filtering
imgFDK_GPU=FDK(projections,geo,angles,'usegpufft',true);
end
toc
% Show the results
%plotImg([imgFDK_CPU,imgFDK_GPU],'Dim','Z');
%plotImg([imgFDK_GPU],'Dim','Z');
imshow([imgFDK_CPU(:,:,50), imgFDK_GPU(:,:,50), imgFDK_CPU(:,:,50)- imgFDK_GPU(:,:,50)]);

Discussion

This code consumes more time than CPU even when the projections are size of 2048x2048.

I found that

TIGRE/Common/CUDA/FbpFiltration.cu

Lines 109 to 113 in 98fa77f

    
           for (int iV = 0; iV < uiVLen; ++iV) { 
        
               for (int iU = 0; iU < uiULen; ++iU) { 
        
                   h_pcfInOut[iU+uiULen*iV] = cufftComplex{pfIn[iU+uiULen*iV], 0}; 
        
               } 
        
           }

and

TIGRE/Common/CUDA/FbpFiltration.cu

Lines 158 to 162 in 98fa77f

    
           for (int iV = 0; iV < uiVLen; ++iV) { 
        
               for (int iU = 0; iU < uiULen; ++iU) { 
        
                   pfOut[iU+uiULen*iV] = h_pcfInOut[iU+uiULen*iV].x; 
        
               } 
        
           }

consumes time, while FFT and multiplying the filter in the Fourier space is fast enough.
These are necessary as long as we use FFT C2C.
R2C and C2R may work.

As for now, the CUDA codes are common for Matlab and Python because filtering.m transposes the matrix before calling fft.
Only one GPU is used.
Only one stream is used.

tsadakane · 2023-01-08T16:14:59Z

R2C and C2R may work.

Worked. (a39e6b2)

Results of python test code

Conditions

0: NVIDIA GeForce GTX 1060 6GB
{'name': 'NVIDIA GeForce GTX 1060 6GB', 'devices': [0]}
proj size = (100, 1024, 1024)
niter = 10

Before

CPU: 24.03577409999707 s
GPU: 28.770307099999627 s
RMSE fdk CPU:
526.5072514306678
RMSE fdk GPU:
526.5072768954007

After

CPU: 24.188547600002494 s
GPU: 12.808391700003995 s
RMSE fdk CPU:
526.5072514306678
RMSE fdk GPU:
526.5072259659336

tsadakane · 2023-01-08T16:20:53Z

@AnderBiguri Do you think we should keep the option to use/not to use GPU for FBP filtering for filtering and FDK function?

AnderBiguri · 2023-01-08T20:47:40Z

@tsadakane Fantastic work, honestly!

I am actually surprised that GPU is not much faster, but still the speedup achieved is quite significant. Perhaps you are running in a CPU with many cores?

I will test this on my side in a couple of machines (old/new) and report speed (sometime next week). I think if we can more or less verify that the filtering is always faster on GPU, we can just remove the CPU option from FBP/FDK. We can always leave the matlab/python code there, deprecated, for people to be able to read what the CUDA part does, but in an easier mode. What do you think?

tsadakane · 2023-01-09T13:22:51Z

@AnderBiguri Thank you for your reply. I think it would be better to remove the old code if possible. Let me know the result and I can remove the option before submitting this PR.

I am actually surprised that GPU is not much faster, but still the speedup achieved is quite significant. Perhaps you are running in a CPU with many cores?

The CPU is i7-11700 (8 cores, 16 threads).
I found that only about 2 sec. in 12.8 sec. of GPU was creating and destroying FFT "plan"s and FFT, multipling filter and IFFT, other 7 sec. is consumed before and after calling the host function in FbpFiltration.cu, apply_filtration.

AnderBiguri · 2023-01-10T09:03:38Z

@tsadakane fantastic, I think its not a bad idea, it will live in git's history anyway, so you are right.
I will test on a few different machines to ensure we can reliably say the GPU is faster all the time.

In your time comments, you are saying that most of the time (7s) is on the FFT maths, and the rest is memory and context management?

AnderBiguri · 2023-01-10T09:52:57Z

In my i7-7700HQ/GTX 1070 laptop, GPU is often slower:

MATLAB:

geo.nDetect 256^2, nagles=100, niter=10:

CPU
Elapsed time is 6.132782 seconds.
GPU
Elapsed time is 8.440495 seconds.

geo.nDetect 512^2, nagles=100, niter=10:

CPU
Elapsed time is 16.697335 seconds.
GPU
Elapsed time is 18.622791 seconds.

geo.nDetect 1024^2, nagles=100, niter=10:

CPU
Elapsed time is 54.229799 seconds.
GPU
Elapsed time is 51.912400 seconds.

I still am quite surprised by this performance, I really did think it would be much faster. I'll try to dig a bit further, I am not sure if there is something extra we can optimize in the code that we may not be aware of, or if my assumption that it would be much faster is seriously flawed.

tsadakane · 2023-01-10T11:14:24Z

In your time comments, you are saying that most of the time (7s) is on the FFT maths, and the rest is memory and context management?

Sorry for unclear English.
I meant...

2 sec. is FFT, multiplying filter, and IFFT. (*)
7 sec. is python<->cuda interoperability stuff. (**)
other is cuda memory allocation and transfer.

(*) In apply_filtration, commenting out the lines of creating and destroying "plan", which nvidia calls the FFT object, and the lines in the curly bracket, the elapsed time reduced 2sec from 12sec. (became 10sec).
In the 2 sec., 0.7 sec is for FFT, multiplying filter and IFFT, and 1.4 sec. is for creating/destroying the plan.

(**) Returning at the beginning of apply_filtration, the time became 7sec from 12sec.

AnderBiguri · 2023-01-10T11:40:26Z

@tsadakane I see, thanks. So if we were to try to optimize this we'd need to look at the python<->cuda interoperability. I need to sit down and read the code more carefully, as this seems surprisingly high? What do you think?

tsadakane · 2023-01-10T15:21:48Z

I have no idea so far.

Commenting out all but the last line of filtering function in filtering.py, which is called by FDK before Atb is called, the time became 4sec.
In the filtering function in fintering.py, I modified as

...
    if use_gpu:
        bundle_size = 32
        fproj=np.empty((bundle_size, geo.nDetector[0],filt_len),dtype=np.float32)
        for idx_bundle in range(0,(angles.shape[0]+bundle_size-1)//bundle_size):
            fproj.fill(0)
            for k in range(0, bundle_size):
                if idx_bundle*bundle_size+k < angles.shape[0]:
                    fproj[k, :,padding:padding+geo.nDetector[1]]=proj[idx_bundle*bundle_size+k]

            fproj = fproj.reshape([bundle_size*geo.nDetector[0], filt_len])
            fproj = fbpfiltration.apply(fproj, filt, gpuids)

            fproj = fproj.reshape([bundle_size, geo.nDetector[0], filt_len])
            for k in range(0, bundle_size):
                if idx_bundle*bundle_size+k < angles.shape[0]:
                    proj[idx_bundle*bundle_size+k]=fproj[k, :,padding:padding+geo.nDetector[1]] * scale_factor
    else:
...

to reduce the number of calling cuda function to 1/32, but the total time did not change (12s).

tsadakane · 2023-01-12T15:15:55Z

I wrote multi-gpu version and cudaMemcpyAsync version, but as expected, the both were no effect.

tsadakane · 2023-01-12T15:16:02Z

Just removing padding from filtering function in filtering.py reduced 3 sec., so I moved the padding into cuda function.
This reduces the amount of memory to transfer. For effective strided data copy, I used cudaMemcpy2D, then GPU is now three times faster than GPU.

0: NVIDIA GeForce GTX 1060 6GB
{'name': 'NVIDIA GeForce GTX 1060 6GB', 'devices': [0]}
proj size = (100, 1024, 1024)
niter = 10
CPU: 23.939920199998596
GPU: 7.977493500002311
RMSE fdk CPU: 
526.5072514306678
RMSE fdk GPU: 
526.5072514306678

filtering.py will be as

def filtering(proj, geo, angles, parker, verbose=False, use_gpu=True, gpuids=None):
    if parker:
        proj=parkerweight(proj.transpose(0,2,1),geo,angles,parker).transpose(0,2,1)

    filt_len=max(64,2**nextpow2(2*max(geo.nDetector)))
    ramp_kernel=ramp_flat(filt_len)

    d=1
    filt=filter(geo.filter,ramp_kernel[0],filt_len,d,verbose=verbose)

    padding = int((filt_len-geo.nDetector[1])//2 )
    scale_factor = (geo.DSD[0]/geo.DSO[0]) * (2 * np.pi/ len(angles)) / ( 4 * geo.dDetector[0] ) 
    if use_gpu:
        bundle_size = 32 #len(gpuids)
        n_bundles = (angles.shape[0]+bundle_size-1)//bundle_size
        for idx_bundle in range(0,n_bundles):
            bundle_size_actual = bundle_size if idx_bundle != n_bundles-1 else angles.shape[0]-idx_bundle*bundle_size
            idx_begin = idx_bundle*bundle_size
            idx_end = idx_bundle*bundle_size+bundle_size_actual
            proj[idx_begin:idx_end] = fbpfiltration.apply2(proj, idx_begin, idx_end, filt, scale_factor, gpuids) 

    else:
        filt=np.kron(np.ones((np.int64(geo.nDetector[0]),1),dtype=np.float32),filt)

        #filter 2 projection at a time packing in to complex container
        fproj=np.empty((geo.nDetector[0],filt_len),dtype=np.complex64)
        for i in range(0,angles.shape[0]-1,2):
            fproj.fill(0)
            fproj.real[:,padding:padding+geo.nDetector[1]]=proj[i]
            fproj.imag[:,padding:padding+geo.nDetector[1]]=proj[i+1]

            fproj=fft(fproj,axis=1)
            fproj=fproj*filt
            fproj=ifft(fproj,axis=1)

            proj[i]=fproj.real[:,padding:padding+geo.nDetector[1]] * scale_factor
            proj[i+1]=fproj.imag[:,padding:padding+geo.nDetector[1]] * scale_factor

        #if odd number of projections filter last solo
        if angles.shape[0] % 2:
            fproj.fill(0)
            fproj.real[:,padding:padding+geo.nDetector[1]]=proj[angles.shape[0]-1]

            fproj=fft(fproj,axis=1)
            fproj=fproj*filt
            fproj=np.real(ifft(fproj,axis=1))     
            proj[angles.shape[0]-1]=fproj[:,padding:padding+geo.nDetector[1]] * scale_factor

    return proj

I'm worried that by moving "padding" to cuda, the "if GPU" code is not parallel to CPU. What do you think?

I have not yet run this modification on Matlab and not yet commited.

AnderBiguri · 2023-01-13T10:19:43Z

@tsadakane interesting!

Sorry, I am a bit busy this week, perhaps next week, so I don't have time to look at it with the detail it deserves.
I intend to have a serious look at it as soon as I can.

tsadakane · 2023-01-21T10:44:51Z

About `aa51ec6`

"CPU" uses CPU for convolution.
In "GPU 1", the convolution is processed in a GPU, even if gpuids includes multiple GPUs. (=a39e6b2)
In "GPU 2", the padding and convolution are processed in GPUs.

Tests

Python

test code

import numpy as np
import time

import tigre
import tigre.algorithms as algs
from tigre.utilities import sample_loader
from tigre.utilities.Measure_Quality import Measure_Quality
import tigre.utilities.gpu as gpu
import matplotlib.pyplot as plt

listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))

gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)

mag = 4
# Geometry
# geo1 = tigre.geometry(mode='cone', high_resolution=False, default=True)
geo = tigre.geometry(mode="cone", nVoxel=np.array([256, 256, 256]), default=True)
geo.nDetector = np.array([256, 256])*mag
geo.dDetector = np.array([1.6, 1.6])/mag   # size of each pixel            (mm)
geo.sDetector = geo.dDetector * geo.nDetector
# print(geo)

nangles = 100
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)

# Prepare projection data
head = sample_loader.load_head_phantom(geo.nVoxel)
proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
print("proj size = {}".format(proj.shape))

# Reconstruct
niter = 10
print("niter = {}".format(niter))
timer = time.perf_counter

tic = timer()
for _ in range(niter):
    out_cpu = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=0)
toc = timer()
print("CPU  : {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    out_gpu1 = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=1)
toc = timer()
print("GPU 1: {}".format(toc-tic))

tic = timer()
for _ in range(niter):
    out_gpu2 = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=2)
toc = timer()
print("GPU 2: {}".format(toc-tic))

# Measure Quality
# 'RMSE', 'MSSIM', 'SSD', 'UQI'
print("RMSE fdk CPU: ")
print(Measure_Quality(out_cpu, head, ["nRMSE"]))
print("RMSE fdk GPU 1: ")
print(Measure_Quality(out_gpu1, head, ["nRMSE"]))
print("RMSE fdk GPU 2: ")
print(Measure_Quality(out_gpu2, head, ["nRMSE"]))

# Plot
fig, axes = plt.subplots(3, 5)
axes[0, 0].set_title("FDK FFT on CPU")
axes[0, 0].imshow(out_cpu[geo.nVoxel[0] // 2])
axes[1, 0].imshow(out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 0].imshow(out_cpu[:, :, geo.nVoxel[2] // 2])
axes[0, 1].set_title("FDK FFT on GPU")
axes[0, 1].imshow(out_gpu1[geo.nVoxel[0] // 2])
axes[1, 1].imshow(out_gpu1[:, geo.nVoxel[1] // 2, :])
axes[2, 1].imshow(out_gpu1[:, :, geo.nVoxel[2] // 2])
axes[0, 2].set_title("FDK FFT on GPU 2")
axes[0, 2].imshow(out_gpu2[geo.nVoxel[0] // 2])
axes[1, 2].imshow(out_gpu2[:, geo.nVoxel[1] // 2, :])
axes[2, 2].imshow(out_gpu2[:, :, geo.nVoxel[2] // 2])
axes[0, 3].set_title("GPU1-CPU")
axes[0, 3].imshow(out_gpu1[geo.nVoxel[0] // 2]      -out_cpu[geo.nVoxel[0] // 2])
axes[1, 3].imshow(out_gpu1[:, geo.nVoxel[1] // 2, :]-out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 3].imshow(out_gpu1[:, :, geo.nVoxel[2] // 2]-out_cpu[:, :, geo.nVoxel[2] // 2])
axes[0, 4].set_title("GPU2-CPU")
axes[0, 4].imshow(out_gpu2[geo.nVoxel[0] // 2]      -out_cpu[geo.nVoxel[0] // 2])
axes[1, 4].imshow(out_gpu2[:, geo.nVoxel[1] // 2, :]-out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 4].imshow(out_gpu2[:, :, geo.nVoxel[2] // 2]-out_cpu[:, :, geo.nVoxel[2] // 2])
plt.show()

Result

Software	Version
Windows	10 64bit
Python(Intel)	3.9.15
CUDA	11.8
Visual Studio	2022
GPU	2x NVIDIA GeForce RTX 2080 Ti

0: NVIDIA GeForce RTX 2080 Ti
1: NVIDIA GeForce RTX 2080 Ti
2: NVIDIA GeForce GTX 1070
{'name': 'NVIDIA GeForce RTX 2080 Ti', 'devices': [0, 1]}
proj size = (100, 1024, 1024)
niter = 10
[path to tigre working dir.]\Python\tigre\utilities\filtering.py:80: RuntimeWarning: divide by zero encountered in true_divide
  h[odd] = -1 / (np.pi * nn[odd]) ** 2
CPU  : 35.7683762
GPU 1: 19.560339300000003
GPU 2: 12.197235299999996
RMSE fdk CPU: 
526.5050614590306
RMSE fdk GPU 1: 
526.504934134818
RMSE fdk GPU 2: 
526.504934134818

Malab

Software	Version
Windows	10 64bit
MATLAB	2022a
CUDA	11.8
Visual Studio	2022
GPU	2x NVIDIA GeForce RTX 2080 Ti

Test code

clear;
close all;
%% Geometry
geo=defaultGeometry('nVoxel',[512;512;256], "nDetector", [1024;1024]);                     

%% GPUs
gpuids = GpuIds('NVIDIA GeForce RTX 2080 Ti');
disp(gpuids);
disp(gpuids.devices);

%% Load data and generate projections 
% define angles
angles=linspace(0,2*pi,100);
% Load thorax phatom data
head=headPhantom(geo.nVoxel);
% generate projections
projections=Ax(head,geo,angles,'interpolated', 'gpuids', gpuids);
sprintf("size of projections = %s", mat2str(size(projections)))

niter = 10;%10;
sprintf("niter = %d", niter)

disp("CPU")
tic
for ii=1:niter
% FDK with CPU filtering
imgFDK_CPU=FDK(projections,geo,angles,'usegpufft',0, 'gpuids', gpuids);
end
toc
disp("GPU 1")
tic
for ii=1:niter
% FDK with GPU filtering
imgFDK_GPU1=FDK(projections,geo,angles,'usegpufft',1, 'gpuids', gpuids);
end
toc
disp("GPU 2")
tic
for ii=1:niter
% FDK with GPU filtering
imgFDK_GPU2=FDK(projections,geo,angles,'usegpufft',2, 'gpuids', gpuids);
end
toc
% Show the results
%plotImg([imgFDK_CPU,imgFDK_GPU],'Dim','Z');
%plotImg([imgFDK_GPU],'Dim','Z');
imshow([imgFDK_CPU(:,:,50), imgFDK_GPU1(:,:,50), imgFDK_GPU2(:,:,50), abs(imgFDK_GPU1(:,:,50) - imgFDK_CPU(:,:,50)), abs(imgFDK_GPU2(:,:,50)- imgFDK_CPU(:,:,50))]);
%imshow([imgFDK_GPU(:,:,2)]);

Result

  GpuIds with properties:

       name: 'NVIDIA GeForce RTX 2080 Ti'
    devices: [0 1]
   0   1
ans = 
    "size of projections = [1024 1024 100]"
ans = 
    "niter = 10"
CPU
Elapsed time is 35.826591 seconds.
GPU 1
Elapsed time is 36.107104 seconds.
GPU 2
Elapsed time is 29.146955 seconds.

tsadakane · 2023-01-21T11:58:35Z

The table below is the summary of the results above.
The result of python is acceptable for me, but the result of matlab is hard to understand. At first, I suspected that only one GPU is used due to a bug, but when I checked, deviceCount in apply_filtration2 function was 2.

	Python [s]	Matlab [s]
CPU	35.8	35.8
GPU 1 (Filtering only)	19.6	36.1
GPU 2 (Padding and Filtering)	12.2	29.1

Here, two GPUs are used.

tsadakane · 2023-01-22T03:11:15Z

I ran the python test program above after adding gpuids.devices.pop(-1) to use only one GPU.

Python

	2x GPU [s],	1x GPU [s]
CPU	35.9	33.8
GPU 1 (Filtering only)	19.2	16.8
GPU 2 (Padding and Filtering)	12.1	10.4

It seems that the overhead of the copying is larger than the benefit of doubling the GPU resource.

AnderBiguri · 2023-01-25T16:46:50Z

@tsadakane that makes sense. Stil surprised about the times, but I think its my own false assumptions of how fast it should be. Still really good :)

AnderBiguri · 2023-01-26T14:10:18Z

I am also a bit baffled by the MATLAB slowness, maybe something to do with extra copies? let me check.

tsadakane · 2023-01-26T15:16:45Z

I think matlab code (copies &) transposes the matrix; the fastest variable in the memory is v in matlab while in python it is u, but in the cuda code, there is no difference between matlab and python.

AnderBiguri · 2023-01-26T15:21:58Z

@tsadakane ah! We may be able to do then the same that we do with the other cuda code? #if IS_FOR_MATLAB_TIGRE kind of thing? I am not entirely sure where this is applied in the new CUDA code, happy to have a look if you want, but if you know where it happens in CUDA, feel free to do it yourself.

Thanks again, your work is really appreciated!

tsadakane · 2023-01-28T01:56:35Z

I can do it, but I am not sure if it works, because cuda is not good at transposing.

It was not necessary to transpose the projection in ax and atb, because it is copied to the 3d texture, where there is no difference in the memory access efficiency between u and v axes. On the other hand, the convolution must be applied to the u-direction, so transposing in the memory layout is necessary.

That's said, I think, from the point of view of symmetry, permutation line should be moved to the filtering function if possible, i.e

TIGRE/MATLAB/Algorithms/FDK.m

Lines 61 to 77 in f9a38ed

    
           %% Weight 
        
           proj=permute(proj,[2 1 3]); 
        
           for ii=1:size(angles,2) 
        
               us = ((-geo.nDetector(1)/2+0.5):1:(geo.nDetector(1)/2-0.5))*geo.dDetector(1) + offset(1,ii); 
        
               vs = ((-geo.nDetector(2)/2+0.5):1:(geo.nDetector(2)/2-0.5))*geo.dDetector(2) + offset(2,ii); 
        
               [uu,vv] = meshgrid(us,vs); % detector 
        
               % Create weight according to each detector element 
        
               w = (geo.DSD(ii))./sqrt((geo.DSD(ii))^2+uu.^2 + vv.^2); 
        
               % Multiply the weights with projection data 
        
               proj(:,:,ii) = proj(:,:,ii).*w'; 
        
           end 
        
           %% Fourier transform based filtering 
        
           proj = filtering(proj,geo,angles,parker); % Not sure if offsets are good in here

=>

%% Weight
% proj=permute(proj,[2 1 3]);  %%%%%%%%%% Detele
for ii=1:size(angles,2)
    us = ((-geo.nDetector(1)/2+0.5):1:(geo.nDetector(1)/2-0.5))*geo.dDetector(1) + offset(1,ii);
    vs = ((-geo.nDetector(2)/2+0.5):1:(geo.nDetector(2)/2-0.5))*geo.dDetector(2) + offset(2,ii);
    [uu,vv] = meshgrid(us,vs); % detector
    
    % Create weight according to each detector element
    w = (geo.DSD(ii))./sqrt((geo.DSD(ii))^2+uu.^2 + vv.^2);
    
    % Multiply the weights with projection data
    proj(:,:,ii) = w.*proj(:,:,ii);   %%%%%%%%%% Transpose
end
%% Fourier transform based filtering
%proj=permute(proj,[2 1 3]);   %%%%%%%%%% Add this line at the beginning of the filtering function.
proj = filtering(proj,geo,angles,parker); % Not sure if offsets are good in here

This modification makes the code a little bit beautiful, but the specification of the filtering function will be changed (the first arg proj is not transposed).

What do you think?

AnderBiguri · 2023-01-29T20:57:45Z

@tsadakane I see what you mean with the filtering. My convolution/fft maths are a bit rusty, so I may be completely wrong, but isn't there a way to change the shape of the filter itself so it is equivalent to transposing the projections? That would be the ideal solution, if possible, right?

About changing the transpose to inside filtering, 100% agree is much elegant.

AnderBiguri · 2023-03-30T17:04:32Z

Hi @tsadakane, sorry I have been very busy with my life. How is this going? do you need any help? Is it done?

tsadakane · 2023-03-30T23:33:29Z

Hi @AnderBiguri, I've been busy lately too. I've implemented the code to transpose matrices, but I am not satisfied with the result and trying several granularities of processing to pass to mex/cuda.

* Added Common files

* Modified and added Matlab files

* Modified and added Python files

* Use cufftExecR2C and cufftExecC2R to omit slow copy.

* Padding and FBP filtration are applied in `apply_filtration2` function

* Modified matlab files

* Modified python files

* Moved transposing projections into filtering

AnderBiguri · 2023-06-05T13:58:12Z

Aparently making the fft and ifft calls in python have the workers argument can accelerate them. fproj = fft(fproj, axis=1, workers=cores), given cores=os.cpu_count(). Haven't tried it myself.

tsadakane · 2023-06-06T13:57:49Z

Under the condition of

proj size = (100, 1024, 1024)
nVoxel = [256, 512, 512]

On 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz (8 physical core + hyper-threading => cores=16) + GTX1060

0, CPU: Padding & FFT       : 26.59433160000026 s @ 10 times
1, CPU: Multi-Workers       : 15.725976400000036 s @ 10 times

Not bad.

AnderBiguri · 2023-06-06T19:20:27Z

I think the GPU implementation can be valuable too. I think the fastest FDK would be one that does interleave filtering and backprojection. I read a paper about it at some point.

AnderBiguri · 2024-02-29T10:32:08Z

Hi @tsadakane
What do you think about all this? worth merging it for the cases when its faster?

tsadakane · 2024-02-29T15:54:31Z

Hi, @AnderBiguri ,

It's difficult to answer, but I'd like to say no.

Offloading a light weight task like FFT is not worth for the cost of copying back and forth between CPU and GPU (and plus transposing if its run from MATLAB).
Using fft's workers argument of python can achieve speedups comparable to GPU offloading.
TIGRE's flexibility will be impaired.

AnderBiguri · 2024-02-29T15:57:12Z

@tsadakane makes sense!

Flexibility is a bit less important because we can add a flag with default CPU, but I think the rest of the points make total sense.
Indeed, this would make sense if there was a FFT+Backproject in the same CUDA call, but then the multi-GPU code gets complex.

It may be worth leaving the PR here (if you are happy with that), in case anyone in the future wants to use the code?

tsadakane force-pushed the enh_fbp-filtering-on-gpu branch from 98fa77f to 262ac8f Compare January 8, 2023 15:01

tsadakane added 6 commits May 29, 2023 22:16

Add CUDA implementation of FBP filtering

2e76574

* Added Common files

Add CUDA implementation of FBP filtering

235c005

* Modified and added Matlab files

Add CUDA implementation of FBP filtering

249c7a7

* Modified and added Python files

Optimise FbpFiltration.cu

3e9bc03

* Use cufftExecR2C and cufftExecC2R to omit slow copy.

Add another imlementation for CUDA FBP filtration

c0245a4

* Padding and FBP filtration are applied in `apply_filtration2` function

Add another imlementation for CUDA FBP filtration

054de44

* Modified matlab files

tsadakane added 4 commits May 29, 2023 22:20

Add another imlementation for CUDA FBP filtration

18c7680

* Modified python files

Make FPD filtering support multiple GPUs

46157e4

Refactoring Matlab FDK and FBP

aed4379

* Moved transposing projections into filtering

Fix FDK option usegpufft option not working

7c97471

tsadakane force-pushed the enh_fbp-filtering-on-gpu branch from bb7e0d7 to 7c97471 Compare May 29, 2023 13:28

AnderBiguri mentioned this pull request Sep 11, 2023

Conflict with CUDA Context after modifying FDK filtering step with CuPy and calling tigre.Atb() #484

Closed

AnderBiguri mentioned this pull request Sep 19, 2023

Why the Tigre module doesn't use GPU fully 100%? #488

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FBP filtering on GPU #423

FBP filtering on GPU #423

tsadakane commented Jan 8, 2023 •

edited

tsadakane commented Jan 8, 2023 •

edited

tsadakane commented Jan 8, 2023

AnderBiguri commented Jan 8, 2023

tsadakane commented Jan 9, 2023

AnderBiguri commented Jan 10, 2023

AnderBiguri commented Jan 10, 2023

tsadakane commented Jan 10, 2023

AnderBiguri commented Jan 10, 2023

tsadakane commented Jan 10, 2023

tsadakane commented Jan 12, 2023

tsadakane commented Jan 12, 2023

AnderBiguri commented Jan 13, 2023

tsadakane commented Jan 21, 2023 •

edited

tsadakane commented Jan 21, 2023 •

edited

tsadakane commented Jan 22, 2023 •

edited

AnderBiguri commented Jan 25, 2023

AnderBiguri commented Jan 26, 2023

tsadakane commented Jan 26, 2023

AnderBiguri commented Jan 26, 2023

tsadakane commented Jan 28, 2023

AnderBiguri commented Jan 29, 2023

AnderBiguri commented Mar 30, 2023

tsadakane commented Mar 30, 2023

AnderBiguri commented Jun 5, 2023

tsadakane commented Jun 6, 2023

AnderBiguri commented Jun 6, 2023

AnderBiguri commented Feb 29, 2024

tsadakane commented Feb 29, 2024

AnderBiguri commented Feb 29, 2024

	for (int iV = 0; iV < uiVLen; ++iV) {
	for (int iU = 0; iU < uiULen; ++iU) {
	h_pcfInOut[iU+uiULeniV] = cufftComplex{pfIn[iU+uiULeniV], 0};
	}
	}

	for (int iV = 0; iV < uiVLen; ++iV) {
	for (int iU = 0; iU < uiULen; ++iU) {
	pfOut[iU+uiULeniV] = h_pcfInOut[iU+uiULeniV].x;
	}
	}

FBP filtering on GPU #423

Are you sure you want to change the base?

FBP filtering on GPU #423

Conversation

tsadakane commented Jan 8, 2023 • edited

Summary

Test

Condition

Python

Matlab

Discussion

tsadakane commented Jan 8, 2023 • edited

Results of python test code

Conditions

Before

After

tsadakane commented Jan 8, 2023

AnderBiguri commented Jan 8, 2023

tsadakane commented Jan 9, 2023

AnderBiguri commented Jan 10, 2023

AnderBiguri commented Jan 10, 2023

tsadakane commented Jan 10, 2023

AnderBiguri commented Jan 10, 2023

tsadakane commented Jan 10, 2023

tsadakane commented Jan 12, 2023

tsadakane commented Jan 12, 2023

AnderBiguri commented Jan 13, 2023

tsadakane commented Jan 21, 2023 • edited

About aa51ec6

Tests

Python

test code

Result

Malab

Test code

Result

tsadakane commented Jan 21, 2023 • edited

tsadakane commented Jan 22, 2023 • edited

AnderBiguri commented Jan 25, 2023

AnderBiguri commented Jan 26, 2023

tsadakane commented Jan 26, 2023

AnderBiguri commented Jan 26, 2023

tsadakane commented Jan 28, 2023

AnderBiguri commented Jan 29, 2023

AnderBiguri commented Mar 30, 2023

tsadakane commented Mar 30, 2023

AnderBiguri commented Jun 5, 2023

tsadakane commented Jun 6, 2023

AnderBiguri commented Jun 6, 2023

AnderBiguri commented Feb 29, 2024

tsadakane commented Feb 29, 2024

AnderBiguri commented Feb 29, 2024

tsadakane commented Jan 8, 2023 •

edited

tsadakane commented Jan 8, 2023 •

edited

tsadakane commented Jan 21, 2023 •

edited

About `aa51ec6`

tsadakane commented Jan 21, 2023 •

edited

tsadakane commented Jan 22, 2023 •

edited