
Adding GPU acceleration to encode_jpeg #8391

Open

wants to merge 6 commits into base: main

Conversation

deekay42
Contributor

Summary:
I'm adding GPU support to the existing torchvision.io.encode_jpeg function. If the input tensors are on the GPU, the CUDA version is used; otherwise, the CPU version is used. Additionally, I'm adding a new function torchvision.io.encode_jpegs (plural) which uses a fused kernel and may be faster than successive calls to the singular version, since each of those calls incurs kernel launch overhead. If it's alright, I'll be happy to refactor decode_jpeg to follow this convention in a follow-up PR.
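The device-based dispatch described above can be sketched as follows. This is a minimal, hypothetical illustration in plain Python, not the actual torchvision implementation; `_encode_cpu`, `_encode_cuda`, and `FakeTensor` are stand-in names invented for this sketch.

```python
def _encode_cpu(image):
    # Stand-in for the libjpeg-based CPU encoding path.
    return b"cpu-encoded"

def _encode_cuda(image):
    # Stand-in for the nvjpeg-based CUDA encoding path.
    return b"cuda-encoded"

class FakeTensor:
    """Minimal stand-in for a torch.Tensor with a device flag."""
    def __init__(self, device):
        self.is_cuda = (device == "cuda")

def encode_jpeg(image):
    # Dispatch on the device of the input tensor, as the PR describes:
    # CUDA path for GPU tensors, CPU path otherwise.
    return _encode_cuda(image) if image.is_cuda else _encode_cpu(image)

print(encode_jpeg(FakeTensor("cuda")))
print(encode_jpeg(FakeTensor("cpu")))
```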

Test Plan:

  1. pytest test -vvv
  2. ufmt format torchvision
  3. flake8 torchvision

Reviewers:

Subscribers:

Tasks:

Tags:


pytorch-bot bot commented Apr 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8391

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit a799c53 with merge base 5181a85:

NEW FAILURES - The following jobs have failed:

  • Lint / bc (gh)
    Process completed with exit code 1.
  • Lint / python-types / linux-job (gh)
    RuntimeError: Command docker exec -t 06ea3095aac97b90662e6b944027fff6f6b2739006bf8d298206535bab35cb15 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Member

@NicolasHug NicolasHug left a comment

Thanks a lot @deekay42 . I made another pass but this looks good!

Review threads (resolved):

  • test/test_image.py (5 threads, outdated)
  • torchvision/io/image.py (3 threads, 1 outdated)
  • torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp (1 thread, outdated)
Contributor

@ahmadsharif1 ahmadsharif1 left a comment

Hi @deekay42,

I work on the video decoder in C++ so @NicolasHug thought that my comments may be useful for this PR.

I hope you find my comments useful, and feel free to push back.

I'm also curious whether you did any benchmarking to see how much speedup we get from hardware decoding or encoding?

ImageReadMode mode,
torch::Device device);

C10_EXPORT std::vector<torch::Tensor> encode_jpeg_cuda(
Contributor

Out of curiosity, why does this interface take in and return a std::vector instead of a single stacked Tensor? I'm asking because in the TorchMedia decoder, our users wanted a single stacked tensor for all the video frames, and doing the torch.stack() operation on a Python list of tensors was time-consuming.

Member

@NicolasHug NicolasHug May 3, 2024

Just chiming in, these are good questions

We need to return a vector instead of a stacked tensor because the encoded jpeg bytes of each input image aren't necessarily of the same length. We wouldn't be able to fit those in a stacked tensor since by definition all the tensors in a stack must have the same shape.

As for why we're using a vector for the input: this is for a similar reason, i.e. the input images may not all be of the same shape. It is possible that users want to pass a stacked tensor of images as input though, e.g. if this stack is the output of a (generative?) model. So eventually we should allow the Python API (image.io.decode_jpeg()) to support a stacked tensor of shape NCHW, and possibly convert that internally to a vec of tensors of shape CHW so it can be passed to this C++ function, but that can be left for another PR and I didn't want to bother Dominik with that right now.
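The point about variable-length outputs can be illustrated with any compressor: compressed size depends on content, so even same-shape inputs produce byte strings of different lengths, which cannot form one rectangular stacked tensor. This sketch uses zlib as a stand-in for JPEG encoding; it is an analogy, not torchvision code.

```python
import zlib

# Two "images" of identical shape (64 bytes each) but different content.
flat = bytes(64)            # all zeros: highly compressible
noisy = bytes(range(64))    # 64 distinct values: barely compressible

# Encoding same-shape inputs yields outputs of different lengths,
# which is why encode_jpegs must return a vector, not a stacked tensor.
encoded = [zlib.compress(flat), zlib.compress(noisy)]
lengths = [len(e) for e in encoded]
print(lengths)
```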

Contributor Author

Nicolas beat me to it, but yes, the reason for the vector is due to heterogeneous input and output tensor shapes.

#include <c10/cuda/CUDAGuard.h>
#include <nvjpeg.h>

nvjpegHandle_t nvjpeg_handle = nullptr;
Contributor

Nit: perhaps rename this to g_nvjpeg_handle so it is clear this is a global variable?

Same for nvjpeg_handle_creation_flag below.

"The number of channels should be 3, got: ",
image.size(0));

// nvjpeg requires images to be contiguous
Contributor

Nit: add a citation link if you can.

ImageReadMode mode,
torch::Device device);

C10_EXPORT std::vector<torch::Tensor> encode_jpeg_cuda(
Contributor

Nit: perhaps the name itself should indicate this is a plurality of images, like maybe encode_jpegs_cuda?


C10_EXPORT std::vector<torch::Tensor> encode_jpeg_cuda(
const std::vector<torch::Tensor>& images,
const int64_t quality);
Contributor

Nit: add a comment about quality. Is higher better or lower? What is the range/min/max here?


for (int c = 0; c < channels; c++) {
target_image.channel[c] = src_image[c].data_ptr<uint8_t>();
// this is why we need contiguous tensors
Contributor

Nit: maybe add a CHECK here to make sure the tensor is contiguous?
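The check the reviewer suggests boils down to verifying that a tensor's strides match the row-major strides implied by its shape. Here is a hedged sketch of that condition in plain Python (the PR itself would use `Tensor::is_contiguous()` in C++; `row_major_strides` and `is_contiguous` below are illustrative helpers, not torchvision APIs).

```python
def row_major_strides(shape):
    # Expected strides, in elements, for a C-contiguous (row-major) layout:
    # the last dimension is stride 1, each earlier stride is the product
    # of all later dimension sizes.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def is_contiguous(shape, strides):
    # A tensor is C-contiguous iff its strides equal the row-major strides.
    return list(strides) == row_major_strides(shape)

print(row_major_strides((3, 4, 5)))           # strides of a contiguous CHW
print(is_contiguous((3, 4, 5), (20, 5, 1)))   # contiguous layout
print(is_contiguous((3, 4, 5), (1, 15, 3)))   # permuted layout: not contiguous
```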

}
}

torch::Tensor encode_single_jpeg(
Contributor

Nit: put this in an anonymous namespace since this function is not public?

}
}

torch::Tensor encode_single_jpeg(
Contributor

Nit: this declaration can be omitted entirely if you move the implementation of this function above in an anonymous namespace, right?

getStreamState);

// Synchronize the stream to ensure that the encoded image is ready
cudaError_t syncState = cudaStreamSynchronize(stream);
Contributor

I don't know the answer to this question and I am curious if you know -- is there a way to just do a single streamSynchronize per batch instead of per image? That way we can pipeline some work for some extra speedup when handling a batch of images.
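The pipelining idea the reviewer raises can be shown schematically: queue all asynchronous encode launches first, then synchronize the stream once per batch, so the number of synchronization points drops from N to 1. This is a plain-Python illustration of the control flow only; `launch_encode` and `stream_synchronize` are stand-ins for asynchronous nvjpeg encode launches and `cudaStreamSynchronize`, not real bindings.

```python
sync_calls = 0

def launch_encode(image):
    # Stand-in for an asynchronous per-image encode launch on a stream;
    # it returns immediately without waiting for completion.
    return f"pending({image})"

def stream_synchronize():
    # Stand-in for cudaStreamSynchronize: wait for all queued work.
    global sync_calls
    sync_calls += 1

images = ["img0", "img1", "img2"]

# Per-batch pattern: queue all encodes, then synchronize exactly once,
# instead of synchronizing after each image.
pending = [launch_encode(img) for img in images]
stream_synchronize()

print(sync_calls)
```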

size_t length;
nvjpegStatus_t getStreamState = nvjpegEncodeRetrieveBitstreamDevice(
nvjpeg_handle, nv_enc_state, NULL, &length, stream);
TORCH_CHECK(
Contributor

Nit: maybe CHECK for the length > 0?

4 participants