Faster implementation of torchvision.ops.boxes::masks_to_boxes #8194
Conversation
Thanks for the PR @atharvas. The test failure in https://github.com/pytorch/vision/actions/runs/7394640746/job/20496609120?pr=8194 seems relevant. But beyond the correctness issue, I am wondering whether such a change would bring a critical improvement. Saving 1MB isn't really a problem, and the time performance benefits are unclear when the batch size is reasonable (i.e. < 1024?). Considering the proposed code is a lot more complex than the previous one, we should only be comfortable merging this PR if it removes a known bottleneck. Did you find that `masks_to_boxes` was a bottleneck in your use case?
The X and Y dimensions were flipped.
Hi! Thanks for the update. I just fixed the error after reproducing it on a Linux box. Sorry about that!
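For context, a minimal sketch of the convention involved (a hypothetical example, not the PR's test code): `masks_to_boxes` returns boxes in `(x1, y1, x2, y2)` order, where x indexes the width (last) dimension and y indexes the height dimension, so swapping them only goes unnoticed on square masks.

```python
import torch
from torchvision.ops import masks_to_boxes

# A single non-square mask: one foreground pixel at row (y) 2, column (x) 5.
mask = torch.zeros(1, 4, 8, dtype=torch.bool)  # (N, H, W)
mask[0, 2, 5] = True

# Expected box in (x1, y1, x2, y2) order: [5, 2, 5, 2].
# A flipped implementation would return [2, 5, 2, 5] and fail whenever H != W.
print(masks_to_boxes(mask))  # tensor([[5., 2., 5., 2.]])
```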
About whether this is a critical improvement: for my use case, yes; however, I'm not sure for the community at large. My codebase is private right now, but here's an example of this bottleneck "in the wild": this function processes a batch of […]. Inside this function, […]. However, because […], the apparent batch size is actually […].
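A hypothetical sketch of the kind of call pattern described above (the names and shapes here are illustrative assumptions, not the private codebase): when per-object masks for a whole batch are flattened into one call, the apparent batch size seen by `masks_to_boxes` is `batch_size * objects_per_image`, and the per-mask Python loop inside the current implementation dominates the step time.

```python
import torch
from torchvision.ops import masks_to_boxes

batch_size, objects_per_image, H, W = 64, 32, 64, 64  # illustrative values

# Per-object masks for a whole batch, flattened into one call:
# the "apparent" batch size handed to masks_to_boxes is 64 * 32 = 2048 masks.
masks = torch.zeros(batch_size * objects_per_image, H, W, dtype=torch.bool)
masks[:, 16:48, 8:40] = True  # dummy foreground regions

boxes = masks_to_boxes(masks)  # current implementation loops over all 2048 masks in Python
print(boxes.shape)  # torch.Size([2048, 4])
```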
Thank you for the details @atharvas. That SGTM. Before merging, do you mind running the benchmark on CUDA as well to make sure there's no slow-down for GPUs? If that were the case, we could just have 2 paths (one for CPU, one for GPU).
Hi! The second graph is interesting! It looks like the new implementation uses up more VRAM (in line with @pmeier's comment in #8184). I did an analysis of the GPU allocations using […]. @NicolasHug, do you think this warrants a separate pathway for CPU and GPU?
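For reference, a minimal sketch of how peak GPU memory can be compared between the two code paths (assuming a CUDA device is available; `masks_to_boxes_new` is a placeholder name for the proposed implementation, and this is not the PR's actual benchmark script):

```python
import torch
from torchvision.ops import masks_to_boxes

def peak_cuda_memory_mib(fn, *args):
    """Run fn once and report the peak CUDA allocation in MiB."""
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

masks = torch.zeros(2048, 64, 64, dtype=torch.bool, device="cuda")
masks[:, 16:48, 8:40] = True

print("baseline:", peak_cuda_memory_mib(masks_to_boxes, masks), "MiB")
# print("proposed:", peak_cuda_memory_mib(masks_to_boxes_new, masks), "MiB")  # placeholder name
```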
Quoting #8184
🚀 The feature
`torchvision.ops.boxes::masks_to_boxes` is used to convert a batch of binary 2D image masks to a set of bounding boxes. Essentially, for a mask of shape $(B, 64, 64)$, `masks_to_boxes` allocates a tensor of shape $(B, 4)$ and calculates the bounding box for each element of the batch sequentially. This is $O(B)$ storage and $O(B)$ runtime and the simplest implementation possible. This proposal pertains to creating a faster and more general version of this function.
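As a rough illustration of the direction being proposed (a sketch only, not necessarily the exact implementation in this PR), the per-mask Python loop can be replaced with batched reductions so all boxes are computed in a handful of tensor ops:

```python
import torch

def masks_to_boxes_vectorized(masks: torch.Tensor) -> torch.Tensor:
    """Compute (x1, y1, x2, y2) boxes for an (N, H, W) batch of boolean masks
    without a Python loop. Sketch; assumes every mask has at least one pixel."""
    n, h, w = masks.shape
    xs = torch.arange(w, device=masks.device)
    ys = torch.arange(h, device=masks.device)

    any_x = masks.any(dim=1)  # (N, W): which columns contain foreground
    any_y = masks.any(dim=2)  # (N, H): which rows contain foreground

    # Replace empty positions with sentinels so min/max ignore them.
    x1 = torch.where(any_x, xs, xs.new_full((), w)).amin(dim=1)
    x2 = torch.where(any_x, xs, xs.new_full((), -1)).amax(dim=1)
    y1 = torch.where(any_y, ys, ys.new_full((), h)).amin(dim=1)
    y2 = torch.where(any_y, ys, ys.new_full((), -1)).amax(dim=1)

    return torch.stack([x1, y1, x2, y2], dim=1).to(torch.float)
```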
Profiling
Some primitive performance benchmarking validates this hypothesis. Profiling code is here. Profiling was done on an Apple M2 Pro.
Memory Profiling
Speed Profiling
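The original profiling script link is not preserved here, but a minimal timing comparison along the same lines could look like this (a sketch using `torch.utils.benchmark`; the shapes are illustrative, and the proposed implementation would be timed the same way for comparison):

```python
import torch
from torch.utils import benchmark
from torchvision.ops import masks_to_boxes

masks = torch.zeros(2048, 64, 64, dtype=torch.bool)
masks[:, 16:48, 8:40] = True

timer = benchmark.Timer(
    stmt="masks_to_boxes(masks)",
    globals={"masks_to_boxes": masks_to_boxes, "masks": masks},
    label="masks_to_boxes, N=2048, 64x64 masks",
)
print(timer.timeit(100))
# Time the proposed implementation with the same Timer setup and compare the medians.
```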
Correctness
The behavior of the function is unchanged. There is a single test case for testing `masks_to_boxes` (`test.test_ops:test_masks_box`). The new implementation passes this test case.
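Beyond `test_masks_box`, a quick equivalence check against the current implementation is easy to run locally (a sketch; `masks_to_boxes_vectorized` is the illustrative function from above, standing in for the PR's implementation):

```python
import torch
from torchvision.ops import masks_to_boxes

# Random non-empty masks of a non-square size, to also exercise the X/Y ordering.
masks = torch.rand(128, 48, 96) > 0.5
masks[:, 0, 0] = True  # guarantee every mask has at least one foreground pixel

expected = masks_to_boxes(masks)
actual = masks_to_boxes_vectorized(masks)  # illustrative vectorized version from above
torch.testing.assert_close(actual, expected)
```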