-
-
Notifications
You must be signed in to change notification settings - Fork 56.3k
imgproc: add optimized warpAffine kernels for 8U/16U/32F + C1/C3/C4 inputs #25984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a3f2942 to
bb00632
Compare
4a641e1 to
3fb43dd
Compare
This comment was marked as resolved.
This comment was marked as resolved.
0af05d4 to
7abaafe
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
Decided to drop |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
|
Not sure why OCL tests failed since this PR should touch nothing about OCL. Example: https://github.com/opencv/opencv/actions/runs/10557877242/job/29246289055?pr=25984 |
This comment was marked as resolved.
This comment was marked as resolved.
| #if CV_NEON_AARCH64 | ||
| return v_int32x4(vcvtmq_s32_f32(a.val)); | ||
| #else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instruction is available on aarch32 too https://developer.arm.com/architectures/instruction-sets/intrinsics/vcvtmq_s32_f32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was struggled at armv7 support. There is no existing macros indicating if target supports only armv7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is manual "CAROTENE_NEON_ARCH" option. Looks like we need build check or handle march option correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need build check or handle march option correctly.
That sounds like a dedicated PR. How about guard it (as well as the above one) with CV_NEON_AARCH64 for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use __ARM_ARCH > 7 check. I have armv7 board and test it regularly.
|
I propose to backport TS related changes and rounding/floor optimization to 4.x to reduce merge conflicts and bring the global optimization too. |
|
@fengyuentau I benchmarked your code (couple of commits ago). See details in archive. There are single thread benchmarks also. The patch looks good for ARM, but there are some stable regressions on x86. Please take a look. |
2e30217 to
abdaeab
Compare
@asmorkalov Performance degradation happens on |
|
@vpisarev @asmorkalov I updated code with the use of algo hint. |
|
@asmorkalov @vpisarev Could you review this PR? I already proceeded with this branch and implemented the new warpPerspective kernels. Want to create a new PR for that. |
| const Size srcSize = get<0>(params); | ||
| const int type = get<1>(params), interpolation = get<2>(params); | ||
| const double eps = CV_MAT_DEPTH(type) <= CV_32S ? 1 : interpolation == INTER_CUBIC ? 2e-3 : 1e-4; | ||
| const double eps = CV_MAT_DEPTH(type) <= CV_32S ? 2 : interpolation == INTER_CUBIC ? 2e-3 : 3e-2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the change does not make sense as soon as you introduced AlgorithmHint. I reverted the change and no not see regressions with Intel OpenCL (iGPU) and NVIDIA GF 1080.
| { | ||
| for (int j = 0; j < test_loop_times; j++) | ||
| { | ||
| double eps = depth < CV_32F ? 0.04 : 0.06; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to convert it to conditions with explicit types. We have fp16, int64, bool that comes after fp64. Also it's hard to understand the condition.
| { | ||
| for (int j = 0; j < test_loop_times; j++) | ||
| { | ||
| double eps = depth < CV_32F ? 0.04 : 0.06; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same proposal here.
slightly alter threshold for warpAffine optimization #3787 Merge with opencv/opencv#25984 New`onfusionMatrixes[1]` is ``` [[45 0 0 0 0 0 0 0 0 0] [ 0 57 0 0 0 0 0 0 0 0] [ 0 0 58 2 0 0 0 0 1 0] [ 0 0 0 43 0 0 0 1 0 0] [ 0 0 0 0 39 0 0 0 0 1] [ 0 0 0 1 0 49 0 0 1 0] [ 0 0 0 0 0 0 52 0 0 0] [ 0 0 1 0 0 0 0 54 0 0] [ 0 0 0 0 0 0 0 0 47 0] [ 0 1 0 1 0 0 0 0 2 44]] ``` which is about of pixel value 1 shift in each 4x or 5x pixel value. ### Pull Request Readiness Checklist See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request - [x] I agree to contribute to the project under Apache 2 License. - [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV - [ ] The PR is proposed to the proper branch - [ ] There is a reference to the original bug report and related work - [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable Patch to opencv_extra has the same branch name. - [ ] The feature is well documented and sample code can be built with the project CMake
Merge wtih opencv/opencv_extra#1198.
Merge with opencv/opencv_contrib#3787.
Perf:
M2 (16GB ram, with fp16 vector intrinsics support)
Intel i7-12700K
Khadas VIM3 (A311D, 4xA73+2xA53, no fp16 vector intrinsics support)
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.