
Optimization potential #138

Open · BenBE opened this issue Jan 24, 2022 · 12 comments

Labels: chore (Things that need to happen but don't change the behaviour), experimental (Things we want to play with outside the stable build), help wanted (Extra attention is needed)

@BenBE (Collaborator) commented Jan 24, 2022

Doing some rudimentary profiling, I noticed we are wasting a huge amount of main-thread time in essentially three functions:

  1. alpha_blend (~40%)
  2. convert_rgb_to_yuyv (~25%)
  3. cap.retrieve (~20%)

Values are relative to the time spent in main; main itself takes up roughly the same share of total runtime as the actual image processing (main ~32.75%, bs_maskgen_process ~33.95% of total runtime).

On the positive side: bs_maskgen_process spends ~95% of its time waiting for TFLite to finish processing.

FWIW: the timings are heavily skewed by running under callgrind, but looking at the code of alpha_blend and convert_rgb_to_yuyv I'm not really surprised by these results. A rewrite of these functions using some vectorization should yield quite a bit of improvement.
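
For context, a hedged reconstruction of the kind of scalar per-pixel loop being discussed (function name and pixel layout are assumptions, not the actual backscrub code): one multiply-add per colour byte, which the compiler may or may not auto-vectorize.

#include <cstdint>
#include <cstddef>

// Hypothetical sketch: blend 8-bit 3-channel frames with an 8-bit mask,
// one colour byte at a time (exact /255, no fixed-point shortcut).
void alpha_blend_scalar(const uint8_t *fg, const uint8_t *bg,
                        const uint8_t *mask, uint8_t *dst, size_t npixels) {
    for (size_t i = 0; i < npixels; i++) {
        uint8_t a = mask[i];
        for (int c = 0; c < 3; c++) {
            size_t j = 3 * i + c;
            dst[j] = (uint8_t)((fg[j] * a + bg[j] * (255 - a)) / 255);
        }
    }
}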

@BenBE added the help wanted, experimental, and chore labels on Jan 24, 2022
@floe (Owner) commented Jan 25, 2022

Ah, that does sound like SSE and friends could do a good job speeding things up. I guess the most portable way to do that is using gcc intrinsics?

@BenBE (Collaborator, Author) commented Jan 25, 2022

Either GCC intrinsics (as portable as those are), or we look into whether OpenCV/XNN already offers optimized versions of those two methods that we could use.

@phlash (Collaborator) commented Jan 26, 2022

I'm aware that those were written because OpenCV doesn't provide them... but happy to be proven wrong if we raise the minimum OCV version 😄

Found this Q&A: https://answers.opencv.org/question/211941/how-to-use-simd-feature-of-opencv-as-third-party-library/

I'm also aware that OCV will try to use OpenCL and the GPU if they are available (this collided with TFLite when used in multiple threads!)... more research needed!

[edited to add] Found this, which looks neat for alpha_blend: https://stackoverflow.com/questions/13128696/which-is-the-most-efficient-way-to-do-alpha-mask-in-opencv

There is also libvolk for portable SIMD kernels (as used by GNU Radio and others), which is a standard package on many Linuxen.
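
For reference, the approach from that Stack Overflow thread boils down to doing the lerp with float matrices; a minimal sketch (variable names assumed, not backscrub's actual code):

#include <opencv2/imgproc.hpp>

// Hedged sketch of the SO-style float blend: fg/bg are CV_8UC3 frames,
// mask is a CV_8UC1 alpha plane; all names are hypothetical.
cv::Mat blend_float(const cv::Mat &fg, const cv::Mat &bg, const cv::Mat &mask) {
    cv::Mat fgF, bgF, aF, a3, out, outU8;
    fg.convertTo(fgF, CV_32FC3, 1.0 / 255);
    bg.convertTo(bgF, CV_32FC3, 1.0 / 255);
    mask.convertTo(aF, CV_32F, 1.0 / 255);
    cv::cvtColor(aF, a3, cv::COLOR_GRAY2BGR); // one weight per channel
    cv::Mat ia3 = cv::Scalar::all(1.0) - a3;  // the anti-mask
    out = fgF.mul(a3) + bgF.mul(ia3);
    out.convertTo(outU8, CV_8UC3, 255.0);
    return outU8;
}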

@BenBE (Collaborator, Author) commented Jan 26, 2022

I don't have any objections to using a library for those particular vectorized kernels, as doing those optimizations ourselves would be more work than is sensible.

We should also check whether other places in the current code base (e.g. the conv_bias stuff) could benefit from using such a library.

@martok commented Jan 29, 2022

Just talked with @BenBE; time for my regularly scheduled one-off maybe-useful comment :)

alpha_blend (~40%)

OpenCV has a well-hidden function for that: blendLinear. The only difficulty is that it requires both the mask and the anti-mask, as float.
So the call would look like cv::blendLinear(bg, raw, mask, Scalar::all(1.0)-mask, raw).

I don't actually know if this will be much faster, but they do have vectorised and OCL implementations.
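
Spelled out with the float conversion included, a sketch of that call might look like this (assuming bg/raw are CV_8UC3 frames and mask is a CV_8UC1 alpha plane; names hypothetical):

#include <opencv2/imgproc.hpp>

// blendLinear takes per-pixel CV_32F single-channel weights, same size as
// the images, and computes a per-pixel weighted average of src1 and src2.
void blend_with_blendLinear(const cv::Mat &bg, cv::Mat &raw, const cv::Mat &mask) {
    cv::Mat maskF, imaskF;
    mask.convertTo(maskF, CV_32F, 1.0 / 255);     // uint8_t mask -> [0,1] float
    imaskF = cv::Scalar::all(1.0) - maskF;        // the anti-mask
    cv::blendLinear(bg, raw, maskF, imaskF, raw); // blend in place into raw
}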

convert_rgb_to_yuyv (~25%)

Really surprised that there is no convert_rgb_to_yuyv equivalent in cv, but it seems that unrolling the loop a few times, in a way that explicitly looks like interleaved moves, could be enough to get the compiler to generate the vectorised version without manual trickery.
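
To illustrate: YUYV packs two pixels into four bytes (Y0 U Y1 V), so the natural unroll unit is a pixel pair. A sketch with fixed-point BT.601 coefficients (hypothetical signature and packed-RGB byte order, not the actual backscrub function):

#include <cstdint>
#include <cstddef>

// Convert packed 24-bit RGB to YUYV (YUY2), two pixels per iteration.
// Integer BT.601 coefficients scaled by 256. Real implementations often take
// the chroma of just the first pixel; averaging the pair is one option.
void rgb_to_yuyv(const uint8_t *rgb, uint8_t *yuyv, size_t npixels) {
    for (size_t i = 0; i + 1 < npixels; i += 2) {
        int r0 = rgb[0], g0 = rgb[1], b0 = rgb[2];
        int r1 = rgb[3], g1 = rgb[4], b1 = rgb[5];
        int y0 = ( 77 * r0 + 150 * g0 +  29 * b0) >> 8;
        int y1 = ( 77 * r1 + 150 * g1 +  29 * b1) >> 8;
        int r = (r0 + r1) / 2, g = (g0 + g1) / 2, b = (b0 + b1) / 2;
        int u = ((-43 * r -  85 * g + 128 * b) >> 8) + 128;
        int v = ((128 * r - 107 * g -  21 * b) >> 8) + 128;
        yuyv[0] = (uint8_t)y0;
        yuyv[1] = (uint8_t)u;
        yuyv[2] = (uint8_t)y1;
        yuyv[3] = (uint8_t)v;
        rgb += 6;
        yuyv += 4;
    }
}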

@phlash (Collaborator) commented Jan 29, 2022

OK! Thanks for the pointer towards cv::blendLinear. I've just tried it, and the performance gain seems marginal (~0.8x the previous run time). Digging in, I find that ~1/3 of the time goes into converting/inverting the mask and the rest into blending, so it seems blendLinear is not implicitly using SIMD/OCL? Going back to a local loop in a separate source file and playing with GCC intrinsics...

@BenBE (Collaborator, Author) commented Jan 29, 2022

If the initial conversion from uint8_t to float takes that much time, there is a good chance we can remove it when generating the mask in the first place: the output from TFLite is already float, so we currently spend time creating the uint8_t version and could simply skip that step …
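
A sketch of what that could look like (interpreter, height, and width are assumptions about the surrounding code, not actual backscrub identifiers):

// Hypothetical: reuse the model's float output directly as blend weights,
// skipping the float -> uint8_t -> float round trip entirely.
float *out = interpreter->typed_output_tensor<float>(0); // TFLite float mask
cv::Mat maskF(height, width, CV_32F, out); // header only, shares the buffer
cv::Mat imaskF = cv::Scalar::all(1.0) - maskF;
cv::blendLinear(bg, raw, maskF, imaskF, raw);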

@phlash (Collaborator) commented Jan 30, 2022

Yep - once we're down to shaving off a few %, that's worth it - looking for a 10x improvement through SIMD first 😄

@phlash (Collaborator) commented Jan 30, 2022

So: I tried MMX, loading a pixel at a time (3 channels) into an __m64, executing the multiplies, adds & a shift to divide down, then extracting the pixel back to memory - at least 5x slower than a loop 😞

I went back to the original loop code, separated it into its own source file and applied -march=native -O9: 3-4x faster 😄 This will do for me - it's easier than SIMD, portable, and guaranteed not to be worse than what we have?
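
For what it's worth, a GCC-specific alternative sketch that keeps the hot loop in the main translation unit and pins flags per function instead of per file (function name and the one-weight-per-byte alpha layout are hypothetical; target("avx2") fixes the ISA, unlike -march=native, which picks the build host's):

#include <cstdint>
#include <cstddef>

// GCC attributes: compile just this function at -O3 with AVX2 enabled, so the
// rest of the project stays portable. 'alpha' holds one weight per colour byte.
__attribute__((optimize("O3"), target("avx2")))
void blend_bytes(const uint8_t *fg, const uint8_t *bg,
                 const uint8_t *alpha, uint8_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((fg[i] * alpha[i] + bg[i] * (255 - alpha[i])) >> 8);
}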

@BenBE (Collaborator, Author) commented Jan 30, 2022

Doing some proper calculations, it's quite obvious from the hardware architecture alone that loading 3 bytes at a time into one MMX register for processing is not gonna cut it: you get tons of unaligned memory accesses and half the pipeline usually sits unoccupied. Given the size restrictions introduced by MJPG from the capture device, we know we'll always have a multiple of 16 pixels*, so we could unroll the per-pixel operation 16x and process 48 bytes at a time.

If we then use some SSE magic, we get an inner kernel that could perform this blending operation with just 7 registers** needed.

// Setup:
uint8_t *a, *b; // Source images
uint8_t *c;     // Mask image
uint8_t *d;     // Destination image

// Kernel (advance by a+=48; b+=48; c+=16; d+=48; per iteration; prefetch linear)
xmm0 = *(a + 0);
xmm1 = *(a + 16);
xmm2 = *(a + 32);
separateRGB(xmm0, xmm1, xmm2 --(xmm6)--> xmm0, xmm1, xmm2);

xmm3 = *(b + 0);
xmm4 = *(b + 16);
xmm5 = *(b + 32);
separateRGB(xmm3, xmm4, xmm5 --(xmm6)--> xmm3, xmm4, xmm5);

xmm6 = *(c + 0); // ***

xmm0 = xmm0 * xmm6 >> 8;
xmm1 = xmm1 * xmm6 >> 8;
xmm2 = xmm2 * xmm6 >> 8;

xmm6 = ~xmm6;

xmm3 = xmm3 * xmm6 >> 8;
xmm4 = xmm4 * xmm6 >> 8;
xmm5 = xmm5 * xmm6 >> 8;

xmm0 += xmm3;
xmm1 += xmm4;
xmm2 += xmm5;

mergeRGB(xmm0, xmm1, xmm2 --(xmm6)--> xmm0, xmm1, xmm2);
*(d + 0) = xmm0;
*(d + 16) = xmm1;
*(d + 32) = xmm2;

Proper pipelining of the instructions inside the kernel loop should allow the actual memory fetches and register assignments to run with minimal congestion.

* Actually, we get multiples of 64 pixels, as width and height of the image are both restricted to multiples of 8. We currently don't enforce this, but the backing camera device will, given we mostly use MJPG or related codecs for video input.

** Technically we have 8 SSE registers available, but with clever ordering of things you might be able to get down to only 7 to hold all the (intermediate) data.

*** If you happen to get this move scheduled before loading the second image, you could start linear-blending the first source image while the data for the second one is still being loaded.

@phlash (Collaborator) commented Feb 1, 2022

Sounds like you know waay more than I ever will, @BenBE 😄 I've never looked into SIMD before now, so my feeble attempt was cut/paste/fixup of an example MMX alpha-blend function from Stack Overflow.

My current PR #139 reduces the blend time to just under 1 ms for the majority of executions (I'm simply using the execution timing built into the app). I would be delighted to see SIMD code based on the above analysis; then we can decide whether the additional complexity is worth the time saved?

@BenBE (Collaborator, Author) commented Feb 1, 2022

Just noticed you could even skip the channel reordering, because the pixel scaling is uniform across all channels, which makes the kernel even simpler …

Will have to look into the actual GCC intrinsics stuff, though.
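
A minimal sketch of what that simpler kernel could look like with SSE2 intrinsics, assuming the 8-bit mask has already been expanded to one alpha byte per colour byte (so no separateRGB/mergeRGB step); this is an illustration, not tested backscrub code:

#include <emmintrin.h> // SSE2
#include <cstdint>

// Blend 16 colour bytes at once: dst = (fg*a + bg*(255-a)) >> 8.
// Assumes 'alpha' already holds one weight byte per colour byte
// (each mask byte replicated 3x); all names are hypothetical.
static inline void blend16(const uint8_t *fg, const uint8_t *bg,
                           const uint8_t *alpha, uint8_t *dst) {
    const __m128i zero = _mm_setzero_si128();
    const __m128i full = _mm_set1_epi16(255);
    __m128i vf = _mm_loadu_si128((const __m128i *)fg);
    __m128i vb = _mm_loadu_si128((const __m128i *)bg);
    __m128i va = _mm_loadu_si128((const __m128i *)alpha);
    // Widen bytes to 16-bit lanes so the multiplies don't overflow.
    __m128i flo = _mm_unpacklo_epi8(vf, zero), fhi = _mm_unpackhi_epi8(vf, zero);
    __m128i blo = _mm_unpacklo_epi8(vb, zero), bhi = _mm_unpackhi_epi8(vb, zero);
    __m128i alo = _mm_unpacklo_epi8(va, zero), ahi = _mm_unpackhi_epi8(va, zero);
    __m128i nlo = _mm_sub_epi16(full, alo),    nhi = _mm_sub_epi16(full, ahi);
    __m128i rlo = _mm_add_epi16(_mm_mullo_epi16(flo, alo),
                                _mm_mullo_epi16(blo, nlo));
    __m128i rhi = _mm_add_epi16(_mm_mullo_epi16(fhi, ahi),
                                _mm_mullo_epi16(bhi, nhi));
    // Divide by 256 (cheap approximation of /255) and repack to bytes.
    rlo = _mm_srli_epi16(rlo, 8);
    rhi = _mm_srli_epi16(rhi, 8);
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(rlo, rhi));
}

Called over the frame in 16-byte steps; the mask expansion (one byte per pixel to three bytes per pixel) could be a single pshufb per 16 mask bytes, or folded into mask generation as discussed above.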
