
Running final binary as sudo #41

Open
alexzk1 opened this issue Jul 5, 2018 · 82 comments


alexzk1 commented Jul 5, 2018

...and getting

[ INFO:0] Initialize OpenCL runtime...
[ INFO:0] Successfully initialized OpenCL cache directory: /root/.cache/opencv/3.4.1/opencl_cache/
[ INFO:0] Preparing OpenCL cache configuration for context: 32-bit--Broadcom--VideoCore_IV_GPU--0_4
OpenCV(3.4.1) Error: Unknown error code -220 (OpenCL error CL_INVALID_VALUE (-30) during call: clGetProgramInfo(handle, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, NULL)) in getProgramBinary, file /build/opencv/src/opencv-3.4.1/modules/core/src/ocl.cpp, line 3752
[ WARN:0] Can't save OpenCL binary into cache: /root/.cache/opencv/3.4.1/opencl_cache/32-bit--Broadcom--VideoCore_IV_GPU--0_4/imgproc--filterSepRow_e99b92fca8604fe253f3c641802ce117.bin
OpenCV(3.4.1) /build/opencv/src/opencv-3.4.1/modules/core/src/ocl.cpp:3752: error: (-220) OpenCL error CL_INVALID_VALUE (-30) during call: clGetProgramInfo(handle, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, NULL) in function getProgramBinary

OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54) during call: clEnqueueNDRangeKernel('row_filter_C1_D0', dims=2, globalsize=240x544x1, localsize=16x16x1) sync=false
OpenCV(3.4.1) Error: Unknown error code -220 (OpenCL error CL_INVALID_VALUE (-30) during call: clGetProgramInfo(handle, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, NULL)) in getProgramBinary, file /build/opencv/src/opencv-3.4.1/modules/core/src/ocl.cpp, line 3752
[ WARN:0] Can't save OpenCL binary into cache: /root/.cache/opencv/3.4.1/opencl_cache/32-bit--Broadcom--VideoCore_IV_GPU--0_4/imgproc--filterSepRow_e99b92fca8604fe253f3c641802ce117.bin
OpenCV(3.4.1) /build/opencv/src/opencv-3.4.1/modules/core/src/ocl.cpp:3752: error: (-220) OpenCL error CL_INVALID_VALUE (-30) during call: clGetProgramInfo(handle, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, NULL) in function getProgramBinary

OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54) during call: clEnqueueNDRangeKernel('row_filter_C1_D0', dims=2, globalsize=240x544x1, localsize=16x16x1) sync=false
FPS 0.0205593, Objects: 0
OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54) during call: clEnqueueNDRangeKernel('row_filter_C1_D0', dims=2, globalsize=240x544x1, localsize=16x16x1) sync=false
OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54) during call: clEnqueueNDRangeKernel('row_filter_C1_D0', dims=2, globalsize=240x544x1, localsize=16x16x1) sync=false

What is this a problem of? Unsupported features, permissions, OpenCL? Just in case I ran

sudo chmod 777 /root/.cache/opencv/3.4.1/opencl_cache/32-bit--Broadcom--VideoCore_IV_GPU--0_4

and it didn't help.


doe300 commented Jul 5, 2018

You get two different error codes:

OpenCL error CL_INVALID_WORK_GROUP_SIZE (-54)

OpenCV seems unable to handle small work-group sizes, or at least assumes a lower bound larger than the maximum work-group size VC4CL supports.
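The log above shows OpenCV requesting localsize=16x16 (256 work-items), while VC4CL's maximum work-group size is only 12 (one work-item per QPU). A host-side workaround is to choose a local size that divides the global size and stays within the device limit; a minimal plain-C sketch, assuming the limit has been queried via CL_DEVICE_MAX_WORK_GROUP_SIZE (the value 12 used below is the VC4CL figure, for illustration only):

```c
#include <stddef.h>

/* Pick the largest 1D local size that divides the global size and does
 * not exceed the device's maximum work-group size. */
static size_t pick_local_size(size_t global, size_t device_max)
{
    for (size_t local = device_max; local > 1; --local)
        if (global % local == 0)
            return local;
    return 1; /* a local size of 1 is always valid */
}
```

For the globalsize=240x544 from the log, this would suggest something like 12x8 per dimension instead of the 16x16 OpenCV picked.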

Error: Unknown error code -220

This seems to mean that the call to clGetProgramInfo somehow failed with both the error code CL_INVALID_VALUE (-30) and an unknown error code -220.


alexzk1 commented Jul 7, 2018

It fails here: Mailbox::executeQPU on sending the ioctl; everything hangs, with:

[VC4CL] Running work-group 0, 0, 0
[VC4CL] Mailbox buffer before:
[VC4CL] 0000: 0x00000028
[VC4CL] 0004: 0x00000000
[VC4CL] 0008: 0x00030011
[VC4CL] 000c: 0x00000010
[VC4CL] 0010: 0x00000010
[VC4CL] 0014: 0x0000000c
[VC4CL] 0018: 0x5e10df58
[VC4CL] 001c: 0x00000000
[VC4CL] 0020: 0x00007530
[VC4CL] 0024: 0x751687cc

Code is:

R"CLC(
__kernel void TestMagic(const int total, const int is_deeper_magic, const float alpha_s, const float fore_th, __global const float* gradx, __global const float* grady,
                        //in/out
                        __global float* BSx, __global float* BSy, __global int* mapRes)
{
    private const size_t i        = get_global_id(0);
    private const size_t gpu_used = get_global_size(0);

    private const size_t elements_count = total / (gpu_used * 16);
    private const size_t offset = i * total / gpu_used;

    for (size_t k = 0; k < elements_count; ++k)
    {
        int16 mr         = vload16(k, mapRes + offset);
        const int16 twos = 2;
        mr += twos;
        vstore16(mr, k, mapRes + offset);
    }
}
)CLC",
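The index arithmetic of this kernel (which later in the thread turns out to be miscompiled) can be cross-checked on the host; a small plain-C reference sketch, with example numbers chosen only for illustration:

```c
/* Host-side reference for the kernel's index math: each of `gpu_used`
 * work-items handles total/gpu_used floats, processed as
 * total/(gpu_used*16) chunks of int16. Assumes `total` is divisible
 * by gpu_used * 16, as the kernel itself does. */
static int elements_count(int total, int gpu_used)
{
    return total / (gpu_used * 16);
}

static int chunk_offset(int total, int gpu_used, int i)
{
    return i * total / gpu_used;
}
```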

During compilation:

[VC4CL] Precompiling source with:
[VC4CL] Precompilation complete with status: 0
[VC4CL] Compilation log: [W] Sat Jul 7 03:04:50 2018: Warnings in precompilation:
[W] Sat Jul 7 03:04:50 2018: :90:9: warning: null character ignored
<U+0000>
^
1 warning generated.

Any ideas? The QPULib examples work... (https://github.com/mn416/QPULib)


alexzk1 commented Jul 7, 2018

[photo attachment: img_2906]


doe300 commented Jul 7, 2018

Sometimes, if the code generated for a kernel does something completely wrong, the QPUs get into a hung state. That looks like what is happening here.


alexzk1 commented Jul 7, 2018

And how do I deal with that? Is it a compiler problem, or my code?


doe300 commented Jul 7, 2018

Most likely a compiler problem. I found an issue which also occurs in the kernel given, and will try to fix it.

@doe300 doe300 self-assigned this Jul 7, 2018
@doe300 doe300 added the bug label Jul 7, 2018

alexzk1 commented Jul 7, 2018

What exactly is the issue? Maybe I could change my kernel to work around it.


alexzk1 commented Jul 9, 2018

This commit does not fix the issue, it still hangs:
doe300/VC4C@f13120d

But confirmed - it is the division that is wrong; this modified kernel works like a charm:

__kernel void TestMagic(const int total, const int is_deeper_magic, const float alpha_s, const float fore_th, __global const float* gradx, __global const float* grady,
                        //in/out
                        __global float* BSx, __global float* BSy, __global int* mapRes)
{
    private const size_t i        = get_global_id(0);
    private const size_t gpu_used = get_global_size(0);

    private const size_t elements_count = 1; //total / (gpu_used * 16);
    private const size_t offset = i * total; // / gpu_used;

    for (size_t k = 0; k < elements_count; ++k)
    {
        int16 mr         = vload16(k, mapRes + offset);
        const int16 twos = 2;
        mr += twos;
        vstore16(mr, k, mapRes + offset);
    }
}


doe300 commented Jul 9, 2018

Yeah, the integer division (and modulo) is still wrong. I could not figure out the reason for this yesterday.


alexzk1 commented Jul 9, 2018

Does it mean float division will work? I need that for atan2, which contains the same integer division (for the offset) and y/x inside atan2... Btw, in your source there is some comment there - maybe you do x/y?


doe300 commented Jul 9, 2018

Floating point division works, at least for the tests I ran, though with worse-than-allowed accuracy in some cases. I don't know if the atan2 function works.


alexzk1 commented Jul 9, 2018

Yes, that's weird. OpenCL on NVIDIA gives slightly different results than opencv-default (CPU?) does.
The 1st dump is opencv-default, the 2nd is the opencl-nvidia kernel:

DUMP:
x = 0; y = 2; a = 1.5708
x = -4; y = 4; a = 2.35636
x = 6; y = 8; a = 0.927403
x = 21; y = 13; a = 0.554224
x = 18; y = 16; a = 0.726774
x = 12; y = 6; a = 0.463683
x = 15; y = -13; a = 5.569
x = 8; y = -22; a = 5.06123
x = -9; y = -11; a = 4.02667
x = -11; y = 11; a = 2.35636
x = -2; y = 18; a = 1.68144
x = 4; y = 4; a = 0.785232
x = 4; y = -12; a = 5.0342
x = 7; y = -13; a = 5.20633
x = 8; y = 0; a = 0
x = 0; y = 10; a = 1.5708
DUMP:
x = 0; y = 2; a = 1.5708
x = -4; y = 4; a = 2.35619
x = 6; y = 8; a = 0.927295
x = 21; y = 13; a = 0.554308
x = 18; y = 16; a = 0.726642
x = 12; y = 6; a = 0.463648
x = 15; y = -13; a = 5.56909
x = 8; y = -22; a = 5.06116
x = -9; y = -11; a = 4.02666
x = -11; y = 11; a = 2.35619
x = -2; y = 18; a = 1.68145
x = 4; y = 4; a = 0.785398
x = 4; y = -12; a = 5.03414
x = 7; y = -13; a = 5.20633
x = 8; y = 0; a = 0
x = 0; y = 10; a = 1.5708

Kernel is

private float16 x = vload16(k, gradx + offset);
private float16 y = vload16(k, grady + offset);
float16 a = atan2(y, x);
a = select(a, a + 2 * (float)M_PI, a < 0);
vstore16(a, k, radians + offset);

On RPI something is broken completely:

DUMP:
x = 0; y = 2; a = 1.5708
x = -4; y = 4; a = 2.35636
x = 6; y = 8; a = 0.927403
x = 21; y = 13; a = 0.554224
x = 18; y = 16; a = 0.726774
x = 12; y = 6; a = 0.463683
x = 15; y = -13; a = 5.569
x = 8; y = -22; a = 5.06123
x = -9; y = -11; a = 4.02667
x = -11; y = 11; a = 2.35636
x = -2; y = 18; a = 1.68144
x = 4; y = 4; a = 0.785232
x = 4; y = -12; a = 5.0342
x = 7; y = -13; a = 5.20633
x = 8; y = 0; a = 0
x = 0; y = 10; a = 1.5708
DUMP:
x = 0; y = 2; a = 0
x = -4; y = 4; a = -0.785399
x = 6; y = 8; a = 0.927296
x = 21; y = 13; a = 0.554308
x = 18; y = 16; a = 0.726643
x = 12; y = 6; a = 0.463648
x = 15; y = -13; a = -0.714091
x = 8; y = -22; a = -1.22203
x = -9; y = -11; a = 0.885067
x = -11; y = 11; a = -0.785399
x = -2; y = 18; a = -1.46014
x = 4; y = 4; a = 0.785399
x = 4; y = -12; a = -1.24905
x = 7; y = -13; a = -1.07686
x = 8; y = 0; a = 0
x = 0; y = 10; a = 0


alexzk1 commented Jul 9, 2018

I used the NVIDIA reference implementation (http://developer.download.nvidia.com/cg/atan2.html); on the desktop it's OK.

float16 myatan2(float16 y, float16 x)
        {
          float16 t0, t1, t2, t3, t4;

          t3 = fabs(x);
          t1 = fabs(y);
          t0 = max(t3, t1);
          t1 = min(t3, t1);
          t3 = 1.f / t0;
          t3 = t1 * t3;

          t4 = t3 * t3;
          t0 =         - 0.013480470f;
          t0 = t0 * t4 + 0.057477314f;
          t0 = t0 * t4 - 0.121239071f;
          t0 = t0 * t4 + 0.195635925f;
          t0 = t0 * t4 - 0.332994597f;
          t0 = t0 * t4 + 0.999995630f;
          t3 = t0 * t3;

          t3 = (fabs(y) > fabs(x)) ? 1.570796327f - t3 : t3;
          t3 = (x < 0) ?  3.141592654f - t3 : t3;
          t3 = (y < 0) ? -t3 : t3;

          return t3;
        }

On RPI it works as well:

DUMP:
x = 0; y = 2; a = 1.5708
x = -4; y = 4; a = 2.35636
x = 6; y = 8; a = 0.927403
x = 21; y = 13; a = 0.554224
x = 18; y = 16; a = 0.726774
x = 12; y = 6; a = 0.463683
x = 15; y = -13; a = 5.569
x = 8; y = -22; a = 5.06123
x = -9; y = -11; a = 4.02667
x = -11; y = 11; a = 2.35636
x = -2; y = 18; a = 1.68144
x = 4; y = 4; a = 0.785232
x = 4; y = -12; a = 5.0342
x = 7; y = -13; a = 5.20633
x = 8; y = 0; a = 0
x = 0; y = 10; a = 1.5708
DUMP:
x = 0; y = 2; a = 1.5708
x = -4; y = 4; a = 2.3562
x = 6; y = 8; a = 0.927294
x = 21; y = 13; a = 0.554308
x = 18; y = 16; a = 0.72664
x = 12; y = 6; a = 0.463646
x = 15; y = -13; a = -0.714088
x = 8; y = -22; a = -1.22203
x = -9; y = -11; a = -2.25652
x = -11; y = 11; a = 2.3562
x = -2; y = 18; a = 1.68145
x = 4; y = 4; a = 0.785395
x = 4; y = -12; a = -1.24905
x = 7; y = -13; a = -1.07686
x = 8; y = 0; a = 0
x = 0; y = 10; a = 1.5708

Almost... The remaining negatives may mean it also does not do "select" properly, in this line:
a = select(a, a + pi2, a < 0);


doe300 commented Jul 9, 2018

Opencl on nvidia give a bit different results, then opencv-default

Depending on the compilation flags you specified, the NVIDIA code may use faster but inaccurate operations (e.g. due to -cl-fast-relaxed-math or -cl-mad-enable).

On RPI something is broken completely:

This looks like the atan2 function yielding wrong results. Some - maybe many - of the math functions in VC4CL are incorrect or not properly tested.

Used nvidia reference implementation [...]

Thanks for the link, this might come in very handy, if I can figure out its license...


alexzk1 commented Jul 9, 2018

Still can't make "select" work - is it broken too? Because I depended on it in another kernel as well (the initial one I simplified for this test).
It seems it just does nothing.


alexzk1 commented Jul 9, 2018

Depending on the compilation flags you specified, the NVIDIA code may use faster but inaccurate operations (e.g. due to -cl-fast-relaxed-math or -cl-mad-enable).

I was using progs.build("-cl-opt-disable");


alexzk1 commented Jul 9, 2018

Okay, replaced "select" with a = fmod(a + pi2, pi2);
But that is only a half-solution, because how do I do the rest of the logic?


alexzk1 commented Jul 9, 2018

OK, made a custom select for float16; for integers it must be even simpler:

float16 myselect(float16 afalse, float16 atrue, int16 condition)
{
    //we have -1 = true in condition ...it should be so
    float16 cond     = convert_float16(condition) * -1.f;
    float16 not_cond = 1.f - cond;
    return atrue * cond + afalse * not_cond;
}

int16 myselecti16(int16 afalse, int16 atrue, int16 condition)
{
    //we have -1 = true in condition ...it should be so
    int16 cond     = -1 * condition;
    int16 not_cond = 1 - cond;
    return atrue * cond + afalse * not_cond;
}

Btw, this works 10 times faster. With the original select I had to do convert_int16 because the result was float16 on the RPi, and that made the kernel take 200 ms on NVIDIA instead of the 17 ms it takes now. On the RPi though there is no difference in speed (still not sure whether the original select was working at all; most likely not).
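For the record, the OpenCL convention this workaround relies on: relational operators on vector types return -1 (all bits set) per element for true and 0 for false. A scalar plain-C model of the arithmetic replacement (mirroring the float16 myselect above):

```c
/* Scalar model of myselect: `condition` is -1 (true) or 0 (false),
 * as OpenCL vector comparisons produce for each element. */
static float myselect_scalar(float afalse, float atrue, int condition)
{
    float cond = (float)condition * -1.f; /* -1 -> 1.0, 0 -> 0.0 */
    float not_cond = 1.f - cond;
    return atrue * cond + afalse * not_cond;
}
```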

doe300 added a commit to doe300/VC4CLStdLib that referenced this issue Jul 9, 2018

alexzk1 commented Jul 9, 2018

Do I need to rebuild the compiler and VC4CL if the stdlib is updated, or is reinstalling the stdlib (including the PCH) enough?


doe300 commented Jul 9, 2018

You need to re-trigger building of the PCH and BC for vc4cl-stdlib, which is done by the script located in ${CMAKE_BINARY_DIR}/build/postinst from the VC4C project for installed stdlib and make vc4cl-stdlib for stdlib sources (in which case you have to delete <VC4CLStdLib-root>/include/VC4CLStdLib.h.pch and <VC4CLStdLib-root>/include/VC4CLStdLib.bc).


alexzk1 commented Jul 9, 2018

OK... so no compiler rebuild is needed if the PCH is built as a separate package, like here: https://github.com/alexzk1/vc4_stdlib_arch/blob/master/PKGBUILD ?

Btw, building VC4C gives many warnings like these - is that OK? (GCC 8)
/usr/include/c++/8.1.0/bits/stl_vector.h: In member function 'std::vector Parser::parseArgumentList(const string&, size_t)':
/usr/include/c++/8.1.0/bits/stl_vector.h:1085:4: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator<exprValue*, std::vector >' changed in GCC 7.1
In file included from /usr/include/c++/8.1.0/vector:69,

Also:

: warning: "_GNU_SOURCE" redefined
: note: this is the location of the previous definition

I think you are maybe missing an inline or so... or maybe an #ifdef or #undef.


doe300 commented Jul 10, 2018

Well, both warnings are libstdc++-internal. I can try to disable them, but I don't think I can do anything to fix them.

doe300 added a commit to doe300/VC4C that referenced this issue Jul 11, 2018
doe300 added a commit to doe300/VC4C that referenced this issue Jul 11, 2018

alexzk1 commented Jul 13, 2018

fmod(float16, float16) is still broken - the Pi hangs.
And maybe something else is too: I replaced fmod with an equivalent, but... on the next run it was OK, and the run after that hanged again...

Got one more error (the same code works on the desktop):

Failed to compiler kernels:
[E] Fri Jul 13 19:39:36 2018: Error assigning local to register: %vecinit33
[E] Fri Jul 13 19:39:36 2018: Error assigning local to register: %vecinit52
[E] Fri Jul 13 19:39:37 2018: Background worker threw error: Label/Register Mapping: There are erroneous register-associations!
[E] Fri Jul 13 19:39:37 2018: While running worker task: CodeGenerator
[E] Fri Jul 13 19:39:40 2018: Compiler threw exception: Label/Register Mapping: There are erroneous register-associations!


alexzk1 commented Jul 13, 2018

[photo attachment: img_2908]
No kernel panic, but there are red dots around... maybe because a monitor is connected? And it hangs.
I'm trying to do chained kernel calls (from C++) preserving a __global buffer between the calls - the output of the 1st comes in as input to the 2nd.

OK, it seems islessequal(float16, float16) and isgreaterequal(float16, float16) are broken - replaced them with isless(float16, float16) and the code runs a bit longer... and still hangs. Not sure.
Okay, it seems I narrowed down the problem:

This code fails:

atest = isless(fabs(angle - pi2), pi8); //90 not sure why, but this works better 90 = up/left
p1 = myselectf16(p1, Z4, atest);
p2 = myselectf16(p2, Z6, atest);

This works:

atest = isless(angle, pi8);
p1 = myselectf16(p1, Z2, atest);
p2 = myselectf16(p2, Z8, atest);

Zs are:

#define Z1 ((float16)(a, b.s0123, b.s456789ab, b.scde))
#define Z2 (b)
#define Z3 (float16)(b.s123, b.s4567, b.s89abcdef, c)
#define Z4 ((float16)(d, e.s0123, e.s456789ab, e.scde))
#define Z5 (e)
#define Z6 (float16)(e.s123, e.s4567, e.s89abcdef, f)
#define Z7 (float16)(g, h.s0123, h.s456789ab, h.scde)
#define Z8 (h)
#define Z9 (float16)(h.s123, h.s4567, h.s89abcdef, i)

doe300 added a commit to doe300/VC4C that referenced this issue Jul 23, 2018

alexzk1 commented Jul 23, 2018

To write to any memory? Or to some kind of "system memory"? Because OpenCL assumes such a written buffer must stay in GPU memory until explicitly copied to system memory in the C++ code (which is slow).
I know the Pi uses the same memory; however, why are many vloads much faster than 1 vstore? It seems like you do an implicit copy to system memory on store.


doe300 commented Jul 23, 2018

There is just one memory. The GPU directly accesses the CPU memory, it does not have any of its own.

For __private and __local memory, VC4C tries to optimize the memory access by either keeping the values in the registers or using the VPM (a small cache also used to write to RAM). For __constant memory, there is also an optimization trying to load the contents directly into registers.

__global memory is always located in the RAM. Also, if a memory location (the same parameter) is not written to, the loading of the data can also be sped up a bit.

On the host-side, since the memory is really shared, accessing a mapped buffer (e.g. via clEnqueueMapBuffer) directly accesses the physical addresses also used by the GPU, so no extra copying is required.


alexzk1 commented Jul 23, 2018

So... there is no way to beat that? The usual technique was to split the task into a couple of kernels and keep passing buffers between them, with only the final one copied to system memory. So I run 3 kernels. As you can see from the 2nd kernel example - loading all the data from the 1st plus the calculations = 8 ms; storing the data to pass to the 3rd = 90 ms extra. I don't get why loads are so much faster than stores if they use the same unit linearly.


alexzk1 commented Jul 23, 2018

So what if you checked how the buffer was created, and if it is not accessed from the host side (a kind of GPU-only buffer), you skip the locks? It is the kernel developer's problem to keep everything consistent.
Then, when cl::copy is called on such a buffer, you would have to copy/lock, but that would be done once at the end of the chain of kernels instead of at each step.


doe300 commented Jul 23, 2018

So...there is no way to beat that ?

No, that is a hardware limitation. Unless of course there is some other hardware component I don't know about, or there is some trick to be done with the known components...

I don't get why loads are much faster then storing, if it uses same unit linearly.

As I wrote in the post above, loads (if the same parameter is not written into, e.g. by const pointer) are optimized. They use a different hardware component, which is mainly for reading colors for texture processing, but can also read arbitrary data (and most importantly: in parallel!)

So what If you will check how that buffer was created, [...]

How would you do that? At compilation time I don't know which memory is really accessed host-side (at least not for __global memory). Also, using the VPM as cache requires locking too, but for a much shorter time-span, since the data does not need to be copied into RAM under lock.


alexzk1 commented Jul 23, 2018

Do you take the "restrict" keyword into account too? As I understand it, it tells the compiler no overlaps are expected on that pointer.

...OK, well, is there any way to take the lock once for a couple of vstores? For example, to store 16 float16s under 1 lock?


doe300 commented Jul 23, 2018

Do you take in account "restrict" keyword too ?

Currently no, I more or less assume all parameters to be restricted, which might not be right. But treating them correctly would decrease performance...

ok well, is any way to take lock once for couple vstore?

It is done. vstore16 writes (guarded by the hardware mutex) 16 elements first into VPM cache and then into RAM.


alexzk1 commented Jul 23, 2018

It is done. vstore16 writes (guarded by the hardware mutex) 16 elements first into VPM cache and then into RAM.

Well, I have a loop there which does a vstore16 of a float16, and the loop itself runs for 16 items. I was thinking to cache those and store 16 float16s (256 numbers) under 1 lock.


doe300 commented Jul 23, 2018

Ah okay.
There is an optimization for grouping stores into contiguous memory, but I don't think it currently works with loops, unless they are unrolled.


alexzk1 commented Jul 23, 2018

Okay... tried this one (I join it into one piece of code for easier copy/paste):

#define BACK_STORE vstore16(cache[k--], 0,  ( __global float*)(N + dstPaddedIndex)); dstPaddedIndex-=srcXStride

//--------------------------------

int k = 15;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;
BACK_STORE;

It is slower than in a loop.
Tried this:

#define BACK_STORE vstore16(cache[k++], 0, ( __global float*)(N + dstPaddedIndex2)); dstPaddedIndex2 += srcXStride

Same - a bit slower than a loop.


alexzk1 commented Jul 23, 2018

How big is the VPM cache? What if you did the copy to RAM on an explicit call to cl::copy (or on cache overflow)? Otherwise, if kernels are chained, the out/in addresses are the same and you could keep using the VPM.


alexzk1 commented Jul 23, 2018

If I lower the number of work items (i.e. do deeply nested loops in the kernel), will it speed up the vstores somehow?


doe300 commented Jul 23, 2018

How big is VPM cache?

64 rows (actually more, but only 64 can be addressed) with 16 words (32-bit) per row, but addressing works per row.
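The arithmetic from these figures, as a trivial plain-C check:

```c
/* Addressable VPM size from the figures above:
 * 64 rows x 16 words/row x 4 bytes/word. Note that one batch of
 * 16 x float16 (the 256 floats mentioned earlier in the thread)
 * would already occupy 16 of the 64 addressable rows on its own. */
static int vpm_addressable_bytes(void)
{
    return 64 * 16 * 4; /* 4096 bytes = 4 KiB */
}
```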

What if you will do copy to ram on explicit call of cl::copy (or cache overflow)?

Using the VPM as cache and only writing back when the cache is full would be a useful optimization, but also very hard to realize (esp. dynamically, e.g. with loops). The cache space needs to be split into blocks per QPU (core) and for each core the number of remaining free cache lines need to be tracked...

Otherwise, if kernels are chained, you will have same address out/in and can keep using VPM then.

This would be very useful, but the compiler cannot know if the input comes from another kernel or the host.

If I will lower amount of work items (i.e. will do deep nested loops in kernel), will it speed up vstore's somehow?

I doubt it, it will probably stay the same.


alexzk1 commented Jul 23, 2018

Damn, then that kills any use of the GPU - a 1:10 ratio of calculation to storing results is crazy.


alexzk1 commented Jul 23, 2018

Look here, this guy describes getting higher speeds:
https://github.com/mn416/QPULib


doe300 commented Jul 24, 2018

He introduces special functions to do asynchronous load and store.
OpenCL only provides the async_work_group_copy and prefetch functions, which both are not really what you need and also both cannot be really optimized for the VC4.

One thing though I saw in his code, or more accurately didn't see in his code, was mutex locking...


alexzk1 commented Jul 24, 2018

Do you use the TMU too? Is it for loads only?
https://petewarden.com/2014/08/07/how-to-optimize-raspberry-pi-code-using-its-gpu/


doe300 commented Jul 24, 2018

Yeah, for parameters which are only read and not written to, the TMU is used.

That is the component I mentioned, which is mainly for reading colors for textures and can read in parallel


alexzk1 commented Jul 25, 2018

Can you add a non-OpenCL function which accepts an ARRAY of values or a pointer (and its size) and does the vstore under a single lock? So usage could look like:

#ifdef V4C
float16 tmp[16];
v4c_vstore(tmp, 16, offset, ram_pointer)
#endif

Would it speed up stores?


doe300 commented Jul 25, 2018

What exactly do you mean?
What would the function definition look like? What should be the behavior for the vc4c_vstore?

With the current code you may be able to do the following:

 vc4cl_mutex_lock();
vc4cl_dma_write(ptr, data[0]);
vc4cl_dma_write(ptr + offset, data[1]);
vc4cl_dma_write(ptr + offset * 2, data[2]);
vc4cl_dma_write(ptr + offset * 3, data[3]);
// ...
vc4cl_mutex_unlock();

This uses internal VC4CLStdLib functions, but they might change at any time without warnings!


alexzk1 commented Jul 25, 2018

Yes, I was wondering whether using a single mutex to store many values instead of 1 would speed up the process. vstore takes 1 value (variable); I was thinking of using an array of variables, i.e. in my example it stores 256 floats at once.

Also, the compiler should provide some #define so this can be recognized at compile time.

What should be the behavior for the vc4c_vstore

It should accept an array of values (float, float8, float16 - any kind), the size of this array, the output start (as vstore does) and some "stride", so each next element of the array is written to stride * index + pointer_ram (if it is a vector, all its elements go out sequentially starting from the calculated address).
And then it does the same as vstore does - looping over the array and writing out, but inside a single lock.


alexzk1 commented Jul 29, 2018

Ok, tried that:

void arrayOut(float16* arr, int sz, __global float16* out, int stride)
{
    vc4cl_mutex_lock();
    for (int i = 0; i < sz; ++i)
    {
        __global float* ptr = (__global float*)(out + stride * i);
        vc4cl_dma_write(ptr + 0, arr[i].s0);
        vc4cl_dma_write(ptr + 1, arr[i].s1);
        vc4cl_dma_write(ptr + 2, arr[i].s2);
        vc4cl_dma_write(ptr + 3, arr[i].s3);

        vc4cl_dma_write(ptr + 4, arr[i].s4);
        vc4cl_dma_write(ptr + 5, arr[i].s5);
        vc4cl_dma_write(ptr + 6, arr[i].s6);
        vc4cl_dma_write(ptr + 7, arr[i].s7);

        vc4cl_dma_write(ptr + 8, arr[i].s8);
        vc4cl_dma_write(ptr + 9, arr[i].s9);
        vc4cl_dma_write(ptr + 10, arr[i].sa);
        vc4cl_dma_write(ptr + 11, arr[i].sb);

        vc4cl_dma_write(ptr + 12, arr[i].sc);
        vc4cl_dma_write(ptr + 13, arr[i].sd);
        vc4cl_dma_write(ptr + 14, arr[i].se);
        vc4cl_dma_write(ptr + 15, arr[i].sf);
    }
    vc4cl_mutex_unlock();
}

Gives the following errors:

%call2 = tail call spir_func i32 bitcast (i32 ()* @vc4cl_dma_write to i32 (float addrspace(1), double))(float addrspace(1)* %1, double %conv) #10
Failed to compiler kernels:
[W] Sun Jul 29 18:49:21 2018: 64-bit operations are not supported by the VideoCore IV architecture, further compilation may fail!
[W] Sun Jul 29 18:49:21 2018: 64-bit operations are not supported by the VideoCore IV architecture, further compilation may fail!
[W] Sun Jul 29 18:49:21 2018: 64-bit operations are not supported by the VideoCore IV architecture, further compilation may fail!
[E] Sun Jul 29 18:49:21 2018: Compiler threw exception: Parser: Unhandled type of indirect function call!


doe300 commented Jul 30, 2018

I do not understand why this error is thrown, vc4cl_dma_write does not return anything. Also, why does it take double arguments?

Did you declare the function before use? The error looks like it was not declared, and since OpenCL is basically C, any undeclared function is assumed to be of type int (). You will need to declare void vc4cl_dma_write(volatile __global float*, float) in this case.


alexzk1 commented Jul 30, 2018

No, I didn't declare anything, including the mutex functions.
Maybe you could change the message text to something more GCC-like, e.g. "function is implicitly declared".

OK, the forward declaration worked, but the idea seems bad: 4 times slower in total than direct vstore16.


doe300 commented Jul 30, 2018

Maybe you could change texts to something more like GCC ? Like "function is implicitly declared" or so.

For that I would need to know that the function is not declared. This is already handled by Clang, which converts it to this strange bit-cast function call. From the VC4C point of view, this is just a bit-cast function call, which is not supported.

4 times slower total, than direct vstore16

I didn't notice it before, but why do you write every element of the vector separately? You could also use vc4cl_dma_write(volatile __global float*, float16) to write the whole vector at once.
Also, the loop with the non-constant iteration count will be a problem. If you allow clang to unroll it (e.g. by converting sz to a macro or constant), this should be faster than using multiple vstore16s.
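This suggestion - making the count a compile-time constant so clang can unroll the loop - can be modeled in plain C (STORE_COUNT is a stand-in macro name, not a VC4C identifier):

```c
/* A loop with a macro trip count can be fully unrolled by the
 * compiler, unlike one bounded by a runtime `sz` parameter. */
#define STORE_COUNT 16

static float sum_batch(const float *vals)
{
    float total = 0.f;
    for (int i = 0; i < STORE_COUNT; ++i) /* constant bound -> unrollable */
        total += vals[i];
    return total;
}
```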


alexzk1 commented Jul 30, 2018

Well, this works slightly slower (about 20%) than the original per-element vstore16s:

void vc4cl_dma_write(volatile __global float*, float16);
void vc4cl_mutex_lock();
void vc4cl_mutex_unlock();

void arrayOut16(const float16* arr, __global float16* out, const int stride)
{
    vc4cl_mutex_lock();
    for (int i = 0; i < 16; ++i)
    {
        __global float* ptr = (__global float*)(out + stride * i);
        vc4cl_dma_write(ptr, arr[i]);
    }
    vc4cl_mutex_unlock();
}


doe300 commented Aug 5, 2018

Can you send me a full example kernel of the memory access you do? So I can better see what improvements I can make that will help you.


alexzk1 commented Aug 5, 2018

Here:
https://github.com/alexzk1/LitterBug-Algorithm/blob/Desktop_Litterbug/cl/kernels.cpp

See hysterisis and non_maximum.

You can check the history of the file to see what I tried. I use GitHub to copy-paste code to the Pi, so everything is there.


alexzk1 commented Aug 6, 2018

I tried the 1st kernel there - with all vstores disabled it takes 35 ms; once the vstores are enabled it takes 200+ ms. Not sure why they made such a chip - it seems useless for this.
