Can we have global work size a multiple of 16? #101

Open · Anaphory opened this issue Feb 12, 2021 · 2 comments

@Anaphory

Out of toy interest, I am trying to get OpenCL and the tree-likelihood computation library BEAGLE to run on a Pi. BEAGLE assumes that work sizes are divisible by 16, because that is handy for nucleotide substitution matrices, and it fails on the 12×12×12 work size limit of VC4CL on the Pi.

Unfortunately, I don't know much about low-level programming and hardware (and I really don't understand any of OpenCL, the Pi's GPU architecture, or what the work size actually means, sorry), so the question I ask may be a bit dumb:
Would it be possible to change the work size?

I have been looking for the source of the magic number here in the repository and found this comment in `VC4CL/src/vc4cl_config.h`, lines 140 to 143 at 842d444:

* "The work-items in a given work-group execute concurrently on the processing elements of a single compute
* unit." (page 24) Since there is no limitation, that work-groups need to be executed in parallel, we set 1
* compute unit with all 12 QPUs, allowing us to run 12 work-items in a single work-group in parallel and run
* the work-groups sequentially.

If work-items can in part be executed sequentially, could I be taught to set some of the work size limits to 48 (the lcm of 12 and 16) for a small performance hit, or is that number embedded too deeply in the code, so that changing it would require a lot of changes in other places? Like here:
V3D::instance()->getSystemInfo(SystemInfo::QPU_COUNT), param_value_size, param_value, param_value_size_ret);
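
For context, this is roughly how that limit shows up on the application side. A minimal standalone query sketch (not BEAGLE's actual code; it assumes a single platform and device and omits error checking):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    size_t max_group_size;    /* max work-items per work-group  */
    size_t max_item_sizes[3]; /* per-dimension work-item limits */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_group_size), &max_group_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(max_item_sizes), max_item_sizes, NULL);

    /* On VC4CL this should report 12 and 12 x 12 x 12. */
    printf("max work-group size: %zu\n", max_group_size);
    printf("max work-item sizes: %zu x %zu x %zu\n",
           max_item_sizes[0], max_item_sizes[1], max_item_sizes[2]);
    return 0;
}
```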

@doe300 (Owner) commented Feb 12, 2021

I think you misunderstood the comment: work-groups can be run sequentially, but work-items (the single executions within a work-group) must be run in parallel.

The 12 for work-group size (number of work-items in a single work-group) is a hardware/implementation limitation, since we only have 12 cores.
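
To illustrate: a *global* work size that is a multiple of 16 is already allowed, as long as the local (work-group) size divides it and stays within the 12-item limit. A minimal host-side sketch, where `queue` and `kernel` stand in for objects set up elsewhere:

```c
/* A global work size of 48 (a multiple of 16) is fine: OpenCL splits it
 * into four work-groups of 12 work-items each, and VC4CL runs those
 * work-groups one after the other on its 12 QPUs. */
size_t global_size = 48;  /* lcm(12, 16): total number of work-items   */
size_t local_size  = 12;  /* must divide global_size and be <= 12 here */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size,
                                    0, NULL, NULL);
```

So presumably what fails is a requested work-group size of 16, which exceeds the 12-item cap, not the global size being a multiple of 16.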

I am currently working on a very experimental optimization to merge work-items, which would then allow for work-groups of more than 12 items. But whether this can be applied depends on the kernels being executed...

@Anaphory (Author)

Yes, there's probably a lot of confusion in my head about these things. Thank you very much for engaging nonetheless!
