Can we have global work size a multiple of 16? #101

Open · Anaphory opened this issue Feb 12, 2021 · 2 comments

@Anaphory

Out of toy interest, I am trying to get OpenCL and the tree-likelihood computation library BEAGLE to run on a Pi. BEAGLE assumes that work sizes are divisible by 16, because that is handy for nucleotide substitution matrices, and it fails on the 12×12×12 work size limit of VC4CL on the Pi.

Unfortunately, I don't know much about low-level programming and hardware (and I really don't understand any of OpenCL, the Pi's GPU architecture, or what the work size actually means, sorry), so the question I ask may be a bit dumb:
Would it be possible to change the work size?

I have been looking for the source of the magic number here in the repository and found this comment in `VC4CL/src/vc4cl_config.h`, lines 140 to 143 at 842d444:

* "The work-items in a given work-group execute concurrently on the processing elements of a single compute
* unit." (page 24) Since there is no limitation, that work-groups need to be executed in parallel, we set 1
* compute unit with all 12 QPUs, allowing us to run 12 work-items in a single work-group in parallel and run
* the work-groups sequentially.

If work-items can in part be executed sequentially, could I be taught to set some of the work size limits to 48 (the lcm of 12 and 16) for a small performance hit, or is that number embedded too deeply in the code, so that changing it would require a lot of changes in other places? Like here:
V3D::instance()->getSystemInfo(SystemInfo::QPU_COUNT), param_value_size, param_value, param_value_size_ret);
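
For context, this is roughly how that limit shows up on the application side. A minimal standalone query sketch (not BEAGLE's actual code; it assumes a single platform and device and omits error checking):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    size_t max_group_size;    /* max work-items per work-group  */
    size_t max_item_sizes[3]; /* per-dimension work-item limits */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_group_size), &max_group_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(max_item_sizes), max_item_sizes, NULL);

    /* On VC4CL this should report 12 and 12 x 12 x 12. */
    printf("max work-group size: %zu\n", max_group_size);
    printf("max work-item sizes: %zu x %zu x %zu\n",
           max_item_sizes[0], max_item_sizes[1], max_item_sizes[2]);
    return 0;
}
```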

@doe300 (Owner) commented Feb 12, 2021

I think you misunderstood the comment: work-groups can be run sequentially, but work-items (the single executions within a work-group) must be run in parallel.

The 12 for work-group size (number of work-items in a single work-group) is a hardware/implementation limitation, since we only have 12 cores.
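
To illustrate: a *global* work size that is a multiple of 16 is already allowed, as long as the local (work-group) size divides it and stays within the 12-item limit. A minimal host-side sketch, where `queue` and `kernel` stand in for objects set up elsewhere:

```c
/* A global work size of 48 (a multiple of 16) is fine: OpenCL splits it
 * into four work-groups of 12 work-items each, and VC4CL runs those
 * work-groups one after the other on its 12 QPUs. */
size_t global_size = 48;  /* lcm(12, 16): total number of work-items   */
size_t local_size  = 12;  /* must divide global_size and be <= 12 here */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size,
                                    0, NULL, NULL);
```

So presumably what fails is a requested work-group size of 16, which exceeds the 12-item cap, not the global size being a multiple of 16.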

I am currently working on a very experimental optimization to merge work-items, which would then allow for work-groups of more than 12 items. But whether this can be applied depends on the kernels being executed...

@Anaphory (Author)

Yes, there's probably a lot of confusion in my head about these things. Thank you very much for engaging nonetheless!
