As I understand it, 256 is the default maximum work-group size, chosen so that any OpenCL application works properly. The AMD-specific value (i.e. 1024) reflects the maximum hardware capability, and a user can opt into it by forcing the dimensions with __attribute__((reqd_work_group_size(X, Y, Z))). However, developers must understand the performance impact of such a request.
By the way, when I ran the code you mentioned in the first post, I got the expected output on a Windows machine. Currently I don't have a Linux setup to test it. I have a suggestion, though: instead of printf, could you please try another method (such as writing to a buffer) to verify whether the work-items corresponding to get_local_id(1) are executing? If you still see the problem, please share a repro and I'll report it to the concerned team.
Well, the printf is usually not part of my kernel. When I ran it the first time, it contained only the usual working code, and I wondered why it executed much faster than I anticipated, until I realized that only a fraction of my output data had been written. The kernel's amount of output is a quadratic function of the number of input items, and while researching the cause I found that only the first half of the input data was read; the second half, belonging to work-items 256-511, was missing.
That's why I put the printf in, to see what's going on.
I will later try to set up a Windows installation on the same machine to check whether the Linux and Windows drivers behave differently.
I wanted to post back something I recently learned, for anyone finding this thread via Google:
It turned out that what I was asking for is indeed possible, but requires a (seemingly undocumented) environment variable to be set. Namely, if I set GPU_MAX_WORKGROUP_SIZE=1024, I can go up to that value; it's even shown in clinfo then.
I wonder why this switch exists yet is so poorly documented, but it works just fine as long as the kernel has enough resources.
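For anyone else landing here, setting the variable for a single run looks like this; GPU_MAX_WORKGROUP_SIZE is the undocumented switch discussed above, so treat its behavior as driver-version dependent, and the demo command is a placeholder for your real OpenCL binary:

```shell
# One-shot: the variable is visible only to this invocation.
# Replace `env | grep ...` with your actual OpenCL application.
GPU_MAX_WORKGROUP_SIZE=1024 env | grep '^GPU_MAX_WORKGROUP_SIZE='

# Session-wide alternative: export it once, then run your app normally.
export GPU_MAX_WORKGROUP_SIZE=1024
```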
Now a new question:
Any chance to activate usage of the full 64 kB of local memory (instead of 32) per work-group? The kernel compiles fine, but I get an out-of-resources runtime error if I try to use it ^^
Have a nice day
GCN 1st to 4th Gen only have 32 kB of Local memory. GCN 5th Gen has the full 64 kB that was originally planned but scrapped; however, the ISA documentation was never updated to reflect this.
I believe it was corrected in the GCN3 (Update) ISA … although only briefly, in the accompanying text, not in the Architecture Diagram.
I'm not entirely sure of the reason for the reduced caches relative to the ISA, but it was likely a cost-saving measure.
If you look at the OpenCL properties (easiest way is via GPU-Z), it should list 256 kB Queue • 16 kB Global (GDS) • 32 kB Local (LDS); that should be the same from the HD 7770 up to Polaris 30, and technically for Ryzen with Vega Graphics too, as they're Polaris as well.
RX Vega 56 / 64 / FE and Radeon VII, however, should list 512 kB Queue • 64 kB Global (GDS) • 64 kB Local (LDS).
So if you want portability, I'd stick within the bounds of GCN 1st to 4th Gen.