For example, suppose I have a kernel with a global size of 1024 (one dimension) and I pass NULL as the local work size.
Also suppose my GPU has 20 CUs.
How will the implementation distribute the 1024/64 = 16 wavefronts in this case? Will it pick a workgroup size of 64 and place the 16 wavefronts (one per workgroup) on 16 CUs, leaving 4 CUs idle?
Or will it use the maximum possible workgroup size (assuming the kernel is simple enough that a workgroup size of 256 is possible), create 1024/256 = 4 workgroups, place them on 4 CUs, and leave the other 16 CUs idle? This is an important question for me because the algorithm I have now often uses such small global sizes, so whether 16 CUs or 4 CUs are used can make a big difference for my app.
So, how will AMD's OpenCL implementation behave in this case?
And another question: how can I see which size it chose? CodeXL just shows NULL as the workgroup size parameter and doesn't give any insight into what workgroup size was actually used for this particular kernel invocation...