I have a global domain of 2048x1365. What workgroup size should I choose? Is it possible to choose a size other than (1,1)?
I also have a general question about the workgroup size. How does it affect the runtime of my algorithm? Does a bigger workgroup size make my algorithm faster?
Choose 8x8 or 16x8, for example, or as large as the device allows. Round your global work size up to a multiple of the workgroup size in each dimension, pass the original domain size as a kernel parameter, and place a condition at the beginning of your kernel that returns immediately if the work-item falls outside the border. Only the few workgroups along the border will have divergent paths, so the loss will be nearly nothing (surely less than with 1x1).
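To make the rounding concrete, here is a minimal sketch of the arithmetic for the 2048x1365 domain with a 16x8 workgroup. The helper names (`round_up`, `idle_work_items`) and the kernel name `process` are made up for illustration. The kernel source is only shown as a string and is not compiled here, so treat it as an outline of the bounds check rather than a tested kernel.

```python
def round_up(n, multiple):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


WIDTH, HEIGHT = 2048, 1365   # original domain from the question
LOCAL = (16, 8)              # example workgroup size; 8x8 works the same way

# Padded global size: each dimension is a multiple of the workgroup size.
GLOBAL = (round_up(WIDTH, LOCAL[0]), round_up(HEIGHT, LOCAL[1]))

# Hypothetical OpenCL C kernel (illustrative only, not compiled here).
# The original width/height are passed in so padded work-items can exit early.
KERNEL_SRC = """
__kernel void process(__global float *data, const int width, const int height)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    if (x >= width || y >= height)
        return;                      /* outside the real domain */
    data[y * width + x] *= 2.0f;     /* placeholder for the real work */
}
"""


def idle_work_items(global_size, width, height):
    """Number of padded work-items that hit the early return."""
    return global_size[0] * global_size[1] - width * height


if __name__ == "__main__":
    print("global size:", GLOBAL)
    print("idle work-items:", idle_work_items(GLOBAL, WIDTH, HEIGHT))
```

With these numbers only the height needs padding (1365 becomes 1368), so the idle work-items are three extra rows out of 1368. One caveat: if the kernel uses `barrier()`, an early return before the barrier is not safe, and the bounds check should wrap the work instead.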