When I collect a trace from my application, the AMD APP Profiler reports:
I leave the local work size as NULL and execute the kernel over a 1D NDRange.
On the 5870 GPU, MAX_WORK_GROUP_SIZE is 256, so one work-group has at most 256 work-items, or 4 wavefronts of 64 work-items each.
1. Since it has 5120 wavefronts, does that mean it has 5120/4 = 1280 work-groups?
2. AFAIK, work-groups are distributed (equally?) across the SIMD engines. The 5870 has 20 SIMD engines, so each SIMD engine gets 1280/20 = 64 work-groups (= 64 x 4 wavefronts = 256 wavefronts = 16384 work-items). Is that correct?
1. Yes, as long as the kernel's resource usage (register allocation and local memory use) allows it. Otherwise the work-group size might be 128 or even 64, which would give 2560 or 5120 work-groups instead of 1280.
2. You can simplify this and say that the 5120 wavefronts are split across 20 SIMDs, which gives 256 wavefronts per SIMD. The hardware might spread them slightly unevenly, though (e.g. if the kernel starts immediately after another kernel). In practice this unevenness doesn't matter.
As long as there are more than 2 wavefronts per SIMD, you are getting decent performance. AMD recommends at least 3 wavefronts for ALU-heavy code and at least 5 for memory-heavy code.
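
The arithmetic discussed above can be checked with a short Python sketch. It is only an illustration: the function name `breakdown` and its parameters are made up for this example, the 64 work-items per wavefront follows from 256 work-items being 4 wavefronts, and the even spread across SIMDs is the simplifying assumption from answer 2, not a guarantee about the hardware.

```python
# Illustrative arithmetic only; names are hypothetical, not part of any AMD API.
WAVEFRONT_SIZE = 64      # work-items per wavefront (256 work-items = 4 wavefronts)
NUM_SIMDS = 20           # SIMD engines on the 5870, as stated in the question
TOTAL_WAVEFRONTS = 5120  # wavefront count quoted in the question


def breakdown(total_wavefronts, work_group_size,
              num_simds=NUM_SIMDS, wavefront_size=WAVEFRONT_SIZE):
    """Split a wavefront count into work-groups and per-SIMD shares.

    Assumes an even spread across SIMDs and counts that divide exactly.
    """
    # Round up: a partially filled wavefront still occupies a whole one.
    wavefronts_per_group = -(-work_group_size // wavefront_size)
    work_groups = total_wavefronts // wavefronts_per_group
    wavefronts_per_simd = total_wavefronts // num_simds
    return {
        "wavefronts_per_group": wavefronts_per_group,
        "work_groups": work_groups,
        "work_groups_per_simd": work_groups // num_simds,
        "wavefronts_per_simd": wavefronts_per_simd,
        "work_items_per_simd": wavefronts_per_simd * wavefront_size,
    }


if __name__ == "__main__":
    # Work-group sizes mentioned in the thread: 256, or 128 / 64 if resources are tight.
    for size in (256, 128, 64):
        print(size, breakdown(TOTAL_WAVEFRONTS, size))
```

Note that the number of wavefronts per SIMD stays the same for every work-group size; only the number of work-groups changes.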