I want to know what priority the compiler uses when distributing wavefronts to the SIMD engines (compute units, CUs).
Assume I have 20 wavefronts (as reported by the profiler in Visual Studio). The HD 5870 has 20 SIMD engines.
Which one is correct:
Each SIMD engine gets 1 wavefront, or
1 SIMD engine gets 4 wavefronts (so 5 SIMD engines are used and the remaining 15 SIMD engines sit idle)?
The reason I asked the question above:
I observed two cases in my experiments (local work size is set to NULL).
If the total number of work-items (global work size) is large, then from the number of wavefronts reported by the profiler (after I do some math) I can tell that 1 wavefront holds 64 work-items (full).
If the total number of work-items is not very large, the compiler chooses to only half-fill each wavefront (1 wavefront holds 32 work-items), so the number of wavefronts reported is larger than I expected. It seems the compiler prefers more wavefronts (even though they are half-filled, 32 work-items each) over fewer wavefronts (completely filled, 64 work-items each). Is that correct?
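The "some math" behind these two cases can be sketched roughly as below. This is only an illustration under stated assumptions: a wavefront size of 64, the fact that a wavefront never spans two work-groups, and a global size that is an exact multiple of the local size. The function names and the example sizes (1280 and 640 work-items) are made up for illustration and are not taken from the profiler.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the question


def wavefronts_per_group(local_size, wavefront_size=WAVEFRONT_SIZE):
    """Wavefronts needed for one work-group (rounded up).

    A wavefront never spans two work-groups, so a work-group of 32
    work-items still occupies one whole wavefront that is half empty.
    """
    return (local_size + wavefront_size - 1) // wavefront_size


def total_wavefronts(global_size, local_size, wavefront_size=WAVEFRONT_SIZE):
    """Total wavefronts launched for a 1-D NDRange."""
    if global_size % local_size != 0:
        # Assumption of this sketch: the global size divides evenly.
        raise ValueError("global_size must be a multiple of local_size")
    groups = global_size // local_size
    return groups * wavefronts_per_group(local_size, wavefront_size)


def average_fill(global_size, local_size, wavefront_size=WAVEFRONT_SIZE):
    """Fraction of wavefront lanes that actually carry a work-item."""
    lanes = total_wavefronts(global_size, local_size, wavefront_size) * wavefront_size
    return global_size / lanes


# Hypothetical sizes: both launches produce 20 wavefronts.
print(total_wavefronts(1280, 64), average_fill(1280, 64))  # full wavefronts
print(total_wavefronts(640, 32), average_fill(640, 32))    # half-filled wavefronts
```

With these hypothetical numbers, a local size of 32 gives the same wavefront count as a launch twice as large with a local size of 64, which would match the half-filled wavefronts described above if a local size of 32 was picked.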
I hope someone can help me with this question. I'm writing a school report, so I don't want to write wrong information in the report.
Just a note: it is not the compiler but the AMD APP runtime (Catalyst driver) that chooses the local work size when none is specified in enqueueNDRange.