1. Are "wavefront" and "wavefront granularity" on AMD equivalent to "warp size" and "warp size granularity" on NVIDIA?
2. When creating a new variable in a kernel without explicitly using "private/local/global/constant/..." in the declaration, for example "float newVar;", in which memory is it created, and what is the priority? Is it automatically global?
2. Variables inside kernels go to the __private address space by default, so they will normally be held in registers.
3. Only the documentation will tell you. If you use too many registers, they will spill over to __global memory, and the compiler will try to load them in time so that you do not see the difference. However, compilers cannot do magic.
4. Yes, there is a certain speed gain if you use indices a lot. You can index a 3-dimensional space with a 1-dimensional work size, but then you need extra computation to access elements of the volume. With a 3-dimensional work size you do not need to compute the indices yourself.
5. You do not have to rearrange your kernel. If multiple workgroups fit onto the same CU, they will operate next to each other. A CU is by definition a group of processors that holds the resources shared by a workgroup. That definition does not say a CU is exclusive to one workgroup. As long as the workgroups fit next to each other in terms of registers and local memory, they will operate next to each other.
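To put point 5 in numbers, here is a minimal C sketch of the "fit next to each other" reasoning. The helper name groups_per_cu and all resource figures are hypothetical, chosen only for illustration. The real per-CU limits are in the vendor documentation, and real hardware has further limits (for example a maximum number of wavefronts per CU) that this sketch ignores.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: number of workgroups that can be resident on one CU
   at once, limited by whichever resource (registers or local memory) runs
   out first. All figures are supplied by the caller. */
static unsigned groups_per_cu(unsigned cu_regs, unsigned cu_local_bytes,
                              unsigned regs_per_item, unsigned items_per_group,
                              unsigned local_bytes_per_group)
{
    assert(items_per_group > 0);
    unsigned regs_per_group = regs_per_item * items_per_group;

    /* A resource the kernel does not use imposes no limit. */
    unsigned by_regs  = regs_per_group ? cu_regs / regs_per_group : UINT_MAX;
    unsigned by_local = local_bytes_per_group
                      ? cu_local_bytes / local_bytes_per_group : UINT_MAX;

    return by_regs < by_local ? by_regs : by_local;
}
```

With made-up figures of 16384 registers and 32768 bytes of local memory per CU, a kernel that needs 16 registers per work-item, 256 work-items per group and 16384 bytes of local memory per group gives groups_per_cu(16384, 32768, 16, 256, 16384) == 2: registers alone would allow four groups, but local memory only allows two.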
Thank you for the quick reply Meteorhead! Follow up questions:
2. Is it important then to declare all my private variables first, or will my compiler take the variables that are explicitly declared as "__private" first?
4. So in practice, the only performance benefit of using a 2D/3D work dimension is in the index computation?
Thank you MicahVillmow and rick.weber.
If my local memory size is 16 kB and I use a 2D workgroup (256,32), I will still only have 16 kB at my disposal for all 256x32 elements, right? Not 32x16 or 256x16?
Originally posted by: MicahVillmow omgi, 2) This doesn't matter; spilling only occurs when too many registers are required, which is not directly related to the number of private variables. 4) The compiler will generate index computation no matter what dimension the programmer specifies. Because different hardware generates indices differently, one method might be faster than another on a given piece of hardware. Use what is beneficial for your algorithm and let the compiler deal with the calculations.
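For reference, the index arithmetic under discussion in point 4 looks roughly like this in plain C. This is only a sketch: the names idx3, flatten3 and unflatten3 and the x-fastest memory layout are assumptions made for the example. In a kernel launched with a 3D work size the three coordinates would come from get_global_id(0), get_global_id(1) and get_global_id(2) instead of being derived by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 3D coordinate type used only in this example. */
typedef struct { size_t x, y, z; } idx3;

/* 3D coordinate -> linear offset, x varying fastest (assumed layout). */
static size_t flatten3(idx3 p, size_t nx, size_t ny)
{
    assert(p.x < nx && p.y < ny);
    return p.x + nx * (p.y + ny * p.z);
}

/* Linear ID -> 3D coordinate: the extra divide/modulo work that a
   1D launch over a 3D volume has to do by hand. */
static idx3 unflatten3(size_t id, size_t nx, size_t ny)
{
    assert(nx > 0 && ny > 0);
    idx3 p;
    p.x = id % nx;
    p.y = (id / nx) % ny;
    p.z = id / (nx * ny);
    return p;
}
```

For a volume with nx = 16 and ny = 8, unflatten3(1234, 16, 8) yields (2, 5, 9), and flatten3 maps that coordinate back to 1234.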
I'm not sure how to phrase this question, but how am I supposed to "think" about my private variables so that they do not spill over to global? What precautions should I take and what limits should I observe? Is it enough to add up the total size of the private variables and make sure that it is not greater than my private memory size?
The only reliable way to get register/private memory usage is to compile the kernel and look at the compiled code. You can get the register usage from the AMD kernel analyzer, or, when you dump the ISA code, you can find the number of registers at the end of the ISA code.