6 Replies Latest reply on Jun 23, 2011 8:17 AM by nou

    5 short questions

    omgi

      1. Are "wavefront" and "wavefront granularity" on AMD equivalent to "warp size" and "warp size granularity" on nVidia?

      2. When creating a new variable in a kernel without explicitly using "private/local/global/constant/..." in the declaration, for example "float newVar;", in what memory is it created, and what is the priority? Is it automatically global?

      3. Is there any way to estimate how much private memory I have on my GPUs (nVidia GTX 470 and ATI HD5850)?

      4. Is there any particular reason to use 2D or 3D work groups, other than it might be easier/prettier to map the threads to the work space? Performance gain for example?

      5. Let's say that I want to operate on many small vectors of length 64, and the optimal work group size for my platform is 256. Is it a bad idea (performance-wise) to set the group size to 32 or 64? Is it very important not to go too far below 256, and instead to spread one work group across several vectors? I ask because splitting the work group up like that could be bad for some aspects of my implementation.
        • 5 short questions
          Meteorhead

          1. Yes

          2. Variables inside kernels go to the __private memory space by default, so they will normally be held in registers.

          3. Only the documentation will tell you. If you use too many registers, they will spill over to __global, and the compiler will try to load them in time so that you do not see the difference. However, compilers cannot do magic.

          4. Yes, there is a certain speed gain if you do a lot of indexing. You can index a 3-dimensional space with a 1-dimensional work size, but you then need extra computation to access elements of the volume. With a 3-dimensional work size you need not compute the indices yourself.

          5. You do not have to rearrange your kernel. If multiple workgroups fit onto the same CU, they will run side by side. A CU is by definition a group of processors that hold resources shared within a workgroup; nothing says that a CU is exclusive to one workgroup. As long as the workgroups fit next to each other in terms of registers and local memory, they will run next to each other.
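
          To make point 5 concrete, here is a rough sketch of the "do they fit next to each other" check. It is plain arithmetic, not a vendor formula. The function name `groups_per_cu` and every limit below are hypothetical placeholders; real limits have to come from the documentation for your GPU.

```python
def groups_per_cu(group_size, regs_per_item, lds_per_group,
                  regs_per_cu, lds_per_cu, max_groups_per_cu):
    """Estimate how many work-groups can be resident on one compute unit.

    All hardware limits are parameters; take real values from vendor docs.
    """
    regs_per_group = group_size * regs_per_item
    by_regs = regs_per_cu // regs_per_group
    # A kernel that uses no local memory is not limited by it.
    by_lds = lds_per_cu // lds_per_group if lds_per_group > 0 else max_groups_per_cu
    return min(by_regs, by_lds, max_groups_per_cu)


# Hypothetical limits, for illustration only (not the specs of any real GPU).
REGS_PER_CU = 16384      # registers available per compute unit
LDS_PER_CU = 32 * 1024   # bytes of local memory per compute unit
MAX_GROUPS = 8           # assumed cap on resident work-groups per CU

small = groups_per_cu(64, 20, 2048, REGS_PER_CU, LDS_PER_CU, MAX_GROUPS)
large = groups_per_cu(256, 20, 8192, REGS_PER_CU, LDS_PER_CU, MAX_GROUPS)
print(small, small * 64)    # resident groups and work-items with size 64
print(large, large * 256)   # resident groups and work-items with size 256
```

          With these made-up numbers the size-64 case is limited by the assumed cap on resident groups rather than by registers or local memory, so several small groups do share a CU, but whether that matches the size-256 case depends entirely on the real limits.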

            • 5 short questions
              omgi

              Thank you for the quick reply Meteorhead! Follow up questions:

              2. Is it important then to declare all my private variables first, or will my compiler take the variables that are explicitly declared as "__private" first?

              4. So practically, the only performance benefit of using 2D/3D workdim is the computation of index?

            • 5 short questions
              MicahVillmow
              omgi,
              2) this doesn't matter, spilling only occurs when too many registers are required, which is not directly related to the number of private variables.
              4) The compiler will generate index computation no matter what dimension the programmer specifies. Because different hardware generates indices differently, one method might be faster than another on a given piece of hardware. Use what is beneficial for your algorithm and let the compiler deal with the calculations.
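
              For reference, the index arithmetic under discussion looks roughly like this. With a 1D range a kernel would take the single value from get_global_id(0) and recover x, y, z itself; with a 3D range it would read get_global_id(0), get_global_id(1) and get_global_id(2) directly. The sketch below only illustrates the arithmetic in Python, assuming a row-major layout with x varying fastest; `flatten` and `unflatten` are hypothetical helper names.

```python
def flatten(x, y, z, size_x, size_y):
    """Linear index of element (x, y, z) in a volume, x varying fastest."""
    return (z * size_y + y) * size_x + x


def unflatten(i, size_x, size_y):
    """Recover (x, y, z) from a linear index: the extra work a 1D range needs."""
    x = i % size_x
    y = (i // size_x) % size_y
    z = i // (size_x * size_y)
    return x, y, z


# Round trip over a small 4 x 3 x 2 volume.
SIZE_X, SIZE_Y, SIZE_Z = 4, 3, 2
round_trip_ok = all(
    unflatten(flatten(x, y, z, SIZE_X, SIZE_Y), SIZE_X, SIZE_Y) == (x, y, z)
    for z in range(SIZE_Z) for y in range(SIZE_Y) for x in range(SIZE_X)
)
print(round_trip_ok)
```

              Whether the divisions and modulos in `unflatten` cost anything measurable compared with reading three ids depends on the hardware and compiler, as noted above.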
                • 5 short questions
                  omgi

                  Thank you MicahVillmow and rick.weber.

                  If my local memory size is 16 kB and I use a 2D workgroup (256,32), I will still only have 16 kB at my disposal for all 256x32 elements, right? Not 32x16 kB or 256x16 kB?

                   

                  Originally posted by: MicahVillmow
                  omgi,
                  2) this doesn't matter, spilling only occurs when too many registers are required, which is not directly related to the number of private variables.
                  4) The compiler will generate index computation no matter what dimension the programmer specifies. Because different hardware generates indices differently, one method might be faster than another on a given piece of hardware. Use what is beneficial for your algorithm and let the compiler deal with the calculations.


                  I'm not sure how to phrase this question, but how am I supposed to "think" about my situation when using private variables, so that they do not spill over to global? What precautions should I take, and what limits should I keep to? Is it just a matter of calculating the total size of the private variables and making sure that it is not greater than my private memory size?

                    • 5 short questions
                      nou

                      The only reliable way to get register/private memory usage is to compile the kernel and look at the compiled code. You can get the register usage from the AMD kernel analyzer, or, when you dump the ISA code, you can find the number of registers at the end of it.