25 Replies Latest reply on Sep 3, 2012 4:17 AM by rocky67

    Strange Memory Access Performance Result

    registerme

      I have an OpenCL kernel that uses no local memory at all. Inside the kernel each work item copies about 128 bytes from global memory into registers (private variables), and the values are then accessed hundreds of times - this is only a very small amount of memory traffic compared to the kernel's much larger global/image memory access. Strangely, if I use local memory instead of registers, I see a performance boost of 33%. Every work item actually uses the same data from global memory, so sharing it through local memory is fine here. I also tried constant memory instead of global memory, without copying to registers or local memory, and the performance was not good.

       

      Can somebody explain why this could happen? Note that the register usage for each work item is very small; the code should have enough registers to hold these 128 bytes.

        • Re: Strange Memory Access Performance Result
          dmeiser

          One thing to note is that whenever you use a variable in a computation it will be brought into registers.  If you have data in local memory and perform some arithmetic operation on it, the data will first be brought into registers, and the result is written back to local memory after the computation is done.

           

          A possible explanation of the observed speedup is that you reduce register pressure in your kernel. Perhaps your kernel runs out of registers and spills some of them to global memory.  This register spilling may be prevented if you keep some of your data in local memory, hence the speedup.

           

          However, without further details (kernel code, launch configuration, and hardware info) this is just speculation.

           

          Cheers,

          Dominic

          • Re: Strange Memory Access Performance Result
            nathan1986

            Hi, registerme,

                Which GPU are you using? In my experience, many AMD GPUs allocate a large private buffer in global memory to save registers, so I suspect that is why you didn't see a speedup when using private variables. Also, how do you access the constant memory? If the indices you use are not known at compile time (e.g. random per work item), the constant data may still end up residing in global memory.

            • Re: Strange Memory Access Performance Result
              notzed

              Use sprofile to see what's going on, e.g. whether there is register spilling, how many registers the code actually needs, etc.

               

              It is probably register spilling (= slower execution) or a lower register usage count (= more parallelism available) affecting the performance.  It could even be related to the memory access patterns.  If these private variables are arrays they might not be registerised at all.

               

              Such a difference for different code is not terribly strange.

                • Re: Strange Memory Access Performance Result
                  registerme

                  The VGPRs=76 and SGPRs=36 when I am using local memory. When I change the 128 bytes of data to private memory, VGPRs=140 and SGPRs=54. Do you see register spilling here? I actually use very few private variables; it's hard to imagine anyone writing code with fewer. I am also using only 64 work items per work group, and this card has 256K of register memory for each workgroup, which should be sufficient for each work item.

                   

                  These private variables are arrays. Why would they not be registerised at all? Are they put into global memory?

                    • Re: Strange Memory Access Performance Result
                      nathan1986

                      On AMD cards a private array is quite likely to be placed in global memory. In my experience, I always get some speedup when I remove a private array and use another approach.

                      Constant memory is best accessed with the same index in every thread of a wavefront; you can look at the "constant bandwidth" sample in the AMD SDK - its worst case matches plain global-memory bandwidth. Check which access pattern you are using now.

                      • Re: Strange Memory Access Performance Result
                        dmeiser

                         Even if you're not "running out" of registers, increased register usage can lead to performance degradation because it can reduce the occupancy of the compute units.  Roughly speaking, if each thread uses more registers, fewer threads can run on a compute unit at any given time.  In general, GPUs run most efficiently when many threads can run on a compute unit simultaneously, because this can be used to hide latencies.
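                         As a toy illustration of the latency-hiding point (my own sketch; the cycle counts below are made-up for illustration, not taken from any AMD document):

```python
# Toy model: each wave alternates `compute` cycles of ALU work with a
# memory stall of `latency` cycles; with several waves resident, one
# wave's stall can be covered by the other waves' ALU work.
def alu_utilization(n_waves, compute=4, latency=400):
    return min(1.0, n_waves * compute / (compute + latency))

# Fewer resident waves (e.g. due to high register use) -> more idle ALU.
for n in (4, 8, 40):
    print(n, round(alu_utilization(n), 3))
```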

                          • Re: Strange Memory Access Performance Result
                            registerme

                             Is there a quantitative way to estimate how many registers can be used by each work item? The AMD documentation says each workgroup has 256K of private memory, and I only have 64 work items in the workgroup. I am using far less than the available private memory.

                             

                             I did see that the Number of Waves limited by VGPRs is 4 when VGPR=140, i.e. when I am running the non-local version, but I don't know how to interpret this. Is this 140 a number of bytes? But the device limit is 256, which looks like the unit here is kilobytes. In that case adding 128 bytes of private memory leads to 140-76=64K of private memory. This is something I don't understand.

                              • Re: Strange Memory Access Performance Result
                                dmeiser

                                Yes, there is a quantitative way to figure out how many registers can be used by each work item. However, there is another variable entering this calculation and that is the total number of wavefronts executing on a given compute unit.  Even if you have a workgroup size of 64 this doesn't mean that just 64 work items are being executed on a compute unit.  The hardware will in general schedule many workgroups to a given compute unit concurrently.

                                 

                                So, in general, you have to satisfy (only looking at VGPRs for simplicity)

                                 

                                NumWorkGroupsPerComputeUnit * WorkGroupSize * NumVGPRs * SizeOfVGPRs < RegisterMemoryPerComputeUnit

                                 

                                 There are several ways to read this. If you fix the number of work items per compute unit (this is NumWorkGroupsPerComputeUnit * WorkGroupSize), then this relation gives an upper bound on the number of VGPRs you can use per work item. If the number of VGPRs is fixed, you end up with an upper bound on the number of work items that can be scheduled on a compute unit.  This latter scenario is what I was referring to in my previous response.  If your kernel has increased VGPR usage, NumWorkGroupsPerComputeUnit will have to go down.

                                 

                                As you mentioned, on a 7970 RegisterMemoryPerComputeUnit is 256K and the size of a VGPR is 16B.
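                                 For concreteness, a quick sketch of this inequality with the numbers above (using the per-CU, 16B-per-VGPR accounting from this post; actual occupancy may be further limited by LDS usage and scheduler limits):

```python
# Solve the inequality above for the number of workgroups per CU,
# using the 7970 numbers quoted in this thread.
REG_MEM_PER_CU = 256 * 1024   # 256K register file per compute unit
VGPR_SIZE = 16                # bytes per VGPR in this accounting
WG_SIZE = 64                  # work items per workgroup

def max_workgroups_per_cu(num_vgprs):
    return REG_MEM_PER_CU // (WG_SIZE * num_vgprs * VGPR_SIZE)

print(max_workgroups_per_cu(140))  # the no-LDS kernel
print(max_workgroups_per_cu(76))   # the LDS kernel
```

So by this VGPR-only bound, the higher register count sharply cuts the number of workgroups that can be resident at once.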

                        • Re: Strange Memory Access Performance Result
                          registerme

                          It looks like it's very hard to get 40 active wavefronts on an HD 7000 series card. The restrictions on registers and LDS are really tight. I am using a really small number of registers and still easily end up at only 4 active wavefronts. My question then is: how big is the impact of fewer active wavefronts? 4 wavefronts might be too few, but would something like 8 or 12 be OK? The only reason I can see to have more wavefronts is to hide memory latency. Is there any other indicator I can use to see whether the number of active wavefronts is actually limiting performance?

                           

                          I still don't know whether my calculation is correct and whether my interpretation of the numbers is correct. I hope somebody who knows the AMD architecture can provide some input.

                            • Re: Strange Memory Access Performance Result
                              MicahVillmow

                              Are you sure you are not confusing active wavefronts per compute unit versus active wavefronts on the device? Usually you want 4 wavefronts per compute unit, but the number of compute units can be variable depending on your card.

                                • Re: Strange Memory Access Performance Result
                                  registerme

                                  I am talking about active wavefronts on the compute unit, not the device. Looking at the kernel occupancy in the profiler, I don't understand why removing the private variable int2 priv_var[16] changes the VGPRs from 140 to 76. And this 140 makes the active wavefronts 4; I am not sure how that is calculated.

                                    • Re: Strange Memory Access Performance Result
                                      LeeHowes

                                      On a GCN-based GPU each VGPR is 32-bits per lane, 256 bytes per wave. An int2 takes two of them. An array of 16 int2s takes 32, assuming it's statically indexed. You appear to be seeing exactly twice that... maybe the compiler is being inefficient and doubling up, somewhere (or maybe my brain isn't working right at this time of night).

                                       

                                       Let's say you have 140 VGPRs, though. There are 256 rows in the register file - 256 registers per work item per SIMD unit. 140 is more than half of that, so you can only have one wave per SIMD unit (maximum 50% utilisation there, I think, because the SIMD unit still runs two waves simultaneously... may be wrong on that). The GCN core has four SIMD units that use different banks of the register file and also have 256 registers each. You can therefore have a wave on each SIMD unit, 4 waves per core/CU.
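                                       A quick check of that arithmetic (my sketch; this is the VGPR-only bound and ignores LDS and other scheduler limits that can reduce occupancy further):

```python
REGS_PER_LANE = 256    # 256 register-file rows per SIMD lane
SIMDS_PER_CU = 4       # four SIMD units per GCN compute unit

def vgprs_for_int2_array(n):
    # each int2 occupies two 32-bit VGPRs, assuming static indexing
    return 2 * n

def waves_per_cu(vgprs):
    # VGPR-only bound on resident waves per compute unit
    return (REGS_PER_LANE // vgprs) * SIMDS_PER_CU

print(vgprs_for_int2_array(16))  # expected cost of int2 priv_var[16]
print(waves_per_cu(140))         # matches the profiler's 4 waves
```

Note the expected 32 VGPRs for the array is exactly half of the 64-VGPR delta (140-76) the profiler reports, which is the doubling puzzled over above.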

                                        • Re: Strange Memory Access Performance Result
                                          jeff_golds

                                          LeeHowes wrote:

                                           

                                          On a GCN-based GPU each VGPR is 32-bits per lane, 256 bytes per wave. An int2 takes two of them. An array of 16 int2s takes 32, assuming it's statically indexed. You appear to be seeing exactly twice that... maybe the compiler is being inefficient and doubling up, somewhere (or maybe my brain isn't working right at this time of night).

                                           

                                           Let's say you have 140 VGPRs, though. There are 256 rows in the register file - 256 registers per work item per SIMD unit. 140 is more than half of that, so you can only have one wave per SIMD unit (maximum 50% utilisation there, I think, because the SIMD unit still runs two waves simultaneously... may be wrong on that). The GCN core has four SIMD units that use different banks of the register file and also have 256 registers each. You can therefore have a wave on each SIMD unit, 4 waves per core/CU.

                                           

                                          GCN-based GPUs do not run two waves simultaneously per SIMD as happened on EG/NI-based GPUs.  Please see the AMD APP Programming Guide as I have described the correct behavior there.  With GCN, typical instruction latency is 4 clocks, not 8 as on previous generations.

                                    • Re: Strange Memory Access Performance Result
                                      notzed

                                       Yes, it is basically impossible to exceed the available hardware limit on the number of concurrent threads (wavefronts) with any non-trivial kernel - but that's a lot better than hitting the limit all the time.  So that 40 is just a hard physical limit; it doesn't mean it's achievable or even desirable (more threads means more cache pressure, for instance, and if that starts thrashing you're going to lose a lot).

                                       

                                      And yes, it's only about the memory latency - and whether that matters much is very algorithm dependent.  One would normally expect something using a lot of registers not to be memory bound, and so the trade-off of lower concurrency is probably still a win.

                                       

                                       But still, for a given algorithm, being able to go from 1 wavefront at a time to 2 through a reduction in register load, as in your case, a 30% performance difference would not be out of the question. (This is of course what I meant earlier when I said lower register usage = higher parallelism.)

                                       

                                       Some of the other sprofile columns give an indication of bottlenecks from memory and ALU and so on, e.g. cache hit ratio, fetch unit busy/stalled, ALU utilisation, etc.  Although I'd like to know if there is better documentation on what each means, and how they relate to a multi-CU/wavefront kernel.

                                    • Re: Strange Memory Access Performance Result
                                      drallan

                                       If your program is ALU bound, I think the time difference you see may be because the 7970 has higher ALU throughput at 8 waves/CU than at 4 waves/CU - not because of latency. With 8 waves, the 7970 can issue certain instructions from different waves in the same timeslot, the most common example being scalar and vector instructions. This doesn't occur with 4 waves because 4 waves just fill the 4 SIMDs (waves are issued over 4 clocks) and multiple waves are not available in the same timeslot, or something like that. Multiple waves also reduce bottlenecks from scheduling too many long instructions in the instruction stream. (Many 7970 instructions have both long and short formats (64 and 32 bit).)

                                       

                                       As mentioned already, the no-LDS kernel uses 140 of the 256 VGPRs available, so only 4 waves can run on a CU. The LDS kernel uses only 76 VGPRs, so 8 waves can run on a CU and it can issue instructions more efficiently.

                                       

                                       You can test this by limiting your job size to a total of 32*4 wavefronts and running the LDS kernel, which will then be restricted to 4 waves/CU. It should run at about the speed of the no-LDS kernel.
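                                       Back-of-envelope for that experiment (assuming the card is a full 32-CU HD 7970; the CU count varies by model, as MicahVillmow noted):

```python
NUM_CUS = 32        # assumed compute-unit count for a full HD 7970
WAVE_SIZE = 64      # work items per wavefront
WAVES_PER_CU = 4    # the occupancy cap we want to force

# An NDRange of this size gives each CU at most 4 waves to run.
global_size = NUM_CUS * WAVES_PER_CU * WAVE_SIZE
print(global_size)
```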

                                       

                                       The small table below shows some typical speedups from multiple instruction issue. The timing program executes a large block of identical groups of 4 instructions, using different 4-instruction patterns and measuring their execution time via the 7970's 64-bit timer. Times are in clocks per instruction; the first column is the 'real' time to issue the instructions in a single thread. The other times are averaged over all threads and measure throughput. The instruction pattern is listed on the right.

                                       

                                      Instruction execution times(all time in clocks/instruction)

                                         <-Real-><-average of all threads-->

                                            +------------------------------ single thread execution time

                                            |       +---------------------- average time,  4 wavefronts/CU

                                            |       |     +---------------- average time,  8 wavefronts/CU

                                            |       |     |     +---------- average time, 12 wavefronts/CU

                                            |       |     |     |     +---- average time, 16 wavefronts/CU

                                            A       B     C     D     E

                                            |       |     |     |     |           PATTERN of 4 instructions

                                      1)   4.0  -  1.00  1.00  1.00  1.00   (4x)v_add_f32

                                      2)   4.0  -  1.00  0.51  0.65  0.50   (2x)v_add_f32, (2x) s_xor_b32

                                      3)   7.0  -  1.73  1.14  1.15  1.07   (4x)v_max3_u32 [64 bit inst word]

                                      4)  16.0  -  4.00  4.00  4.00  4.00   (4x)v_sin_f32  (float)

                                      5)  16.0  -  4.00  4.00  4.00  4.00   (4x)v_mul_f64  (double)

                                      6)  10.0  -  2.50  2.17  2.12  2.08   (4x)v_add_f64  (double)

                                      7)  10.0  -  4.30  4.11  4.05  4.02   (4x) ds_write_b32 write to LDS

                                      8)  11.0  -  2.75  1.38  1.90  1.44   (2x)v_add_f32, s_xor_b32, ds_write_b32