
registerme
Journeyman III

Strange Memory Access Performance Result

I have an OpenCL kernel that uses no local memory at all. Inside the kernel each work item copies about 128 bytes from global memory to registers (private variables), and the values are then accessed hundreds of times - this is only a very small amount of memory traffic compared to the kernel's much larger volume of global/image memory access. Strangely, if I use local memory instead of registers, I see a performance boost of 33%. Each work item actually uses the same data from global memory, so sharing it through local memory is fine here. I also tried using constant memory instead of global memory, without copying to registers or local memory, and the performance was not good.

Can somebody explain why this happens? Note that the register usage of each work item is very small; the code should have enough registers to hold these 128 bytes.
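For reference, here is a minimal sketch of the two variants being compared (the names, types and the 16-element size are illustrative only, not the actual kernel):

// Variant A: each work item copies ~128 bytes (16 x int2) into private memory.
__kernel void variant_private(__global const int2 *table, __global float *out)
{
    int2 priv[16];                      // private copy, ideally held in VGPRs
    for (int i = 0; i < 16; ++i)
        priv[i] = table[i];             // same data for every work item
    // ... hundreds of reads of priv[...] follow ...
}

// Variant B: one copy per work group, shared through local memory.
__kernel void variant_local(__global const int2 *table, __global float *out)
{
    __local int2 shared_tab[16];
    int lid = get_local_id(0);
    if (lid < 16)
        shared_tab[lid] = table[lid];   // cooperative copy by 16 work items
    barrier(CLK_LOCAL_MEM_FENCE);
    // ... hundreds of reads of shared_tab[...] follow ...
}

Variant B is the one that runs about 33% faster on my card.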

0 Likes
25 Replies
dmeiser
Elite

One thing to note is that whenever you use a variable in a computation it will be brought into registers. If you have data in local memory and perform some arithmetic operation on it, the data is first brought into registers, and the result is written back to local memory after the computation is done.

A possible explanation of the observed speedup is that you reduce register pressure in your kernel. Perhaps your kernel runs out of registers and spills some of them to global memory.  This register spilling may be prevented if you keep some of your data in local memory, hence the speedup.

However, without further details (kernel code, launch configuration, and hardware info) this is just speculation.

Cheers,

Dominic

0 Likes
nathan1986
Adept II

Hi, registerme,

    Which GPU are you using? In my experience, many AMD GPUs allocate a private buffer in global memory when it is large, to save registers, so I suspect that is why you didn't see a speedup when using the private buffer. Also, how do you access the constant memory? If the indices are not known at compile time (i.e. they are effectively random), the constant data still resides in global memory; this is decided at compile time.
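Roughly what is meant by the access pattern (an illustrative sketch, not taken from the kernel under discussion):

__kernel void const_access(__constant int *coeff, __global int *out)
{
    int gid = get_global_id(0);

    // Uniform access: every work item reads the same element in the same
    // iteration, so the fast constant path can be used.
    int a = 0;
    for (int i = 0; i < 16; ++i)
        a += coeff[i];

    // Divergent access: the index depends on the work item, so the fetch
    // tends to behave like an ordinary global memory read.
    int b = coeff[gid % 16];

    out[gid] = a + b;
}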

0 Likes

It's an HD 7970. I use a minimal number of private variables, so they are unlikely to spill.

Access to the constant memory is sequential and should be in the same order across work items. How did you see that the constant memory is in global memory?

0 Likes
notzed
Challenger

Use sprofile to see what's going on, e.g. whether there is register spillage, how many registers the code actually needs, etc.

It is probably register spillage (=slower execution) or a lower register usage count (=higher parallelism available) affecting the performance.  Could even be related to the memory access patterns.  If these private variables are arrays they might not be registerised at all.

Such a difference for different code is not terribly strange.

0 Likes

The VGPRs=76 and SGPRs=36 when I am using local memory. When I move the 128 bytes of data to private memory, VGPRs=140 and SGPRs=54. Do you see register spillage here? I actually use very few private variables; it's hard to imagine writing this code with fewer. I am also using only 64 work items per work group, and this card has 256K of register memory for each workgroup, which should be sufficient for each work item.

These private variables are arrays. Why would they not be registerised at all? Are they put into global memory?

0 Likes

     A private array is quite likely to end up in global memory on AMD video cards. In my experience, I always get some speedup when I remove a private array and use another approach instead.

   Constant memory is best accessed with the same index by every thread in a wavefront. You can look at the "constant bandwidth" sample in the AMD SDK; its lowest bandwidth is the same as global bandwidth. You can check which access pattern you are using now.

0 Likes

If there is enough private memory, why would AMD move it to global memory?

0 Likes

"private memory" just means registers.  One reason it is moved to global memory is hardware usually doesn't have a way of indexing registers at run-time.

e.g. if you use a "private" array where the indexing is not known at compile time, then that private array has to be stored in global memory because the cpu core simply has no way of indexing registers.  (assuming you don't have some exotic cpu that allows this).

In short, arrays can only be stored completely in registers if the array indices are known at compile time.

And I think register spillage is indicated by the sprofile column 'ScratchRegs'.
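To illustrate the compile-time-indexing point above (a made-up fragment, not the original kernel):

__kernel void indexing_example(__global const int *in, __global int *out)
{
    int gid = get_global_id(0);

    int priv[8];
    for (int i = 0; i < 8; ++i)         // fully unrollable loop: every index is
        priv[i] = in[i];                // a compile-time constant, so priv[]
                                        // can be kept entirely in registers

    int j = in[gid + 8] & 7;            // run-time index: the compiler cannot
    out[gid] = priv[j];                 // address registers dynamically, so
                                        // priv[] is likely demoted to scratch,
                                        // which shows up as ScratchRegs in the
                                        // profiler and costs global memory traffic
}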

0 Likes

I have checked the constant bandwidth sample in the AMD SDK; the low end of its bandwidth is indeed in line with global bandwidth.

0 Likes

Even if you're not "running out" of registers, increased register usage can degrade performance because it reduces the occupancy of the compute units. Roughly speaking, if each thread uses more registers, fewer threads can run on a compute unit at any given time. In general, GPUs run most efficiently when many threads can run on a compute unit simultaneously, because this can be used to hide latencies.

0 Likes

Is there a quantitative way to estimate how many registers each work item can use? The AMD documentation says each workgroup has 256K of private memory, and I only have 64 work items in the workgroup. I am using far less than the available private memory.

I did see that the number of waves limited by VGPRs is 4 when VGPRs=140, i.e., when I am running the non-local version, but I don't know how to interpret this. Is this 140 a number of bytes? But the device limit is 256, which makes it look like the unit is kilobytes. In that case adding 128 bytes of private memory leads to 140-76=64K of private memory. This is something I don't understand.

0 Likes

Yes, there is a quantitative way to figure out how many registers can be used by each work item. However, there is another variable entering this calculation and that is the total number of wavefronts executing on a given compute unit.  Even if you have a workgroup size of 64 this doesn't mean that just 64 work items are being executed on a compute unit.  The hardware will in general schedule many workgroups to a given compute unit concurrently.

So, in general, you have to satisfy (only looking at VGPRs for simplicity)

NumWorkGroupsPerComputeUnit * WorkGroupSize * NumVGPRs * SizeOfVGPRs < RegisterMemoryPerComputeUnit

There are several ways to read this. If you fix the number of work items per compute unit (this is NumWorkGroupsPerComputeUnit * WorkGroupSize), this relation gives an upper bound on the number of VGPRs you can use per work item. If the number of VGPRs is fixed, you end up with an upper bound on the number of work items that can be scheduled on a compute unit. This latter scenario is what I was referring to in my previous response: if your kernel has an increased VGPR usage, NumWorkGroupsPerComputeUnit has to go down.

As you mentioned, on a 7970 RegisterMemoryPerComputeUnit is 256K and the size of a VGPR is 16B.

0 Likes

In this case what does 140 mean? How can you know NumVGPRs * SizeOfVGPRs?

0 Likes

What program reported the 140? I'd suspect that it's the number of VGPRs per work item. So, to get the amount of register memory per work item (which is NumVGPRs * SizeOfVGPRs) you'd multiply by 16B.

0 Likes

I am using the app profiler.

NumWorkGroupsPerComputeUnit * WorkGroupSize * NumVGPRs * SizeOfVGPRs < RegisterMemoryPerComputeUnit

NumWorkGroupsPerComputeUnit * 64 * 140 * 16 < 256K

then NumWorkGroupsPerComputeUnit = 1

Does not make sense.

0 Likes

This web site has some info on how to compute occupancy of the compute units for a given resource usage:

http://developer.amd.com/tools/AMDAPPProfiler/html/clkerneloccupancydescriptionpage.html

0 Likes

Thanks for the link; it does explain things in a bit more detail, but I still cannot map the numbers I have here to what's stated in that document.

maximum number of wavefronts per SIMD limited by VGPRs: WFmax,vgpr = VGPRmax / VGPRused

where VGPRmax is the maximum number of registers per work item and VGPRused is the actual number of registers used per work item.

Scaled to per CU:

number of wavefronts per CU limited by VGPRs = WFmax,vgpr * SIMDperCU

Can you tell me what the numbers would be for my card? Is SIMDperCU=4? VGPRused is presumably 140?

I think the number of wavefronts per workgroup is 1, as I used 64 as the workgroup size. So the resulting number of wavefronts limited by VGPRs should equal the per-CU figure above, and per the profiler it's 4. Can you explain how to get this number?

I don't know why this thread is marked as "assumed answered"; it really isn't.

0 Likes

Yes, on a 7970 the number of SIMDs per CU is 4. The wavefront size is 64, so yes, you have one wavefront per workgroup. What the occupancy page I linked above seems to be saying is that you compute the number of wavefronts the way you did above (which came out to 1) and then multiply that by the number of SIMD units per CU. In your example this comes out to 4. I don't understand the architecture of the 7970 well enough to tell you why you need to multiply by the number of SIMDs.
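A back-of-the-envelope version of that calculation, using only the limits quoted in this thread (256 VGPRs per work item per SIMD, 4 SIMDs per CU) and ignoring every other limit (SGPRs, LDS, wave caps), so treat it as a sketch rather than the profiler's exact formula:

/* VGPR-only occupancy estimate for the numbers in this thread. */
#include <stdio.h>

int main(void)
{
    const int vgpr_max    = 256;  /* VGPRs available per work item per SIMD */
    const int simd_per_cu = 4;    /* SIMD units per CU on the 7970          */
    const int vgpr_used   = 140;  /* profiler value for the no-LDS kernel   */

    int waves_per_simd = vgpr_max / vgpr_used;         /* 256 / 140 = 1 */
    int waves_per_cu   = waves_per_simd * simd_per_cu; /* 1 * 4 = 4     */

    printf("waves/SIMD = %d, waves/CU = %d\n", waves_per_simd, waves_per_cu);

    /* With vgpr_used = 76 the VGPR limit alone would allow 256 / 76 = 3
     * waves/SIMD (12 per CU); other resources such as LDS can lower the
     * number that actually runs. */
    return 0;
}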

registerme
Journeyman III

It looks like it's very hard to get 40 active wavefronts on an HD 7000 series card. The restrictions from registers and LDS are really tight. I am using a really small number of registers and still easily end up with only 4 active wavefronts. My question then is: how big is the impact of having fewer active wavefronts? 4 wavefronts might be too few, but would something like 8 or 12 be OK? The only reason to have more wavefronts, as far as I can see, is to hide memory latency. Is there any other indicator I can use to see whether the number of active wavefronts is actually limiting the performance?

I still don't know whether my calculation is correct and whether my interpretation of the numbers is correct. I hope somebody who knows the AMD architecture can offer some input.

0 Likes

Are you sure you are not confusing active wavefronts per compute unit versus active wavefronts on the device? Usually you want 4 wavefronts per compute unit, but the number of compute units can be variable depending on your card.

0 Likes

I am talking about active wavefronts per compute unit, not on the device. Looking at the kernel occupancy in the profiler, I don't understand why removing the private variable int2 priv_var[16] changes the VGPRs from 140 to 76. And this 140 makes the active wavefronts 4; I'm not sure how that is calculated.

0 Likes

On a GCN-based GPU each VGPR is 32-bits per lane, 256 bytes per wave. An int2 takes two of them. An array of 16 int2s takes 32, assuming it's statically indexed. You appear to be seeing exactly twice that... maybe the compiler is being inefficient and doubling up, somewhere (or maybe my brain isn't working right at this time of night).

Let's say you have 140GPRs, though. There are 256 rows in the register file - 256 registers per work item per SIMD unit. 140 is more than half of that, so you can only have one wave per SIMD unit (maximum 50% utilisation there, I think, because the SIMD unit still runs two waves simultaneously... may be wrong on that). The GCN core has four SIMD units that use different banks of the register file and also have 256 registers each. You can therefore have a wave on each SIMD unit, 4 waves per core/CU.

0 Likes

LeeHowes wrote:

On a GCN-based GPU each VGPR is 32-bits per lane, 256 bytes per wave. An int2 takes two of them. An array of 16 int2s takes 32, assuming it's statically indexed. You appear to be seeing exactly twice that... maybe the compiler is being inefficient and doubling up, somewhere (or maybe my brain isn't working right at this time of night).

Let's say you have 140GPRs, though. There are 256 rows in the register file - 256 registers per work item per SIMD unit. 140 is more than half of that, so you can only have one wave per SIMD unit (maximum 50% utilisation there, I think, because the SIMD unit still runs two waves simultaneously... may be wrong on that). The GCN core has four SIMD units that use different banks of the register file and also have 256 registers each. You can therefore have a wave on each SIMD unit, 4 waves per core/CU.

GCN-based GPUs do not run two waves simultaneously per SIMD as happened on EG/NI-based GPUs.  Please see the AMD APP Programming Guide as I have described the correct behavior there.  With GCN, typical instruction latency is 4 clocks, not 8 as on previous generations.

0 Likes

Yes, with any non-trivial kernel it is basically impossible to reach the hardware limit on the number of concurrent threads (wavefronts) - but that's a lot better than hitting the limit all the time. So that 40 is just a hard physical limit; it doesn't mean it's achievable or even desirable (more threads mean more cache pressure, for instance, and if the cache starts thrashing you're going to lose a lot).

And yes, it's only about the memory latency - and whether that matters much is very algorithm dependent.  One would normally expect something using a lot of registers not to be memory bound, and so the trade-off of lower concurrency is probably still a win.

But still, for a given algorithm, if a reduction in register load lets you go from running 1 wavefront at a time to 2, as in your case, a 30% performance difference would not be out of the question. (This is of course what I meant earlier when I said lower register usage = higher parallelism.)

Some of the other sprofile columns give an indication of bottlenecks from memory, ALU and so on, e.g. cache hit ratio, fetch unit busy/stalled, ALU utilisation, etc. Although I'd like to know whether there is better documentation on what each one means and how they relate to a multi-CU/wavefront kernel.

drallan
Challenger

If your program is ALU bound, I think the time difference you see may be because the 7970 has higher ALU throughput with 8 waves/CU than with 4 waves/CU, rather than latency. With 8 waves, the 7970 can multi-issue certain instructions from different waves in the same timeslot, the most common example being scalar and vector instructions. This doesn't occur with 4 waves because 4 waves just fill the 4 SIMDs (waves are issued over 4 clocks) and multiple waves are not available in the same timeslot, or something like that. Multiple waves also reduce bottlenecks when scheduling too many long instructions from the instruction stream. (Many 7970 instructions have both long and short formats, 64 and 32 bit.)

As mentioned already, the no-LDS kernel uses 140 of the 256 VGPRs available, so only 4 waves can run on a CU. The LDS kernel uses only 76 VGPRs, so 8 waves can run on a CU and it can issue instructions more efficiently.

You can test this by limiting your job size to a total of 32*4 wavefronts and running the LDS kernel, which will then be restricted to about 4 waves/CU. It should run at about the speed of the no-LDS kernel.
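On the host side that could look something like the following (a sketch: the queue and kernel handles are assumed to come from elsewhere, error checking is omitted, and the 32 comes from Tahiti's 32 compute units):

#include <CL/cl.h>

/* Launch the LDS kernel with only 32 CUs * 4 waves * 64 = 8192 work items,
 * keeping the same 64-work-item workgroups as before. */
cl_int launch_limited(cl_command_queue queue, cl_kernel lds_kernel)
{
    size_t local_size  = 64;
    size_t global_size = 32 * 4 * 64;   /* 8192 work items in total */

    return clEnqueueNDRangeKernel(queue, lds_kernel, 1, NULL,
                                  &global_size, &local_size,
                                  0, NULL, NULL);
}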

The small table below shows some typical speed-ups from multiple instruction issue. The timing program executes a large block of identical groups of 4 instructions, using different 4-instruction patterns, and measures their execution time via the 7970's 64-bit timer. Times are in clocks per instruction; the first column is the "real" time to issue the instructions in a single thread. The other columns are averaged over all threads and measure throughput. The instruction pattern is listed on the right.

Instruction execution times (all times in clocks/instruction). The first column is the single-thread execution time; the remaining columns are the average over all threads at 4, 8, 12 and 16 wavefronts/CU.

     Single  4wf/CU  8wf/CU  12wf/CU  16wf/CU   Pattern of 4 instructions
1)     4.0    1.00    1.00     1.00     1.00    (4x) v_add_f32
2)     4.0    1.00    0.51     0.65     0.50    (2x) v_add_f32, (2x) s_xor_b32
3)     7.0    1.73    1.14     1.15     1.07    (4x) v_max3_u32 [64-bit inst word]
4)    16.0    4.00    4.00     4.00     4.00    (4x) v_sin_f32 (float)
5)    16.0    4.00    4.00     4.00     4.00    (4x) v_mul_f64 (double)
6)    10.0    2.50    2.17     2.12     2.08    (4x) v_add_f64 (double)
7)    10.0    4.30    4.11     4.05     4.02    (4x) ds_write_b32 (write to LDS)
8)    11.0    2.75    1.38     1.90     1.44    (2x) v_add_f32, s_xor_b32, ds_write_b32