dmeiser
Elite

Re: Strange Memory Access Performance Result

Yes, there is a quantitative way to figure out how many registers each work item can use. However, there is another variable entering this calculation: the total number of wavefronts executing on a given compute unit. Even if you have a workgroup size of 64, this doesn't mean that only 64 work items are executing on a compute unit. The hardware will in general schedule many workgroups onto a given compute unit concurrently.

So, in general, you have to satisfy (only looking at VGPRs for simplicity)

NumWorkGroupsPerComputeUnit * WorkGroupSize * NumVGPRs * SizeOfVGPRs < RegisterMemoryPerComputeUnit

There are several ways to read this. If you fix the number of work items per compute unit (this is NumWorkGroupsPerComputeUnit * WorkGroupSize), then this relation gives an upper bound on the number of VGPRs you can use per work item. If the number of VGPRs is fixed, you end up with an upper bound on the number of work items that can be scheduled on a compute unit. This latter scenario is what I was referring to in my previous response: if your kernel's VGPR usage goes up, NumWorkGroupsPerComputeUnit will have to go down.

As you mentioned, on a 7970 RegisterMemoryPerComputeUnit is 256K and the size of a VGPR is 16B.
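As a concrete sketch of that inequality (a back-of-the-envelope calculation only, using the 7970 numbers above and taking SizeOfVGPRs = 16B as stated; the function name is mine, not anything from the profiler):

```python
def max_workgroups_per_cu(workgroup_size, vgprs_per_work_item,
                          vgpr_size_bytes=16,
                          register_mem_per_cu=256 * 1024):
    """Upper bound on NumWorkGroupsPerComputeUnit from VGPR usage alone."""
    bytes_per_workgroup = workgroup_size * vgprs_per_work_item * vgpr_size_bytes
    return register_mem_per_cu // bytes_per_workgroup

# With a workgroup size of 64 and 140 VGPRs per work item:
print(max_workgroups_per_cu(64, 140))  # -> 1
```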

registerme
Journeyman III

Re: Strange Memory Access Performance Result

In this case what does 140 mean? How can you know NumVGPRs * SizeOfVGPRs?

dmeiser
Elite

Re: Strange Memory Access Performance Result

What program reported the 140? I'd suspect that it's the number of VGPRs per work item. So, to get the amount of register memory per work item (which is NumVGPRs * SizeOfVGPRs) you'd multiply by 16B.

registerme
Journeyman III

Re: Strange Memory Access Performance Result

I am using the APP profiler.

NumWorkGroupsPerComputeUnit * WorkGroupSize * NumVGPRs * SizeOfVGPRs < RegisterMemoryPerComputeUnit

NumWorkGroupsPerComputeUnit * 64 * 140 * 16 < 256K

then NumWorkGroupsPerComputeUnit = 1

Does not make sense.

dmeiser
Elite

Re: Strange Memory Access Performance Result

This web site has some info on how to compute occupancy of the compute units for a given resource usage:

http://developer.amd.com/tools/AMDAPPProfiler/html/clkerneloccupancydescriptionpage.html

registerme
Journeyman III

Re: Strange Memory Access Performance Result

Thanks for the link; it did explain things in a bit more detail, but I still cannot map the numbers I have here to what's stated in that document.

number of wavefronts per SIMD = WFmax,vgpr = VGPRmax / VGPRused

where VGPRmax is the maximum number of registers per work item and VGPRused is the actual number of registers used per work item.

Scaled to the whole CU:

number of wavefronts per CU = WFmax,vgpr * SIMDperCU

Can you tell me what the numbers would be for my card? Is SIMDperCU = 4? Is VGPRused the 140 reported by the profiler?

I think the number of wavefronts per workgroup is 1 since I used 64 as the workgroup size. So the resulting number of wavefronts limited by VGPRs should equal WFmax,vgpr, and per the profiler it's 4. Can you explain how to get that number?

I don't know why this thread is marked as "assumed answered"; it really isn't.

dmeiser
Elite

Re: Strange Memory Access Performance Result

Yes, on a 7970 the number of SIMDs per CU is 4. The wavefront size is 64, so yes, you have one wavefront per workgroup. What the occupancy calculation page I linked above seems to be saying is that you compute the number of wavefronts the way you did (which resulted in 1 wavefront per SIMD) and then multiply by the number of SIMD units per CU. In your example this comes out to 4. I don't understand the architecture of the 7970 well enough to tell you why you need to multiply by the number of SIMDs.
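To make that concrete, here's a small sketch of that calculation (my own, assuming VGPRmax = 256 registers per work item, which is what makes 256 / 140 floor to 1 wavefront per SIMD):

```python
def wavefronts_limited_by_vgprs(vgpr_used, vgpr_max=256, simd_per_cu=4):
    wf_per_simd = vgpr_max // vgpr_used   # WFmax,vgpr, floored
    return wf_per_simd * simd_per_cu      # scaled to the whole CU

print(wavefronts_limited_by_vgprs(140))  # -> 4, matching the profiler
print(wavefronts_limited_by_vgprs(76))   # -> 12
```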

registerme
Journeyman III

Re: Strange Memory Access Performance Result

It looks like it's very hard to get 40 active wavefronts on an HD 7000 series card. The restrictions on registers and LDS are really tight. I am using a really small number of registers, and it can easily drop to only 4 active wavefronts. My question, then, is how big the impact of fewer active wavefronts is. 4 wavefronts might be too few, but would something like 8 or 12 be OK? The only reason I can see for having more wavefronts is to hide memory latency. Is there any other indicator I can use to see whether the number of active wavefronts is actually limiting performance?

I still don't know whether my calculation is correct and whether my interpretation of the numbers is correct. I hope somebody who knows the AMD architecture can offer some input.

MicahVillmow
Staff

Re: Strange Memory Access Performance Result

Are you sure you are not confusing active wavefronts per compute unit with active wavefronts on the device? Usually you want 4 wavefronts per compute unit, but the number of compute units can vary depending on your card.

registerme
Journeyman III

Re: Strange Memory Access Performance Result

I am talking about active wavefronts on a compute unit, not on the device. Looking at the kernel occupancy in the profiler, I don't understand why removing the private variable int2 priv_var[16] changes the VGPR count from 140 to 76. And that 140 is what makes the active wavefronts 4; I'm not sure how it's calculated.
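For what it's worth, a quick back-of-the-envelope check (my own arithmetic, not anything the profiler reports): an int2 is two 32-bit values, and a VGPR holds one 32-bit value per work item, so the array's data alone would account for 32 registers, while the observed drop is 64:

```python
elements = 16                               # priv_var[16]
ints_per_int2 = 2                           # int2 = two 32-bit values
vgprs_for_data = elements * ints_per_int2   # 32 registers of raw data
observed_drop = 140 - 76                    # 64 registers

print(vgprs_for_data, observed_drop)  # -> 32 64
```

If that arithmetic is right, the drop is larger than the array's contents, so the compiler is presumably also spending registers on addressing or copies around the private array, not just on its data.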
