Questions about ACML-GPU - SGEMM Optimization Illustration

The following questions are about this lecture:

1. On page 7, how do I calculate the Theoretical Peak (GFlops) of my kernel?

2. Also on page 7, look at the perf workbook:

Perf Workbook(1k*1k):

  • Total Pixels: 262144
  • Total ALU: 4642
  • Total Tex: 2048
  • Total Out: 4
  • ALU Time: 25.3515 ms
  • Tex Time: 44.7392 ms
  • 0% Bandwidth: 116.9390 ms
  • 75% Bandwidth: 52.0302 ms
  • 0% cache hit rate – Bandwidth: Theoretical Peak 40.7193 Gflops
  • 75% cache hit rate – Tex: Theoretical Peak 140.0146 Gflops
1) How are the ALU time and Tex time (the underlined values) calculated?
2) What do "0% Bandwidth" and "75% Bandwidth" mean?
3) How are they calculated?
4) How is the Theoretical Peak obtained? (Same as question 1.)

3. On page 15, how is the active wavefront count calculated?

IMO, active wavefront count (per SIMD) = floor(256 / GPRs per thread), right?

But according to the lecture, it seems not:

In CAL if register count > 12, 160 physical registers are made available
Wavefront count = floor(Available / Used)
Kernel 1 gets 6 wavefronts
Kernel 2 gets 12 wavefronts
What is the difference?
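
To make the two candidate formulas concrete, here is a minimal Python sketch of the slide's "Wavefront count = floor(Available / Used)". The per-thread register counts (13 and 25) are hypothetical; the thread never states them. They were picked only because they reproduce the 12 and 6 wavefronts quoted above when 160 registers are available.

```python
def wavefront_count(available_gprs: int, gprs_per_thread: int) -> int:
    """Wavefront count = floor(Available / Used), as quoted from the slide."""
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    return available_gprs // gprs_per_thread


# Hypothetical register usage per thread (NOT given in the thread).
for used in (13, 25):
    with_160 = wavefront_count(160, used)  # pool size quoted from the slide
    with_256 = wavefront_count(256, used)  # pool size assumed in the question
    print(f"{used} GPRs/thread: {with_160} wavefronts (160 pool), "
          f"{with_256} wavefronts (256 pool)")
```

The only difference between the two formulas is the size of the register pool in the numerator.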

4. Page 16 states:

Assuming 100% cache hit rate

8 Fetches * 4 Cycles/Fetch + 120 TEX/L1 = 152 cycles to hide (Why 120 here? With a 100% cache hit rate, only the L1 cache cycles should be included.)
Kernel 1:
18 ALU * 6 Wavefronts = 108 cycles (Why not 6 - 1 = 5 here? One wavefront is stalled, so only 5 wavefronts are left to hide the latency.)
Kernel 2:
19 ALU * 12 Wavefronts = 228 cycles 
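
The slide arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the quoted numbers; the function names are made up for illustration, and comparing the two totals directly is an assumption about how the slide is meant to be read.

```python
def cycles_to_hide(fetches: int, cycles_per_fetch: int, tex_l1_cycles: int) -> int:
    """Fetch issue cycles plus the TEX/L1 cycles, as on the slide."""
    return fetches * cycles_per_fetch + tex_l1_cycles


def alu_cycles(alu_ops: int, wavefronts: int) -> int:
    """ALU cycles summed over all resident wavefronts."""
    return alu_ops * wavefronts


to_hide = cycles_to_hide(fetches=8, cycles_per_fetch=4, tex_l1_cycles=120)

# (name, ALU count, wavefront count) exactly as quoted from the slide.
for name, alu, wavefronts in (("Kernel 1", 18, 6), ("Kernel 2", 19, 12)):
    available = alu_cycles(alu, wavefronts)
    verdict = "covers" if available >= to_hide else "does not cover"
    print(f"{name}: {available} ALU cycles {verdict} {to_hide} cycles")
```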
5. On page 17, why does A hit 4 cache lines and B^T 4 cache lines, while B hits only 2?
Thank you.


4 Replies

Those slides refer to pixel shaders on RV670 and do not map exactly to the compute shader programming model.

1) Timing information can be found here:
2) These figures are assumptions: they are what I would get at 0% and at 75% of peak bandwidth (i.e. 0% cache hits and 75% cache hits).

4) 120 cycles is the number of cycles it takes to process a Tex instruction/L1 cache hit on an RV670 chip. I want to calculate the total amount of time it takes to execute all wavefronts, not the amount of latency that needs to be covered up.
5) If you map out all the data reads for a single wavefront, A and B^T both hit 4 cache lines and B hits 2 cache lines. This is because it is a 4x8 versus an 8x4.
6) This animation shows how data fits into the cache and how it overflows the cache because too many wavefronts are bringing in data at once.
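
For readers without the timing document, here is a hedged Python sketch of one way the workbook's ALU and Tex times could be reproduced. The unit counts (64 ALU units, 16 texture units) and the 750 MHz clock are NOT stated anywhere in this thread; they are assumptions chosen because they reproduce the workbook figures, so treat the formula as a guess rather than the workbook's actual method.

```python
# Values taken from the perf workbook in the question.
PIXELS = 262_144
ALU_PER_PIXEL = 4642
TEX_PER_PIXEL = 2048

# ASSUMPTIONS (not from the thread): picked because they match the workbook.
ALU_UNITS = 64     # assumed ALU instructions issued per clock
TEX_UNITS = 16     # assumed texture fetches issued per clock
CLOCK_HZ = 750e6   # assumed engine clock


def unit_time_ms(pixels: int, ops_per_pixel: int, units: int, clock_hz: float) -> float:
    """Time for `units` parallel units to retire every op, in milliseconds."""
    return pixels * ops_per_pixel / (units * clock_hz) * 1e3


alu_ms = unit_time_ms(PIXELS, ALU_PER_PIXEL, ALU_UNITS, CLOCK_HZ)
tex_ms = unit_time_ms(PIXELS, TEX_PER_PIXEL, TEX_UNITS, CLOCK_HZ)
print(f"ALU time: {alu_ms:.4f} ms")
print(f"Tex time: {tex_ms:.4f} ms")
```

Under these assumed rates the results agree with the workbook's 25.3515 ms and 44.7392 ms to four decimal places. The bandwidth rows are not covered here because the thread does not give enough information to reconstruct them.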



Thank you. I still need help with the following problems:

1. Using the formula in the doc you gave, I get results close to yours for the ALU time and Tex time, but I still have a problem with the bandwidth figures at 0% and 75% cache hit rates.

There are 4 outputs, each in float4 format, right? Then according to the formula, the Mem time = (262144 * 4 * 128 bit) / (256 * 1150 MHz * 2), which is quite different from yours. What is wrong?

Also, I don't quite understand how the cache hit ratio affects the output time. IMO, the cache hit ratio only matters when reading data.
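
Evaluating the Mem time expression above makes the size of the mismatch clear. This is a small Python sketch of exactly that expression; the parameter names are illustrative, and reading the denominator as "bus width in bits * memory clock * 2 transfers per clock" is an assumption about what the formula means.

```python
def output_time_ms(pixels: int, outputs_per_pixel: int, bits_per_output: int,
                   bus_bits: int, mem_clock_hz: float,
                   transfers_per_clock: int) -> float:
    """Time to write every output at peak memory bandwidth, in milliseconds."""
    total_bits = pixels * outputs_per_pixel * bits_per_output
    bits_per_second = bus_bits * mem_clock_hz * transfers_per_clock
    return total_bits / bits_per_second * 1e3


# Numbers exactly as written in the post: 4 float4 (128-bit) outputs per pixel.
mem_ms = output_time_ms(262_144, 4, 128, 256, 1150e6, 2)
print(f"Output time: {mem_ms:.4f} ms")
```

The result is roughly 0.23 ms, far below the workbook's 116.9390 ms and 52.0302 ms, which suggests those rows are not measuring output traffic alone.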

2. Can you explain the way to calculate the active wavefront count? I think it should be floor(256 / GPRs per thread), not floor(160 / GPRs per thread). And "In CAL if register count > 12, 160 physical registers are made available", where can I find this information?

3. About the cache line count: in a wavefront, each thread reads 4 float4 values from 4 different parts of A in a loop, and each cache line is a 4 * 2 block of data, which is 4 * 2 * 16B = 128B. How did you get that A hits 4 cache lines?

And in the animation for kernel 1,

WF 0: 4 * 16B (16CL)

WF 0: 16 * 16B (16CL)

WF 1: 16 * 16B (16CL)

WF 1: 16 * 16B (16CL)

Can you explain why the amount of data for A is different from B in wavefront 0?

Where does "16 cache lines" come from?

In wavefront 1, the amount of data for A becomes 16 * 16B. Why?

Sorry for so many tedious questions; I hope they don't put you off.


1) This time is for going all the way out to memory to read data, not for output.
2) This information is not public except for this statement. It applies only to pixel shaders, not to compute shaders. In compute shader mode, all the registers except for a few are available for use. In pixel shader mode, some registers are allocated to other segments of the graphics pipeline, which puts the limit at 160 registers.
3) The wavefront is allocated as an 8x8 block, so 8 rows of data are read at the same time. Since a cache line is 2 rows high, 8/2 = 4 cache lines. There are 4 input buffers, so 4 cache lines per input buffer * 4 input buffers = 16 cache lines. The difference between A and B is a typo; it should be the same for everything.
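
The cache line arithmetic in point 3 can be written out as a short Python sketch. It only restates the numbers given in this thread (8x8 wavefront block, cache line 2 rows high, 4 input buffers, 4 * 2 * 16B per line); the function name is made up for illustration, and counting lines purely by rows follows the reply rather than any documented cache layout.

```python
def cache_lines_per_buffer(wavefront_rows: int, cache_line_rows: int) -> int:
    """Rows read by one wavefront divided by the rows one cache line covers."""
    if wavefront_rows % cache_line_rows != 0:
        raise ValueError("this sketch assumes the rows divide evenly")
    return wavefront_rows // cache_line_rows


WAVEFRONT_ROWS = 8      # wavefront allocated as an 8x8 block
CACHE_LINE_ROWS = 2     # a cache line is 2 rows high
INPUT_BUFFERS = 4       # 4 input buffers
CACHE_LINE_BYTES = 4 * 2 * 16  # 4 x 2 block of float4 (16 B each)

per_buffer = cache_lines_per_buffer(WAVEFRONT_ROWS, CACHE_LINE_ROWS)
total_lines = per_buffer * INPUT_BUFFERS
total_bytes = total_lines * CACHE_LINE_BYTES
print(f"{per_buffer} cache lines per buffer, {total_lines} per wavefront, "
      f"{total_bytes} bytes touched")
```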