# Questions about ACML-GPU - SGEMM Optimization Illustration

Discussion created by rexiaoyu on Dec 4, 2009
Latest reply on Dec 7, 2009 by MicahVillmow

http://developer.amd.com/gpu_assets/ATI%20Stream%20Computing%20-%20ACML-GPU%20SGEMM%20Optimization%20Illustration.ppt

1.  On page 7, how do I calculate the Theoretical Peak (GFlops) of my kernel?

2.  Also on page 7, look at the perf workbook:

Perf Workbook (1k*1k):

• Total Pixels: 262144
• Total ALU: 4642
• Total Tex: 2048
• Total Out: 4
• ALU Time: 25.3515 ms
• Tex Time: 44.7392 ms
• 0% Bandwidth: 116.9390 ms
• 75% Bandwidth: 52.0302 ms
Bottleneck:
• 0% cache hit rate: Bandwidth (Theoretical Peak: 40.7193 Gflops)
• 75% cache hit rate: Tex (Theoretical Peak: 140.0146 Gflops)
Questions:
1) How are the ALU time and Tex time (underlined on the slide) calculated?
2) What do "0% Bandwidth" and "75% Bandwidth" mean?
3) How are they calculated?
4) How is the Theoretical Peak obtained? (Same as question 1.)
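
For reference, three of the workbook figures (ALU time, Tex time, 0% Bandwidth) can be reproduced by dividing total work by a throughput rate. The Python sketch below shows that arithmetic. The three rates and the 16 bytes per fetch or output are assumptions back-solved from the workbook numbers, not values stated on the slide, and the function names are made up for illustration. The sketch does not attempt the 75% Bandwidth figure or the Theoretical Peak.

```python
# Workbook inputs, copied from the slide.
TOTAL_PIXELS = 262144
ALU_PER_PIXEL = 4642
TEX_PER_PIXEL = 2048
OUT_PER_PIXEL = 4

# ASSUMPTIONS: back-solved from the workbook, not stated on the slide.
ALU_RATE = 48e9          # ALU instructions per second
TEX_RATE = 12e9          # texture fetches per second
MEM_BANDWIDTH = 73.6e9   # bytes per second
BYTES_PER_ACCESS = 16    # one float4 per fetch or per output


def alu_time_ms() -> float:
    """Time to issue every ALU instruction at the assumed ALU rate."""
    return TOTAL_PIXELS * ALU_PER_PIXEL / ALU_RATE * 1e3


def tex_time_ms() -> float:
    """Time to issue every texture fetch at the assumed fetch rate."""
    return TOTAL_PIXELS * TEX_PER_PIXEL / TEX_RATE * 1e3


def bandwidth_time_0pct_ms() -> float:
    """Time to move every fetch and output through memory (no cache hits)."""
    accesses = TOTAL_PIXELS * (TEX_PER_PIXEL + OUT_PER_PIXEL)
    return accesses * BYTES_PER_ACCESS / MEM_BANDWIDTH * 1e3


if __name__ == "__main__":
    print(f"ALU time:     {alu_time_ms():.4f} ms")
    print(f"Tex time:     {tex_time_ms():.4f} ms")
    print(f"0% Bandwidth: {bandwidth_time_0pct_ms():.4f} ms")
```

Under these assumed rates the three results agree with the workbook values to four decimal places, which suggests (but does not prove) that the workbook uses this kind of work/throughput model.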
3. On page 15, how is the active wavefront count calculated?

IMO, active wavefront count (per SIMD) = floor(256 / GPRs per thread), right?

But according to this lecture, it seems not:

In CAL if register count > 12, 160 physical registers are made available
Wavefront count = floor(Available / Used)
Kernel 1 gets 6 wavefronts
Kernel 2 gets 12 wavefronts
What explains the difference between the two?
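
To make the comparison concrete, here is a small Python sketch that evaluates both formulas: floor(256 / GPRs) and the slide's floor(160 / GPRs). The per-thread GPR counts (25 and 13) are hypothetical values picked only because they yield the 6 and 12 wavefronts quoted above; the slide excerpt does not give the actual counts.

```python
def wavefront_count(available_regs: int, used_regs: int) -> int:
    """Wavefront count = floor(Available / Used), as quoted from the slide."""
    if used_regs <= 0:
        raise ValueError("used_regs must be positive")
    return available_regs // used_regs


# HYPOTHETICAL GPR counts per thread, chosen to match the quoted results.
KERNEL_GPRS = {"Kernel 1": 25, "Kernel 2": 13}

if __name__ == "__main__":
    for name, gprs in KERNEL_GPRS.items():
        with_256 = wavefront_count(256, gprs)  # formula from the question
        with_160 = wavefront_count(160, gprs)  # formula from the slide
        print(f"{name}: {gprs} GPRs -> {with_256} (256 regs) vs {with_160} (160 regs)")
```

With these hypothetical counts, only the 160-register pool gives 6 and 12 wavefronts; the 256-register formula gives larger counts for both kernels.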

4. Page 16 states:

Assuming 100% cache hit rate

8 Fetches * 4 Cycles/Fetch + 120 TEX/L1 = 152 cycles to hide (Why 120 here? With a 100% cache hit rate, I would expect only the L1 cache cycles to be included.)
Kernel 1:
18 ALU * 6 Wavefronts = 108 cycles (Why not 6 - 1 = 5 here? One wavefront is stalled, so only 5 wavefronts are left to hide the latency.)
Kernel 2:
19 ALU * 12 Wavefronts = 228 cycles
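
The arithmetic on this slide can be sketched in Python as follows, using only the numbers quoted above. The `exclude_stalled` flag is a hypothetical switch added to show the alternative counting (wavefronts - 1) raised in the question; it is not something the slide describes.

```python
FETCHES = 8
CYCLES_PER_FETCH = 4
TEX_L1_CYCLES = 120  # the "120 TEX/L1" term quoted from the slide


def cycles_to_hide() -> int:
    """Fetch latency that the ALU work has to cover."""
    return FETCHES * CYCLES_PER_FETCH + TEX_L1_CYCLES


def alu_cover(alu_ops: int, wavefronts: int, exclude_stalled: bool = False) -> int:
    """ALU cycles available for hiding latency.

    exclude_stalled=True drops the stalled wavefront from the count,
    which is the alternative raised in the question, not the slide's method.
    """
    active = wavefronts - 1 if exclude_stalled else wavefronts
    return alu_ops * active


if __name__ == "__main__":
    target = cycles_to_hide()
    for name, alu_ops, waves in [("Kernel 1", 18, 6), ("Kernel 2", 19, 12)]:
        slide = alu_cover(alu_ops, waves)
        alt = alu_cover(alu_ops, waves, exclude_stalled=True)
        print(f"{name}: slide={slide}, minus-one={alt}, "
              f"target={target}, hidden={slide >= target}")
```

With these numbers both counting conventions give the same verdict: Kernel 1 falls short of the 152 cycles and Kernel 2 exceeds them.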
5. On page 17, why does A take 4 cache lines and B^T 4 cache lines, while B takes only 2?
Thank you.