Hi guys,
I'm new to this forum and I have to admit that I'm not a very experienced GPU programmer. Nonetheless, I want to write some code that reaches the peak performance (TFLOP/s) of my Radeon 5870, which is rated at 2.72 TFLOPs. To get an easy start, I downloaded "FlopsCL" by Kamil Rocki (see http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/FlopsCL_src_linux.zip).
Running the benchmark tool, I got 2.15 TFLOPs (using float4). That's impressive, but by far not peak performance. So I fired up CodeXL / AMD AppAnalyzer. The results:
KernelOccupancy = 100
ALUBusy = 49.84%
ALUPacking = 79.93%
How can I optimize the kernel code to reach full peak performance? Above all, is 'peak performance' reachable at all, even in a synthetic test, or is it just a number calculated from the tech specs? And how can I improve ALUPacking (obviously only 4 of the 5 VLIW5 ALUs are being utilized)?
Regards
Marcus
You could try this as well and see how it performs:
http://parallelplusplus.blogspot.gr/2014/01/benchmarking-capabilities-of-your_28.html
(4/5) * 2.72 = 2.176. You are reaching 2.15 TFLOPs. All is well!
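For reference, the 2.72 figure itself comes straight from the chip specs. A quick back-of-the-envelope check (assuming the HD 5870's 1600 stream processors at 850 MHz, with a MAD counted as 2 FLOPs):

```python
# Back-of-the-envelope peak-FLOPs check for the Radeon HD 5870.
stream_processors = 1600   # 320 VLIW5 units x 5 slots
clock_hz = 850e6           # 850 MHz engine clock
flops_per_mad = 2          # a multiply-add counts as 2 FLOPs

peak = stream_processors * clock_hz * flops_per_mad
print(peak / 1e12)          # 2.72 TFLOPs

# If only 4 of the 5 VLIW slots can be packed, the realistic ceiling drops:
print(peak * 4 / 5 / 1e12)  # 2.176 TFLOPs
```

Which is exactly why 2.15 TFLOPs measured looks healthy against a 4-of-5-slot ceiling.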
VLIW5 can certainly pack 5 operations together. But can all 5 be floating-point math? You may need to check which arithmetic units can actually be fed by VLIW5.
If the fifth unit is non-arithmetic (say, a shifter or something like that), you really can't get MAD (multiply-add) FLOPs out of it.
For all my calculations, I used to assume 2.2 TFLOPs as the peak on the 5870 -- IIRC.
The GCN Architecture whitepaper shows how the graphics hardware developed over various eras. In that PDF, VLIW5 is represented as 4 stream processors + 1 (FMAD + special functions).
'Special functions' could mean sin/cos/tan; FMAD could be fused multiply-add. I am not too sure though...
HTH
Best,
Bruhaspati
Hi guys, thanks for your input so far.
> You could try this as well and see how it performs:
> http://parallelplusplus.blogspot.gr/2014/01/benchmarking-capabilities-of-your_28.html
I gave this benchmark a try recently... and got 2.1 tflops as well.
> VLIW5 can sure pack 5 operations together. But can all 5 be floating point math? You may need to check what are the arithmetic units that can be fed by VLIW5?
> If the fifth module is a non-arithmetic stuff (like say Shifter or something like that), you really cant get MAD (Mul-add) flops using that.
> For all calculatations, I used to assume 2.2TF as peak on 5870-- iirc.
I think the 5D ALU architecture can handle 5 MAD operations per cycle. There are 4 simple ALUs + 1 big ALU (see this picture: http://pics.computerbase.de/2/6/9/3/9/147_m.png), so calculating 5 32-bit floats in parallel shouldn't be a problem. I think those benchmark programs can't utilize the 5th ALU because of dependencies between the instructions. That would match the ALUPacking of 80% reported by the AMD profiler. Taking a closer look at the benchmark kernel (see my initial posting for the source reference), I think that while working on two float4 values you can never fully load 5 ALUs, only 4. Can anybody confirm this?
Excerpt of the benchmarking kernel (in pseudo code):
[...]
float4 a;
float4 b;
very long (unrolled) loop {
    a = b * a + b; // fused multiply-add of two float4
    b = a * b + a; // fused multiply-add of two float4
}
[...]
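To put a number on that thought: each statement in the loop is one float4 MAD, which fills only 4 of the 5 VLIW slots, and the second statement depends on the result of the first, so the compiler has nothing independent to co-issue into the t-slot. A quick sanity check of the expected packing (a sketch, assuming one float4 MAD per VLIW bundle):

```python
# Expected ALUPacking when a kernel only ever issues float4 (4-wide) MADs
# and a dependency chain leaves the 5th (t) slot empty in every bundle.
slots_per_bundle = 5   # VLIW5: x, y, z, w, t
slots_filled = 4       # one float4 MAD occupies x, y, z, w only

packing = slots_filled / slots_per_bundle
print(packing)         # 0.8 -- matches the ~79.93% ALUPacking from the profiler
```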
Finally, can anybody give me some hints on how to write a benchmarking kernel that utilizes all 5 ALUs?
Regards
Marcus
Hi guys,
meanwhile I got 2.64 TFLOPs. I consider this 'peak performance' (side note: the theoretical peak of my Radeon 5870 is 2.72 TFLOPs).
This is what my kernel looks like:
__kernel void sum_float4_vliw5(__global double* const dA, __global double* dResult) {
    const size_t bx = get_group_id(0);
    const size_t tx = get_local_id(0);
    const unsigned pIndex = 256 * bx + tx;
    float4 a = (float4)(dA[tx], dA[tx], dA[tx], dA[tx]);
    float4 b = (float4)(1.01f, 1.02f, 1.03f, 1.04f);
    float c = dA[tx];
    float d = 1.02f;
    for (unsigned i = 0; i < 600; i++) {
        a = b * a + b;
        b = a * b + a;
        c = d * c + d;
        d = c * d + c;
    }
    dResult[pIndex] = a.s0 + b.s0 + a.s1 + b.s1 + a.s2 + b.s2 + a.s3 + b.s3 + c + d;
}
Taking a look at the generated ISA code, I can confirm that all ALU slots (x, y, z, w and t) are used. This fits perfectly with the ALUPacking of 97.79% reported by the performance counters.
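For anyone who wants to verify the numbers: per loop iteration the kernel issues two float4 MADs (4 lanes each) plus two scalar MADs, i.e. 10 MADs, which fit into two full 5-slot VLIW bundles. A rough FLOP count per work-item (a sketch; the exact instruction total depends on how the compiler unrolls the loop):

```python
# Per-iteration work in the kernel body:
#   a = b*a + b  -> 4 MADs (float4)
#   b = a*b + a  -> 4 MADs (float4)
#   c = d*c + d  -> 1 MAD  (scalar)
#   d = c*d + c  -> 1 MAD  (scalar)
mads_per_iter = 4 + 4 + 1 + 1        # 10 MADs = two full VLIW5 bundles
flops_per_iter = mads_per_iter * 2   # each MAD counts as 2 FLOPs
iterations = 600

flops_per_work_item = flops_per_iter * iterations
print(flops_per_work_item)           # 12000 FLOPs per work-item
```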
Here is an excerpt of my performance counter statistics:
Kernel Occupancy: 100%
ALUBusy: 49.77%
ALUPacking: 97.79%
Nonetheless, I don't understand 'ALUBusy'. Why don't I see a value close to 100%? Why is it only 49.77%? The hint for ALUBusy says 'The percentage of GPUTime ALU instructions are processed'. If my kernel reaches peak performance and there are no memory fetches, why aren't the ALUs executing instructions 100% of the time? I am confused...
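Just to show why the counter looks wrong to me, comparing measured throughput against the theoretical peak:

```python
# My measured throughput is ~97% of theoretical peak, yet ALUBusy reads ~50%.
measured_tflops = 2.64
peak_tflops = 2.72
print(measured_tflops / peak_tflops)   # ~0.97 -- nowhere near the 0.4977 ALUBusy suggests
```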
Regards
Marcus
Hi,
The ALUBusy performance counter shows the percentage of time the GPU spends executing ALU instructions. A low ALUBusy value usually means there are not enough active wavefronts to hide memory latency, so the GPU sits idle while waiting for data from memory.