
Re: Get peak performance at Radeon 5870
ekondis Feb 4, 2014 11:42 PM (in response to marcusfromeurope)
You could try this as well and see how it performs:
http://parallelplusplus.blogspot.gr/2014/01/benchmarkingcapabilitiesofyour_28.html

developer Feb 5, 2014 12:53 AM (in response to marcusfromeurope)
(4/5) * 2.72 = 2.176. You are reaching 2.15 TFLOPS. All is well!
VLIW5 can certainly pack 5 operations together, but can all 5 be floating-point math? You may need to check which arithmetic units VLIW5 can feed.
If the fifth unit is non-arithmetic (say, a shifter or something similar), you really can't get MAD (multiply-add) FLOPs out of it.
For all calculations, I used to assume 2.2 TFLOPS as peak on the 5870, IIRC.
The GCN Architecture whitepaper shows how graphics developed over various eras. In that PDF, VLIW5 is represented as 4 stream processors + 1 (FMAD + Special functions).
Special functions could mean sin/cos/tan functions. FMAD could be fused multiply-add. I am not too sure though...
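The (4/5) * 2.72 estimate above can be reproduced with a few lines of C. This is only a sketch: the 1600 stream processors and the 850 MHz engine clock are the published Radeon HD 5870 figures and are not stated in this thread, and the function names are made up for illustration.

```c
#include <assert.h>

/* Theoretical single-precision peak in TFLOPS.
 * alus:         number of scalar ALU lanes (assumed: 1600 on the HD 5870)
 * clock_mhz:    engine clock in MHz (assumed: 850 on the HD 5870)
 * flops_per_op: 2 for a multiply-add (one multiply plus one add) */
double peak_tflops(int alus, double clock_mhz, int flops_per_op)
{
    assert(alus > 0 && clock_mhz > 0.0 && flops_per_op > 0);
    return (double)alus * clock_mhz * 1e6 * flops_per_op / 1e12;
}

/* Peak scaled by the fraction of VLIW slots a kernel manages to fill. */
double packed_peak_tflops(double peak, int slots_used, int slots_total)
{
    assert(slots_total > 0 && slots_used >= 0 && slots_used <= slots_total);
    return peak * slots_used / slots_total;
}
```

With these assumed figures, peak_tflops(1600, 850.0, 2) comes out at about 2.72, and packed_peak_tflops of that value with 4 of 5 slots at about 2.176, which matches the numbers quoted above.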
marcusfromeurope Feb 5, 2014 9:56 AM (in response to developer)
Hi guys, thanks for your input so far.
> You could try this as well and see how it performs:
> http://parallelplusplus.blogspot.gr/2014/01/benchmarkingcapabilitiesofyour_28.html
I gave this benchmark a try recently and got 2.1 TFLOPS as well.
> VLIW5 can certainly pack 5 operations together, but can all 5 be floating-point math? You may need to check which arithmetic units VLIW5 can feed.
> If the fifth unit is non-arithmetic (say, a shifter or something similar), you really can't get MAD (multiply-add) FLOPs out of it.
> For all calculations, I used to assume 2.2 TFLOPS as peak on the 5870, IIRC.
I think the 5D-ALU architecture can handle 5 MAD operations per cycle. There are 4 simple ALUs + 1 big ALU (see this picture: http://pics.computerbase.de/2/6/9/3/9/147_m.png), so calculating five 32-bit floats in parallel shouldn't be a problem. I think those benchmark programs can't utilize the 5th ALU because of dependencies between the instructions. This matches the ALUPacking of 80% reported by the AMD profiler. Taking a closer look at the benchmark kernel (see my initial post for the source reference), I think that with two float4 values you can never fully load 5 ALUs, only 4. Can anybody confirm this?
Excerpt of the benchmarking kernel (in pseudo code):
[...]
float4 a;
float4 b;
very long (unrolled) loop {
a = b * a + b; // fused multiply add of two float4
b = a * b + a; // fused multiply add of two float4
} [...]
Finally, can anybody give me some hints on how to write a benchmarking kernel that utilizes all 5 ALUs?
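The 80% figure can be illustrated with a toy model of bundle filling. It assumes that each statement in the float4 loop quoted above exposes a fixed number of independent scalar MADs (four, one per component) and that the next statement cannot share a bundle with them because it depends on their results. The model ignores how the real compiler schedules instructions and any differences between the t unit and the other four, so treat it as a sketch only; the function name is hypothetical.

```c
#include <assert.h>

#define VLIW_SLOTS 5

/* Toy model: a dependent step exposes `independent_ops` scalar operations
 * that may share VLIW bundles; the next step must wait for all of them.
 * Returns the fraction of issue slots that get filled. */
double packing_ratio(int independent_ops)
{
    assert(independent_ops > 0);
    /* Number of bundles needed for this step, rounded up. */
    int bundles = (independent_ops + VLIW_SLOTS - 1) / VLIW_SLOTS;
    return (double)independent_ops / (double)(bundles * VLIW_SLOTS);
}
```

Under this model, four independent operations per step fill 4 of 5 slots (0.8), while a fifth independent operation per step, for example from a separate scalar chain, would fill the bundle completely.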
marcusfromeurope Feb 22, 2014 2:42 AM (in response to marcusfromeurope)
Hi guys,
meanwhile I got 2.64 TFLOPS. I consider this 'peak performance' (side note: the theoretical peak performance of my Radeon 5870 is 2.72 TFLOPS).
This is what my kernel looks like:
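__kernel void sum_float4_vliw5(__global double* const dA, __global double* dResult) {
    const size_t bx = get_group_id(0);
    const size_t tx = get_local_id(0);
    // Output index; assumes work-groups of 256 work-items.
    const unsigned pIndex = 256 * bx + tx;
    // dA holds doubles; the values are converted to float for the arithmetic.
    // Two dependent float4 chains provide work for the x, y, z and w slots.
    float4 a = (float4)(dA[tx], dA[tx], dA[tx], dA[tx]);
    float4 b = (float4)(1.01f, 1.02f, 1.03f, 1.04f);
    // An additional scalar chain provides work for the fifth (t) slot.
    float c = dA[tx];
    float d = 1.02f;
    for (unsigned i = 0; i < 600; i++) {
        a = b * a + b;
        b = a * b + a;
        c = d * c + d;
        d = c * d + c;
    }
    // Use every result so the compiler cannot eliminate the loop.
    dResult[pIndex] = a.s0 + b.s0 + a.s1 + b.s1 + a.s2 + b.s2 + a.s3 + b.s3 + c + d;
}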
Taking a look at the generated ISA code, I can confirm that all ALUs (x, y, z, w and t) are used. This fits the ALUPacking of 97.79% reported by the performance counters.
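To turn a kernel run like the one above into a TFLOPS figure, the FLOPs per work-item have to be counted. The sketch below does this for the loop as posted, counting each MAD as 2 FLOPs and ignoring the final summation. The thread gives neither the global work size nor the timing, so both are plain parameters here, and the function names are hypothetical.

```c
#include <assert.h>

/* FLOPs executed by one work-item of the kernel above.
 * Per loop iteration: two float4 MADs (2 * 4 lanes) plus two scalar MADs,
 * each MAD counted as 2 FLOPs. The final reduction adds are ignored. */
long long flops_per_work_item(int iterations)
{
    assert(iterations > 0);
    const int mads_per_iter = 2 * 4 + 2; /* 10 scalar MADs per iteration */
    return (long long)iterations * mads_per_iter * 2;
}

/* Achieved TFLOPS for a launch of `work_items` that ran for `seconds`. */
double achieved_tflops(long long work_items, int iterations, double seconds)
{
    assert(work_items > 0 && seconds > 0.0);
    return (double)work_items * (double)flops_per_work_item(iterations)
           / seconds / 1e12;
}
```

For the 600 iterations in the posted kernel, flops_per_work_item(600) is 12000. Multiplying by the number of work-items and dividing by the measured kernel time gives the achieved rate.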
Here is an excerpt of my performance counter statistics:
Kernel Occupancy: 100%
ALUBusy: 49.77%
ALUPacking: 97.79%
Nonetheless, I don't understand 'ALUBusy'. Why don't I see a value close to 100%? Why is it just 49.77%? For ALUBusy, the profiler shows the hint 'The percentage of GPUTime ALU instructions are processed'. As long as my kernel reaches peak performance and there are no memory fetches, why aren't all ALUs executing instructions 100% of the time? I am confused...
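The size of the mismatch can be put into numbers. The sketch below assumes a naive reading of the counters, namely that throughput is roughly peak * ALUBusy * ALUPacking. That reading is an assumption for illustration, not the profiler's documented definition, and the function name is hypothetical.

```c
#include <assert.h>

/* Naive estimate (an assumption, not the profiler's definition):
 * throughput = peak * share of time ALUs are busy * share of slots filled. */
double implied_tflops(double peak, double alu_busy_pct, double alu_packing_pct)
{
    assert(peak > 0.0);
    assert(alu_busy_pct >= 0.0 && alu_busy_pct <= 100.0);
    assert(alu_packing_pct >= 0.0 && alu_packing_pct <= 100.0);
    return peak * (alu_busy_pct / 100.0) * (alu_packing_pct / 100.0);
}
```

With the values from this post, implied_tflops(2.72, 49.77, 97.79) is roughly 1.32 TFLOPS, about half of the measured 2.64 TFLOPS. Under this naive reading the counter and the measurement disagree by a factor of about two, which suggests that ALUBusy is not normalized the way the reading assumes. The thread does not settle why.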
sudarshan Jun 9, 2014 4:31 AM (in response to marcusfromeurope)
Hi,
The ALUBusy performance counter shows the percentage of time the GPU spends executing ALU instructions. A low ALUBusy value means that there are no active wavefronts to hide memory latency, so the GPU remains idle while it waits for data from memory.


