I am new in this forum und I have to admit that I am not a very experienced programmer in terms of gpu programming. Nonetheless I want to write some code to achieve peak performance (tflop/s) of my radeon 5870 (which is 2.72 TFlops). Getting an easy start, I downloaded "FlopsCL" from Kamil Rocki (see http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/FlopsCL_src_linux.zip).
Running the benchmark tool I got 2.15 TFlops (using float4). Thats impressive but by far not peak performance. Thus, I fired up CodeXL / AMD AppAnalyzer. The results:
KernelOccupancy = 100
ALUBusy = 49.84%
ALUPacking = 79.93%
How can I optimize the kernel code to get full peak performance? Above all, is 'peak performance' reachable, even in a synthetic test - or is this just a calculated number based on tech details? How can I get better ALUPacking (obviously there are only 4 ALUs utilized of the VLIW5-ALUs)?