While measuring the GFLOP/s performance of some programs I'm working on, I ran into some odd performance results on the APU...
1. I'm running a program that stores the input data in a 4096x4096 2D image object and does some filtering. On the APU I get performance comparable to the Nvidia Tesla M2050 (~100 GFLOP/s). However, the code is bandwidth-limited, so I would expect lower performance on the APU. The results nevertheless seem to be correct.
2. The same program as in 1. is extended to use 3D image objects. For an equivalent problem (256x256x256), performance on the Nvidia GPU drops drastically to ~14 GFLOP/s, even lower than the ~35 GFLOP/s obtained on a 24-core Magny-Cours. On the APU, on the other hand, performance increases to ~120 GFLOP/s, and the results still look correct.
How is this possible? Are 3D image objects really that much better on the APU than on the M2050?
3. The code from 1. is now incorporated into another program. Here, however, the APU timings show a discrepancy: the time measured on the CPU side is 2-4x the time measured with OpenCL events (the OpenCL event timing matches case 1). On the Nvidia GPU this problem does not appear and the CPU measurements match the OpenCL events.
4. The ImageBandwidth test in the APP SDK reports a bandwidth of 382 GB/s! The global memory bandwidth also seems pretty high: 96 GB/s. The SHOC benchmark, however, gives me a peak of 60-70 GB/s for image objects.
Can someone help me make sense of these numbers?
PS: in the performance results I only count kernel execution time, not the memory transfers.