While measuring the GFLOP/s performance of some programs I'm working on, I encountered some weird performance results on the APU...
1. I'm running a program that stores the input data in a 4096x4096 2D image object and does some filtering. With the APU I get performance comparable to the Nvidia Tesla M2050 (~100 GFLOP/s). However, the code is bandwidth-limited and I would expect lower performance on the APU. The results nevertheless appear to be correct.
2. The same program as in 1. is extended to use 3D image objects. The performance on the Nvidia GPU decreases drastically for an equivalent problem (256x256x256): it drops to ~14 GFLOP/s, even lower than the ~35 GFLOP/s obtained on a 24-core Magny-Cours. On the APU, on the other hand, the performance increases to ~120 GFLOP/s, and the results still look correct.
How is this possible? Are 3D image objects really that much better on the APU than on the M2050?
3. The code from 1. is now incorporated into another program. Here, however, the performance on the APU shows a discrepancy: the time measured from the CPU side is 2-4x the time measured with OpenCL events (the OpenCL event timing matches 1.). On the Nvidia GPU this problem does not appear, and the CPU measurements match the OpenCL events.
4. The ImageBandwidth sample included in the APP SDK reports a bandwidth of 382 GB/s! The global memory bandwidth also seems pretty high: 96 GB/s. The SHOC benchmark, however, gives me a peak of 60-70 GB/s for image objects.
Can someone help me make sense of these numbers?
PS: in the performance results I only count kernel execution time, not memory transfers.
It looks like the events are not working properly, or maybe you are not using them properly. Can you provide a test case where this timing discrepancy can be reproduced?
Also, please mention your SDK, Catalyst, and OS versions.
I guess memory-bound kernels may be able to achieve higher performance on APUs, since the effective memory bandwidth is higher there.
I assume you are testing on a Llano-based APU. Keep in mind that while system memory bandwidth is likely lower than a discrete GPU's local memory bandwidth, if your algorithm makes good use of the caches in the APU, then performance can still be quite good.
I don't have any insight into what your particular test is doing, but there's more to performance than just external memory bandwidth, even if your test is bandwidth limited.