The following picture of a simple Stockham radix-2 FFT kernel made me wonder what is going on in SKA...
After compilation, the kernel's loops are unrolled for all architectures (I used constants in an #ifdef SKA - #endif block just for this purpose).
The numbers I get for the 4xxx series are more or less sane, but all numbers for the 5xxx series depend entirely on the setting for the average loop count. I could understand this if the compiler had been unable to unroll the loops, but it did unroll them. So why does this setting have any influence at all?
I think it is important to get proper feedback from these tools; otherwise they are quite useless. At the moment I would rather measure the performance directly than use SKA, if only I had some more hardware around :/
Can you please provide this kernel?