
sharpneli
Adept I

Disabling optimizations produces faster code

The compiler seems to be too clever for its own good in some cases.

As shown by actual tests and the APP KernelAnalyzer, in some cases disabling optimizations with -cl-opt-disable can produce much faster code.

As an example, we have a kernel that compiles to 181 ALU instructions, 83 fetches, and 44 writes, regardless of which optimization flags are added. With -cl-opt-disable it compiles to 48 ALU instructions, 7 fetches, and 6 writes. Both the benchmarked throughput and the estimated throughput are approximately double those of the 'optimized' version.

It seems the compiler does no sanity checks on the actual benefit of the optimization passes it performs. I presume this will be improved in the future?

4 Replies
genaganna
Journeyman III

Originally posted by: sharpneli As shown by actual tests and the APP KernelAnalyzer, in some cases disabling optimizations with -cl-opt-disable can produce much faster code.

As an example, we have a kernel that compiles to 181 ALU instructions, 83 fetches, and 44 writes, regardless of which optimization flags are added. With -cl-opt-disable it compiles to 48 ALU instructions, 7 fetches, and 6 writes. Both the benchmarked throughput and the estimated throughput are approximately double those of the 'optimized' version.

It seems the compiler does no sanity checks on the actual benefit of the optimization passes it performs. I presume this will be improved in the future?

It would be good if you could give us kernel code that reproduces this issue.

Could you please tell us which device you are compiling for?

Please give us your system information (OS, CPU, GPU, SDK version, and driver version).


The device I'm actually using is a Radeon HD 5770. The OS is Windows 7, the APP SDK is 2.5, and the driver is 11.7.

I've managed to narrow down the problem. The kernel is basically two for loops. Both do the same thing, but on different buffers: they go through different edges of a mesh and, if the connection exists, do some work.

If one loop is deleted so that only one remains, then optimization produces faster code; it does not matter which loop is deleted. So, in essence, splitting the kernel into two seems to produce much higher throughput.

However, deleting a loop does not affect the number of registers used, and the loops are completely independent of each other, so there is no reason why splitting them up ought to produce faster code. Considering that disabling optimizations in the two-loop case helps performance, it looks like a bug in the compiler.

P.S. Fiddling with the workgroup size produced no difference whatsoever in the compiled code.


Originally posted by: sharpneli The device I'm actually using is a Radeon HD 5770. The OS is Windows 7, the APP SDK is 2.5, and the driver is 11.7.

I've managed to narrow down the problem. The kernel is basically two for loops. Both do the same thing, but on different buffers: they go through different edges of a mesh and, if the connection exists, do some work.

If one loop is deleted so that only one remains, then optimization produces faster code; it does not matter which loop is deleted. So, in essence, splitting the kernel into two seems to produce much higher throughput.

However, deleting a loop does not affect the number of registers used, and the loops are completely independent of each other, so there is no reason why splitting them up ought to produce faster code. Considering that disabling optimizations in the two-loop case helps performance, it looks like a bug in the compiler.

P.S. Fiddling with the workgroup size produced no difference whatsoever in the compiled code.

Could you please paste a kernel here or file a ticket at http://developer.amd.com/support/KnowledgeBase/pages/HelpdeskTicketForm.aspx?

notzed
Challenger

Sounds like the loop was unrolled. Unrolled loops don't always run faster, mostly because of the register load, and a higher ALU count doesn't necessarily mean the code is badly optimised.

E.g. the actual code might run faster for a given work-group, but you can't run as many of them concurrently if it uses too many registers.

But the problem is a bit trickier than that: the compiler doesn't know how many workgroups you're going to run, so optimising for workgroup parallelism isn't always the correct approach anyway.  In short, you're going to have to help the compiler a bit.

The reqd_work_group_size() attribute is about the best you can do here to tell the compiler how you're going to run it, along with perhaps judicious use of #pragma unroll.
