Both actual benchmarks and the APP KernelAnalyzer show that, in some cases, disabling optimizations with -cl-opt-disable produces much faster code.
As an example, we have a kernel that compiles to 181 ALU instructions, 83 fetches and 44 writes, regardless of which optimization flags are added. With -cl-opt-disable it compiles to 48 ALU instructions, 7 fetches and 6 writes. Both the benchmarked throughput and the estimated throughput are approximately double those of the 'optimized' version.
It seems the compiler does not check whether the optimization passes it performs actually pay off. I presume this will be improved in the future?
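For anyone who wants to repeat the comparison, here is a minimal sketch of how the two build variants could be set up on the host side. The helper name make_build_options is hypothetical (it is not part of OpenCL or the APP SDK); it only composes the options string, which would then be passed as the fourth argument of clBuildProgram. The -cl-mad-enable flag in the usage note is just a placeholder for whatever flags you normally use.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an OpenCL API): compose the options string
 * for clBuildProgram. `base` holds the usual flags and may be NULL or "".
 * When disable_opt is non-zero, "-cl-opt-disable" is appended.
 * Returns 0 on success, -1 if the result does not fit into buf. */
int make_build_options(char *buf, size_t size, const char *base, int disable_opt)
{
    const char *flag = disable_opt ? "-cl-opt-disable" : "";
    const char *sep;
    int n;

    if (base == NULL)
        base = "";

    /* Insert a separating space only when both parts are non-empty. */
    sep = (base[0] != '\0' && flag[0] != '\0') ? " " : "";

    n = snprintf(buf, size, "%s%s%s", base, sep, flag);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Intended use in host code (needs the OpenCL headers and a valid
 * program/device, which are assumed here and not shown):
 *
 *   char opts[128];
 *   make_build_options(opts, sizeof opts, "-cl-mad-enable", 1);
 *   err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
 */
```

Building the same program once with disable_opt set to 0 and once with 1, then timing both kernels on identical input, gives the kind of comparison described above. Results will depend on the kernel, the device and the driver version.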