When my kernel performs badly, the APP Profiler reports very low ALUBusy and FetchUnitBusy values (both less than 10%).
What could be the bottleneck here? Could it be the high number of code paths?
Can you provide information about your device? If it's an AMD APU, there were problems with performance counters in previous versions of APP Profiler.
Also, check the ALUPacking counter. If its value is low, your code is VLIW-limited and ALUBusy will be poor. In that case, try to reduce data dependencies between consecutive operations; that lets the compiler pack more ALU instructions into each VLIW bundle and make better use of the ALU resources. Try to reduce control flow statements too, since they also affect the counters. In your situation, maybe you have if-statements where one branch does a fetch and the other does some computation? That makes part of the wavefront do the fetch, and only after that does the remainder of the wavefront do its ALU operations, so you use only part of the resources at any one time.
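To illustrate the data-dependency point, here is a minimal sketch in plain C (not the original poster's kernel, and not OpenCL device code). The function names are made up for the example. The first version has one long dependency chain; the second keeps four independent accumulators, which is the kind of independence a VLIW compiler can pack into one bundle. Whether ALUPacking actually improves depends on the compiler and the kernel, so treat this only as a pattern to try.

```c
#include <stddef.h>

/* One accumulator: every add depends on the result of the previous add,
   so there is little independent work to pack side by side. */
float sum_chained(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Four independent accumulators: the four adds in one iteration do not
   depend on one another, so they are candidates for the same bundle. */
float sum_unrolled4(const float *x, size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; ++i)   /* leftover elements when n is not a multiple of 4 */
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

Note that the second version adds the floats in a different order, so for general inputs the result can differ slightly in the last bits.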
I have dual Radeon 6950s, with either the 12.3 driver or the new beta driver. It seems control flow was the issue; things are much better now. Is there an equation I can use to check that the counter values add up to 100%, so that I can be more certain I am not getting bogus numbers?
I guess not; there is no such equation. The main reason is that when a wavefront executing on a compute unit issues a fetch instruction, that wavefront goes to the fetch unit and sits there until the fetch is done. During that time other wavefronts are doing calculations, or waiting until the fetch unit becomes free so they can execute their next fetch instructions. So while some wavefronts are doing memory reads or writes, others can do computations, and in the best case both counters can be at 100%, with the ALUFetchRatio counter equal to 1. Other important counters are FetchUnitStalled and WriteUnitStalled; try to keep them close to 0. If they are too high, many wavefronts are waiting for the fetch unit to do a memory read/write. To improve performance, first try to use a sequential memory access pattern, then try to use local memory if your algorithm reuses data several times within a workgroup.
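As a rough sketch of what "sequential memory access pattern" means here, the two hypothetical index functions below (plain C, names and layout are assumptions made for the example) show two ways for num_items work-items to walk over a buffer. In the strided version each work-item owns a contiguous chunk, so at any given step neighbouring work-items read addresses per_item elements apart. In the sequential version neighbouring work-items read neighbouring addresses at each step, which is the pattern the advice above refers to.

```c
#include <stddef.h>

/* Strided: work-item gid owns elements [gid * per_item, (gid + 1) * per_item).
   At step k, work-items gid and gid + 1 are per_item elements apart. */
size_t strided_index(size_t gid, size_t k, size_t per_item) {
    return gid * per_item + k;
}

/* Sequential: at step k, work-item gid reads element k * num_items + gid.
   Work-items gid and gid + 1 read adjacent elements. */
size_t sequential_index(size_t gid, size_t k, size_t num_items) {
    return k * num_items + gid;
}
```

In an OpenCL kernel, gid would typically come from get_global_id(0) and k would be the loop counter inside the kernel; both mappings cover the same set of elements, only the order per work-item changes.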