Hi, NURBS!
Can you provide some information about your device? If it's an AMD APU, previous versions of APP Profiler had problems with performance counters.
Also, check the ALUPacking counter. If its value is low, your code is VLIW-limited and ALUBusy will be poor. In that case, try to reduce data dependencies between sequential operations; that lets the compiler pack ALU instructions into VLIW bundles more densely and make better use of the ALU resources.
Try to reduce control flow statements as well, since they affect the counters too. In your situation, maybe you have if-statements where one branch does a fetch operation and the other does some computation? In that case part of the wavefront does the fetch first, and only after that does the rest of the wavefront do its ALU operations, so only part of the resources is in use at any given time.
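To illustrate what I mean by reducing data dependencies, here is a rough sketch in plain C (the same idea applies inside an OpenCL kernel body). The function names `sum_dependent` and `sum_independent` are made up for this example and are not taken from your code. The first version forms one serial chain of adds; the second keeps four independent accumulators, which gives the compiler independent operations it may be able to pack together. Whether ALUPacking actually improves depends on your device and compiler version, so treat this as something to try and then measure.

```c
#include <stddef.h>

/* Hypothetical example: one accumulator.
   Every add depends on the result of the previous add,
   so the operations form a single serial dependency chain. */
float sum_dependent(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Hypothetical example: four accumulators.
   The four adds inside one loop iteration do not depend on
   each other, only on their own value from the previous iteration. */
float sum_independent(const float *x, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;

    /* Main loop: process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }

    /* Tail: handle the remaining 0..3 elements. */
    for (; i < n; ++i)
        a0 += x[i];

    /* Combine the partial sums at the end. */
    return (a0 + a1) + (a2 + a3);
}
```

Note that the second version adds the floats in a different order, so for general inputs the result can differ from the first version in the last bits.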