I am trying to implement an efficient SGEMV routine where the vector is taken from the matrix itself. The matrix is stored in column-major layout. First, each work-group loads the whole vector into __local memory; here I assume that the number of columns (which are rows in this layout) is fairly small, which should be true in my case. Then, in every iteration, the kernel loads a column (physically a row) in a coalesced manner, where each work-item loads a float4. After multiplying the float4 by the corresponding value from the __local cache, the result is added to the accumulator and the next iteration begins with the next column.
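To make the question concrete, here is a minimal sketch of that kernel in OpenCL C. All names are mine, and edge cases (row count not a multiple of 4, vector larger than one pass of the work-group handles) are simplified:

```c
/* Sketch only: assumes rows is a multiple of 4 and cols fits in __local memory. */
__kernel void sgemv_colmajor(__global const float4 *mat,   /* column-major matrix, viewed as float4 */
                             __global const float  *vec,   /* input vector (length cols) */
                             __global float4       *out,   /* result (length rows/4) */
                             __local  float        *cache, /* __local copy of the whole vector */
                             const int rows4,              /* rows / 4 */
                             const int cols)
{
    const int gid = get_global_id(0);
    const int lid = get_local_id(0);

    /* 1. Cooperatively stage the whole vector in __local memory. */
    for (int i = lid; i < cols; i += get_local_size(0))
        cache[i] = vec[i];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* 2. Walk the columns; each work-item reads one float4 per column,
          so consecutive work-items touch consecutive addresses (coalesced). */
    float4 acc = (float4)(0.0f);
    for (int c = 0; c < cols; ++c)
        acc += mat[c * rows4 + gid] * cache[c];

    out[gid] = acc;
}
```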

My problem is that the effective bandwidth is only 45% of the theoretical peak, and I can't see where I could increase it. The profiler reports high values of FetchUnitStalled (~54%) and FetchUnitBusy (~96%) with low ALUBusy. Any suggestions to speed things up?

I am using a Radeon HD5850 with SDK 2.4.

Please use SDK 2.5 for your development. It would be good if you posted your kernel here. Could you also tell us how you are calculating that the effective bandwidth is only 45% of the theoretical bandwidth?