I am trying to implement an efficient sgemv routine where the vector is taken from the matrix itself. The matrix is in column-major layout. First, each workgroup loads the whole vector into __local memory. Here I assume that the number of columns (which are rows in this layout) is fairly small, which should be true in my case. Then, in every iteration, the kernel loads a column (physically a row) in a coalesced manner, with each work item loading a float4. After multiplying the float4 by the corresponding value from the __local cache, the result is added to the accumulator and the next iteration begins with the next column.
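Roughly, the kernel looks like this (a simplified sketch; the names, arguments, and the way the vector is extracted from the matrix are illustrative, my real code differs in details):

```c
__kernel void sgemv_colmajor(__global const float4 *A, // column-major, rows/4 float4s per column
                             __global float4 *y,       // result, one float4 per work item
                             __local float *xcache,    // cached vector, length n
                             uint n,                   // number of columns
                             uint rows4)               // number of rows divided by 4
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);

    // Step 1: the whole workgroup cooperatively loads the vector
    // into __local memory (n is assumed small).
    for (uint i = lid; i < n; i += get_local_size(0))
        xcache[i] = ...; // vector element, taken from the matrix itself
    barrier(CLK_LOCAL_MEM_FENCE);

    // Step 2: walk the columns; each work item reads one float4 per
    // column in a coalesced pattern and accumulates.
    float4 acc = (float4)(0.0f);
    for (uint col = 0; col < n; ++col)
        acc += A[col * (ulong)rows4 + gid] * xcache[col];

    y[gid] = acc;
}
```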
My problem is that the effective bandwidth is only 45% of the theoretical peak, and I can't see where I could increase it. The profiler reports high values of FetchUnitStalled (~54%) and FetchUnitBusy (~96%) with low ALUBusy. Any suggestions to speed things up?
I am using a Radeon HD 5850 with APP SDK 2.4.