

bolgarbe
Journeyman III

sgemv question

I am trying to implement an efficient sgemv routine where the vector is taken from the matrix itself. The matrix is in column-major layout. First, each workgroup loads the whole vector into __local memory. Here I assume that the number of columns (which are rows with this layout) is fairly small, which should be true in my case. Then, in every iteration, the kernel loads a column (physically a row) in a coalesced manner, with each work item loading a float4. After multiplying the float4 by the corresponding value from the __local cache, the result is added to the accumulator and the next iteration begins with the next column.
My problem is that the effective bandwidth is only 45% of the theoretical peak and I can't see where I could increase it. The profiler reports high values of FetchUnitStalled (~54%) and FetchUnitBusy (~96%) with low AluBusy. Any suggestions to speed things up?
I am using a Radeon HD5850 with SDK 2.4.
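
For readers who want something to check a kernel like this against, here is a minimal host-side sketch in plain C of the access pattern described above. It is not the poster's kernel (which was not posted). The function name `sgemv_colmajor_ref` and its parameters are hypothetical, and it assumes the vector `x` has already been extracted from the matrix into its own array.

```c
#include <stddef.h>

/* Reference y = A * x for a column-major matrix A with m rows and n columns.
 * Hypothetical helper for validating a GPU kernel; names are illustrative.
 * Assumption: x (n floats) has already been copied out of the matrix. */
void sgemv_colmajor_ref(size_t m, size_t n,
                        const float *a, const float *x, float *y)
{
    /* Clear the accumulators, one per output row. */
    for (size_t i = 0; i < m; ++i)
        y[i] = 0.0f;

    /* One outer iteration per column, mirroring the kernel's loop. */
    for (size_t j = 0; j < n; ++j) {
        const float xj = x[j];         /* the value the kernel reads from __local */
        const float *col = a + j * m;  /* column j is contiguous in memory */

        /* On the GPU each work item would handle four consecutive i values
         * (one float4); here the rows are simply walked in order. */
        for (size_t i = 0; i < m; ++i)
            y[i] += col[i] * xj;
    }
}
```

Comparing the kernel's output with this loop on a small matrix is a cheap way to confirm correctness before tuning for bandwidth.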

genaganna
Journeyman III


Please use SDK 2.5 for your development. It would help if you posted your kernel here. Could you also tell us how you calculated that the effective bandwidth is only 45% of the theoretical bandwidth?
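
For reference, one common way to arrive at such a figure is to divide the bytes the kernel must move by the measured kernel time. The sketch below is only an assumption about how the number might be computed, not the original poster's method. The function names are hypothetical. It counts one read of the m x n matrix plus one write of the m-element result, and ignores the small vector that each workgroup loads into __local memory.

```c
#include <stddef.h>

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for y = A * x with an
 * m x n single-precision matrix. Hypothetical helper; the traffic model
 * (matrix read once, result written once) is an assumption. */
double sgemv_effective_gbps(size_t m, size_t n, double seconds)
{
    double elements = (double)m * (double)n   /* matrix reads   */
                    + (double)m;              /* result writes  */
    double bytes = elements * (double)sizeof(float);
    return bytes / seconds / 1e9;
}

/* Fraction of the peak bandwidth; peak_gbps must come from the vendor's
 * specification for the card and is supplied by the caller. */
double fraction_of_peak(double effective_gbps, double peak_gbps)
{
    return effective_gbps / peak_gbps;
}
```

The kernel time passed in should be the device execution time (for example from OpenCL event profiling), not a host-side wall-clock time that includes launch overhead and transfers; otherwise the result will understate what the kernel achieves.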
