# private vs local vs global for in-between sums and my memory access pattern

Question asked by gpgpucoder on Mar 19, 2015
Latest reply on Mar 20, 2015 by tzachi.cohen

I am looking at hand-optimizing a kernel, starting by hand-unrolling its loops. The computations are a bit more complex than what I'm describing, but for the sake of discussion, it accumulates some partial "sums".

While thinking it over, I was trying to figure out the most promising place to store the partial sums, as there could be a lot of them. My first thought was private memory, but as I said, there could be a lot. Also, the versions of my kernel that use local memory are already close to exhausting it, so I'd have to reduce the workgroup size to give each work-item more local memory. My first preliminary experiments aren't showing any appreciable improvement from the unrolling. I haven't examined the IL for those cases yet; I just wanted to ask first so I can head in the most fruitful direction.

Anyhoo, the pattern for a first pass at hand-unrolling could look like this:

    accum[0] = a[0+y*w+0]     + a[0+y*w+1]     + a[0+y*w+2]     + a[0+y*w+3] +
               a[0+(y+1)*w+0] + a[0+(y+1)*w+1] + a[0+(y+1)*w+2] + a[0+(y+1)*w+3] +
               a[0+(y+2)*w+0] + a[0+(y+2)*w+1] + a[0+(y+2)*w+2] + a[0+(y+2)*w+3] +
               a[0+(y+3)*w+0] + a[0+(y+3)*w+1] + a[0+(y+3)*w+2] + a[0+(y+3)*w+3];

    accum[1] = a[4+y*w+0]     + a[4+y*w+1]     + a[4+y*w+2]     + a[4+y*w+3] +
               a[4+(y+1)*w+0] + a[4+(y+1)*w+1] + a[4+(y+1)*w+2] + a[4+(y+1)*w+3] +
               a[4+(y+2)*w+0] + a[4+(y+2)*w+1] + a[4+(y+2)*w+2] + a[4+(y+2)*w+3] +
               a[4+(y+3)*w+0] + a[4+(y+3)*w+1] + a[4+(y+3)*w+2] + a[4+(y+3)*w+3];

and so on up to accum[n-1]. Although I haven't shown it above, the original loop also contains a conditional block that sometimes does something different with the partial sum.
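For reference, here is that first pattern written back as a loop nest, in plain C rather than OpenCL C so it can be run on the host. This is only a sketch: the function name `tile_sums_tile_major` and the bounds assumptions in the comment are made up for illustration, and the conditional block is left out.

```c
#include <stddef.h>

/* Hypothetical helper: loop form of the tile-by-tile unrolled pattern.
 * accum[t] sums the 4x4 block whose top-left element is a[4*t + y*w].
 * Assumes the caller guarantees 4*n <= w and that rows y..y+3 exist. */
static void tile_sums_tile_major(const float *a, size_t w, size_t y,
                                 float *accum, size_t n)
{
    for (size_t t = 0; t < n; ++t) {
        float s = 0.0f;
        for (size_t r = 0; r < 4; ++r)          /* rows y .. y+3        */
            for (size_t c = 0; c < 4; ++c)      /* four adjacent columns */
                s += a[4 * t + (y + r) * w + c];
        accum[t] = s;
    }
}
```

Within one accum[t], every step down a row jumps by w elements, which is the strided part of the access.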

As you can see, the access is strided for whatever y and w happen to be. So I might alter it as follows. Would this be faster, at least in terms of walking across memory?

    accum[0]  = a[0+y*w+0]     + a[0+y*w+1]     + a[0+y*w+2]     + a[0+y*w+3];
    accum[1]  = a[4+y*w+0]     + a[4+y*w+1]     + a[4+y*w+2]     + a[4+y*w+3];
    // ... etc. to some n or n/tile_xsize ...

    accum[0] += a[0+(y+1)*w+0] + a[0+(y+1)*w+1] + a[0+(y+1)*w+2] + a[0+(y+1)*w+3];
    accum[1] += a[4+(y+1)*w+0] + a[4+(y+1)*w+1] + a[4+(y+1)*w+2] + a[4+(y+1)*w+3];
    // ... etc. to some n or n/tile_xsize ...

    accum[0] += a[0+(y+2)*w+0] + a[0+(y+2)*w+1] + a[0+(y+2)*w+2] + a[0+(y+2)*w+3];
    accum[1] += a[4+(y+2)*w+0] + a[4+(y+2)*w+1] + a[4+(y+2)*w+2] + a[4+(y+2)*w+3];
    // ... etc. to some n or n/tile_xsize ...

    accum[0] += a[0+(y+3)*w+0] + a[0+(y+3)*w+1] + a[0+(y+3)*w+2] + a[0+(y+3)*w+3];
    accum[1] += a[4+(y+3)*w+0] + a[4+(y+3)*w+1] + a[4+(y+3)*w+2] + a[4+(y+3)*w+3];
    // ... etc. to some n or n/tile_xsize ...

So now each pass walks along one row across all the accumulators before moving down to the next row, instead of finishing one 4x4 block at a time.
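The reordered version can be sketched as a loop nest too, again in host-side C with a made-up name (`tile_sums_row_major`) and the same assumed bounds. It shows only the access order and says nothing about which order is faster on a given GPU.

```c
#include <stddef.h>

/* Hypothetical helper: loop form of the row-by-row reordering.
 * Each pass over r reads row y+r left to right across all n blocks.
 * Assumes the caller guarantees 4*n <= w and that rows y..y+3 exist. */
static void tile_sums_row_major(const float *a, size_t w, size_t y,
                                float *accum, size_t n)
{
    for (size_t t = 0; t < n; ++t)
        accum[t] = 0.0f;

    for (size_t r = 0; r < 4; ++r) {
        const float *row = a + (y + r) * w;     /* start of row y+r */
        for (size_t t = 0; t < n; ++t) {
            /* Elements 4*t .. 4*t+3 of this row, contiguous in memory. */
            accum[t] += row[4 * t] + row[4 * t + 1]
                      + row[4 * t + 2] + row[4 * t + 3];
        }
    }
}
```

Within a row the reads are contiguous from one accumulator to the next, and the jump by w happens only once per row. The additions are grouped differently from the first version, so with real float data the two results can differ in the last bits.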

So to restate my questions, and add another:

• Which of the above unrolled alternatives would perform better?
• Where should I put accum[]? If I make it private (it could be over 100 floats), I expect it may spill to L1. Am I right, and would that be very bad?
• In what situations might it make sense to write anything back to global memory?
• My tentative thought was to put individual work-items on some of these blocks, which might make sense for a kernel with a much larger number of partial sums...
• Can the GPU exploit any sort of ILP in the unrolled loops I've outlined?
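To make the ILP question concrete, here is a small host-side C sketch of what I mean. The function names are made up for illustration. It contrasts one long chain of dependent adds with four independent chains that are combined at the end. Whether a particular GPU and compiler actually overlap the independent adds is exactly what I don't know, and would presumably show up in the generated IL/ISA.

```c
/* Hypothetical illustration: one dependent chain of 16 adds. */
static float sum16_single_chain(const float *v)
{
    float s = 0.0f;
    for (int i = 0; i < 16; ++i)
        s += v[i];              /* each add needs the previous result */
    return s;
}

/* Hypothetical illustration: four independent chains of 4 adds each. */
static float sum16_four_chains(const float *v)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < 16; i += 4) {
        s0 += v[i];             /* these four adds do not depend */
        s1 += v[i + 1];         /* on one another, only on their */
        s2 += v[i + 2];         /* own previous partial sum      */
        s3 += v[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

In the unrolled code above, the updates to accum[0], accum[1], and so on are independent of each other in the same way.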

Thanks!