4 Replies Latest reply on Mar 20, 2015 5:48 PM by tzachi.cohen

    private vs local vs global for in-between sums and my memory access pattern

    gpgpucoder

      I am looking at hand-optimizing a kernel, starting by hand-unrolling its loops. The computations are a bit more complex than what I'm describing, but for sake of discussion, it accumulates some partial "sums".

       

      While thinking it over, I was trying to figure out most promising place to store the partial sums, as there could be a lot of them. My first thought was private, but as I said there could be a lot. Also, at the moment in the versions of my kernel using local memory, I'm getting close to tapping it out, so I'd have to reduce workgroup size if I'm to grab more local memory for the individual work-items. I can say also my first preliminary experiments with this aren't showing any appreciable improvement from the unrolling. I haven't examined any of the IL in those cases yet, just wanted to ask to get going in most fruitful direction.

       

      Anyhoo, the pattern for a first pass at hand-unrolling could look like this:

          accum[0] = a[0+y*w+0]     + a[0+y*w+1]     + a[0+y*w+2]     + a[0+y*w+3] +

                     a[0+(y+1)*w+0] + a[0+(y+1)*w+1] + a[0+(y+1)*w+2] + a[0+(y+1)*w+3] +

                     a[0+(y+2)*w+0] + a[0+(y+2)*w+1] + a[0+(y+2)*w+2] + a[0+(y+2)*w+3] +

                     a[0+(y+3)*w+0] + a[0+(y+3)*w+1] + a[0+(y+3)*w+2] + a[0+(y+3)*w+3];

          accum[1] = a[4+y*w+0]     + a[4+y*w+1]     + a[4+y*w+2]     + a[4+y*w+3] +

                     a[4+(y+1)*w+0] + a[4+(y+1)*w+1] + a[4+(y+1)*w+2] + a[4+(y+1)*w+3] +

                     a[4+(y+2)*w+0] + a[4+(y+2)*w+1] + a[4+(y+2)*w+2] + a[4+(y+2)*w+3] +

                     a[4+(y+3)*w+0] + a[4+(y+3)*w+1] + a[4+(y+3)*w+2] + a[4+(y+3)*w+3];

      and so on for my accum[n-1]. Although I haven't shown it above, within the original loop there's a conditional block to do different things sometimes with the partial sum.

      As you can see there is some strided memory access for whatever y and w happen to be. So I may like to alter it as follows... would this be faster? At least in terms of walking across memory?


          accum[0]  = a[0+y*w+0]     + a[0+y*w+1]     + a[0+y*w+2]     + a[0+y*w+3];

          accum[1]  = a[4+y*w+0]     + a[4+y*w+1]     + a[4+y*w+2]     + a[4+y*w+3];

           ... etc to some n or n/tile_xsize...

          accum[0] += a[0+(y+1)*w+0] + a[0+(y+1)*w+1] + a[0+(y+1)*w+2] + a[0+(y+1)*w+3];

          accum[1] += a[4+(y+1)*w+0] + a[4+(y+1)*w+1] + a[4+(y+1)*w+2] + a[4+(y+1)*w+3];

           ... etc to some n or n/tile_xsize...

          accum[0] += a[0+(y+2)*w+0] + a[0+(y+2)*w+1] + a[0+(y+2)*w+2] + a[0+(y+2)*w+3];

          accum[1] += a[4+(y+2)*w+0] + a[4+(y+2)*w+1] + a[4+(y+2)*w+2] + a[4+(y+2)*w+3];

           ... etc to some n or n/tile_xsize...

          accum[0] += a[0+(y+3)*w+0] + a[0+(y+3)*w+1] + a[0+(y+3)*w+2] + a[0+(y+3)*w+3];

          accum[1] += a[4+(y+3)*w+0] + a[4+(y+3)*w+1] + a[4+(y+3)*w+2] + a[4+(y+3)*w+3];

           ... etc to some n or n/tile_xsize...


      So now the memory access pattern would be changed.


      So to restate my questions, and add another:

      • Which of the above unrolled alternatives would perform better?
      • Where should I put accum[]? If I make it private (it could be over 100 floats), I expect it may spill to L1... am I right, would that be very bad?
      • In what situations might it make sense to write anything back to global memory?
        • My tentative thoughts were to see if I can put individual work items on some of these blocks, which might make sense for a kernel with a much larger number of partial sums...
      • Can the GPU use any sort of ILP in what I've outlined in these unrolled loops?


      Thanks!


        • Re: private vs local vs global for in-between sums and my memory access pattern

          I'll point some of the AMD engineers at this, but just to set expectations - they focus mostly on support-type questions, as in, "something is not working." or "how do I get an AMD tool to do what I need it to do."

           

          Anyone care to chime in? Sounds like a very interesting question to me.

          • Re: private vs local vs global for in-between sums and my memory access pattern
            tzachi.cohen

            Is 'accum[0]' and 'accum[1]' are mapped to the same work item? if so, the new pattern is beneficial. if 'accum[1]' and 'accum[2]' are two distinct work items of the same work group the change will make no difference. The access pattern heavily depends on the values of 'w' and 'y' .

             

            I will give you general guidelines:

            1.) AMD GPU cache line width is 64 bytes, hence whenever the kernel reads a DWORD the HW fetches the aligned 60 bytes around it to L1 cache for all threads running in the compute unit to enjoy.

            2.) LDS has double the read bandwidth of L1 cache! You have 64 KB of LDS in each compute unit - use it to cache data relevant to work items within the same work group.

            3.) Don't occupy yourself with loop unrollment, since the kernel is probably memory access bound, loop unrollment will have little effect. The compiler usually has better intuition with these decisions.

              • Re: private vs local vs global for in-between sums and my memory access pattern
                gpgpucoder

                Thank you Tzachi.

                 

                Yes, 'accum[0]' and 'accum[1]' are mapped to the same work item, as are the data they come from. For larger extents, I am considering using multiple work-items and then reduce through local, but that's a really big re-write, so I may code some simple mock-ups of that first.

                 

                The y and x extent of this would increment by 1. The w value will be static within a run, usually 1-2k, but could be a little bigger or smaller than that.


                Yes, this all appears to be very memory bandwidth bound. I didn't get as much of a boost with my local memory implementation as I had hoped, but it was a substantial improvement (our other discussion here). In fact in this particular instance, the overhead of going to local memory amounts to a slowdown on my Nvidia cards for most inputs, though the AMD cards generally handle it well with what I have right now. I have finished now working on another matter, so I can start to let all the work-group items help to stage the local memory, I think that will be the weekend project.

                tzachi.cohen wrote:

                 

                Is 'accum[0]' and 'accum[1]' are mapped to the same work item? if so, the new pattern is beneficial. if 'accum[1]' and 'accum[2]' are two distinct work items of the same work group the change will make no difference. The access pattern heavily depends on the values of 'w' and 'y' .

                 

                I will give you general guidelines:

                1.) AMD GPU cache line width is 64 bytes, hence whenever the kernel reads a DWORD the HW fetches the aligned 60 bytes around it to L1 cache for all threads running in the compute unit to enjoy.  ==> [me] I'm sure this is helping me then with my current implementation already. That also gives me a better understanding on how the L1 is used.

                2.) LDS has double the read bandwidth of L1 cache! You have 64 KB of LDS in each compute unit - use it to cache data relevant to work items within the same work group. ==> [me] I thought this was 32K, as clinfo on Hawaii says 32768, as well as on an older Turks card, for "Local memory size". Am I misunderstanding something?  I am using 13.251.9001.1001 driver, as I had regressed back to avoid malfunctions I had encountered on 14.12. (I had tried 14.9 or maybe 14.4 and found some other issue there so...)

                3.) Don't occupy yourself with loop unrollment, since the kernel is probably memory access bound, loop unrollment will have little effect. The compiler usually has better intuition with these decisions.  ==> [me] Good to know, my previous experience on CPU was loop unrolling would help in many such memory bandwidth bound cases.

                 

                Regards, me