2 Replies Latest reply on Sep 23, 2011 11:12 PM by notzed

    Memory usage design pattern

    fajsc88
      What is the typical memory usage design pattern for global, local, and private memory?

      I'm new to OpenCL and am experiencing performance issues that appear to be related to memory usage.  From what I've read in the AMD programmer's guide and the OpenCL spec, it seems the typical design pattern is:

      1.  Provide data set to kernel in global memory space.

      2.  Copy data set to local memory space.

      3.  Perform computations on local memory space.

      4.  Copy results from local memory space back to global memory space.

      5.  Application then reads the data from global memory space (e.g. clEnqueueReadBuffer).

      My question is how does this design pattern apply to large data sets?  The local memory space (LDS) on my GPU is only 32K, which appears to be fairly typical. If my program needs to work on a data set that is significantly larger, what is the design pattern to follow for optimal performance while still using LDS?
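      For reference, the five-step pattern might look like this in a kernel (a sketch only; the kernel and argument names are made up, and a real kernel would read the local tile from more than one work-item):

```c
__kernel void process(__global const float *in,   /* step 1: input arrives in global memory */
                      __global float *out,
                      __local float *tile)        /* one work-group-sized tile of LDS */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid];                          /* step 2: copy to local memory */
    barrier(CLK_LOCAL_MEM_FENCE);                 /* wait until the whole tile is loaded */

    float result = tile[lid] * 2.0f;              /* step 3: compute on local data */

    out[gid] = result;                            /* step 4: write results back to global */
}
/* step 5 happens on the host, e.g. clEnqueueReadBuffer(queue, out_buf, ...) */
```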

        • Memory usage design pattern
          nou

          You should use local memory only when the same data can be reused by multiple work-items in the same work-group.
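          For example (a sketch, with invented names): in a 3-point moving average, each input element is read by up to three different work-items, so staging a tile in local memory turns three global reads per element into one. Halo handling at work-group edges is omitted for brevity:

```c
__kernel void avg3(__global const float *in,
                   __global float *out,
                   __local float *tile)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    tile[lid] = in[gid];                /* each element loaded from global once... */
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid > 0 && lid < lsz - 1)       /* ...but read by three work-items here */
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```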

          • Memory usage design pattern
            notzed

            There isn't really a "design pattern" for its use: it depends completely on the problem being solved.

            But there are some rules of thumb for when it's useful:

            - when you can share data between threads

            - when you need to access the same data often (i.e. a cache)

            - when you can use it to re-arrange memory accesses to be 'memory friendly' (when they wouldn't otherwise be)

            e.g. an 'algorithm friendly' workgroup topology might not be 'memory friendly', but sometimes you can split the operation into parts: a memory friendly part which gathers data into local store, and an algorithm friendly part which works on that data.  Even if you only ever read that data once in the algorithm it could still be a significant win.
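            The classic illustration of that split is a matrix transpose (a sketch; the TILE size and names are assumptions, and width/height are assumed to be multiples of TILE). The 'memory friendly' part reads a square block with coalesced row accesses into local store; the 'algorithm friendly' part then writes the transposed block, again coalesced. Each element is only read once, yet the staging still wins because both the reads and the writes stay memory friendly:

```c
#define TILE 16   /* assumed TILE x TILE work-group */

__kernel void transpose(__global const float *in,
                        __global float *out,
                        int width, int height)
{
    __local float tile[TILE][TILE + 1];   /* +1 padding avoids LDS bank conflicts */

    int lx = get_local_id(0), ly = get_local_id(1);
    int gx = get_group_id(0) * TILE + lx;
    int gy = get_group_id(1) * TILE + ly;

    tile[ly][lx] = in[gy * width + gx];   /* memory friendly: coalesced row read */
    barrier(CLK_LOCAL_MEM_FENCE);

    int ox = get_group_id(1) * TILE + lx; /* swap the block coordinates */
    int oy = get_group_id(0) * TILE + ly;
    out[oy * height + ox] = tile[lx][ly]; /* coalesced write of a transposed row */
}
```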

            IMHO local memory is the real key feature that makes OpenCL/GPU worth it, but I find it one of the more challenging components to use effectively.

            Without some code to look at, it's hard to suggest whether local memory would be of any help to your problem.  But really, unless it's simply an element-by-element array operation, the answer is: probably yes.