I'm new to OpenCL and am seeing performance issues that appear to be related to memory usage. From what I've read in the AMD programmer's guide and the OpenCL spec, the typical design pattern seems to be:
1. Provide data set to kernel in global memory space.
2. Copy data set to local memory space.
3. Perform computations on local memory space.
4. Copy results from local memory space back to global memory space.
5. Application then reads the results from global memory space (e.g., with clEnqueueReadBuffer).
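To make steps 2 to 4 concrete, here is roughly how I picture them, written as a plain C emulation on the CPU rather than a real kernel. This is only a sketch under my own assumptions: `TILE_ELEMS`, `process_tile`, and `process_all` are names I made up, the "double each element" computation is a placeholder, and the stack array merely stands in for a `__local` buffer. Each loop iteration in `process_all` plays the role of one work-group.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tile size: 4096 floats = 16 KB, chosen to fit in a 32 KB LDS. */
#define TILE_ELEMS 4096

/* Emulates one work-group: stage a tile, compute on it, write it back. */
void process_tile(const float *global_in, float *global_out, size_t count)
{
    float local_tile[TILE_ELEMS];   /* stands in for a __local buffer */
    assert(count <= TILE_ELEMS);

    /* Step 2: copy this group's slice from global to local memory. */
    memcpy(local_tile, global_in, count * sizeof(float));

    /* In a real kernel, a barrier(CLK_LOCAL_MEM_FENCE) belongs here if
       work-items read elements that other work-items loaded. */

    /* Step 3: compute on the local copy (placeholder: double each element). */
    for (size_t i = 0; i < count; ++i)
        local_tile[i] *= 2.0f;

    /* Step 4: copy the results from local back to global memory. */
    memcpy(global_out, local_tile, count * sizeof(float));
}

/* Emulates the whole index space: one iteration per work-group. */
void process_all(const float *global_in, float *global_out, size_t n)
{
    for (size_t base = 0; base < n; base += TILE_ELEMS) {
        size_t remaining = n - base;
        size_t count = remaining < TILE_ELEMS ? remaining : TILE_ELEMS;
        process_tile(global_in + base, global_out + base, count);
    }
}
```

Calling `process_all(in, out, n)` on two arrays of `n` floats leaves `out[i]` equal to `2 * in[i]`. Note that in this sketch only one tile is ever held in the "local" buffer at a time, regardless of `n`.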
My question is: how does this design pattern apply to large data sets? The local memory space (LDS) on my GPU is only 32 KB, which appears to be fairly typical. If my program needs to work on a data set that is significantly larger than that, what design pattern should I follow for optimal performance while still using LDS?