I am currently reading through the AMD PDF ATI_Stream_SDK_OpenCL_Programming_Guide and have some questions about the global memory optimization section.
The guide says: "Note that the memory segments indexed through base addresses A0 to An are not required to line up sequentially; for optimal performance, they must be aligned to 128 bytes and must not overlap."
My kernel currently uses a 256 MB array of uint2 values. I made sure that the host memory is aligned by allocating it with: cl_uint *searchStrings = (cl_uint*)_aligned_malloc(sizeof(cl_uint2) * numCombinations, 16);
That array is passed to my kernel via a write buffer (mem object). In the kernel it is only read, 8 times per work-item (each value is used in an addition).
But I'm really unsure how to align to 128 bytes and what that actually means.
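
For reference, here is a minimal sketch of what I understand a 128-byte-aligned host allocation to look like. It uses C11 aligned_alloc rather than the MSVC-specific _aligned_malloc so that it compiles elsewhere. The names uint2_host, round_up, is_aligned and alloc_aligned are hypothetical helpers of mine, and uint2_host only stands in for cl_uint2. Whether aligning the host pointer is what the guide means by aligning the memory segments on the device is an assumption I have not verified.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for cl_uint2 (two 32-bit unsigned values, 8 bytes). */
typedef struct { uint32_t x, y; } uint2_host;

/* Round bytes up to the next multiple of alignment (a power of two). */
static size_t round_up(size_t bytes, size_t alignment) {
    return (bytes + alignment - 1) & ~(alignment - 1);
}

/* True if the address p is a multiple of alignment. */
static int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocate at least bytes bytes whose start address is a multiple of
 * alignment. C11 aligned_alloc expects the size to be a multiple of the
 * alignment, so the size is rounded up first. Release with free(). */
static void *alloc_aligned(size_t bytes, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return aligned_alloc(alignment, round_up(bytes, alignment));
}
```

With the _aligned_malloc call from my code, the equivalent would be passing 128 instead of 16 as the alignment argument. As far as I understand it, "aligned to 128 bytes" then simply means that the start address is a multiple of 128, which is what is_aligned(ptr, 128) checks.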