I have a kernel where each work-item must read 16 floats from global memory (sequentially) and perform some computation on them. Adjacent work-items may or may not access the same 16 floats; this depends on several input parameters, and it is computationally expensive to determine if/when work-items share the same memory.
My current approach is to copy the 16 float values directly from global memory to private memory, accepting that this will cause some redundant memory accesses among work-items in the same work-group. The 16 floats are aligned to 8-byte boundaries (the input array holds float2s), so, for example, I might compute that I need to copy items 99 through 106 of the float2 input array (8 float2s, i.e. 16 floats).
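For clarity, here is a minimal sketch of that approach; the names (`input`, `computed_index`, the loop bounds) are placeholders for my actual code:

```c
// OpenCL C kernel fragment (not standalone): copy 8 float2s (16 floats)
// from global memory into private memory, one element at a time.
__kernel void process(__global const float2 *input, __global float *out)
{
    int computed_index = 0; /* per-work-item index computation elided */
    float2 priv[8];         /* 16 floats of private memory */
    for (int i = 0; i < 8; ++i)
        priv[i] = input[computed_index + i]; /* may duplicate loads done by neighbors */
    /* ... computation on priv ... */
}
```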
My first question is this: can I copy the 16 floats using this method?
float16 private_memory_data = (*(global float16*)&input[computed_index]); // where input is a global float2 *
This compiles and runs, though it is hard for me to verify whether it is actually working (for various unrelated reasons). The target platform is a FirePro S9150 (Hawaii), but this code crashes on an NVIDIA board that doesn't support byte-addressable memory (which I have to use for development due to company IT policy...). Previously I copied each float individually from global memory, and that ran fairly slowly.
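For comparison, the alternative I'm aware of from the OpenCL C spec is `vloadn`, which only requires the source pointer to be aligned to the element type (`float`), not to the full `float16`; a sketch, again using my placeholder names:

```c
// OpenCL C kernel fragment: vload16 needs only 4-byte (float) alignment,
// whereas dereferencing a (global float16 *) cast assumes 64-byte alignment.
float16 private_memory_data =
    vload16(0, (__global const float *)&input[computed_index]);
```

My understanding is that the pointer-cast version is only defined when the address is aligned to sizeof(float16), so the vload16 form may be the portable one; I'd welcome confirmation.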
Also, is there any sort of automatic caching in local memory when many work-items access the same global memory location, or is this something I would need to code myself?