If you have a kernel that operates on a bunch of float4s and your GPU has a 256-bit data path, would it make sense to read the incoming data as float8s and then access each one as two float4s (via a pointer, perhaps)? Would that successfully hide the memory latency of one of the float4 accesses?
Assuming that works, what are the ramifications of compiling that same code for a CPU context? Will it still produce correct results without suffering any performance degradation?
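To make the question concrete, here is a minimal sketch of the pattern I have in mind. The kernel name `scale_pairs`, the `factor` argument, and the multiply itself are made up for illustration, and I am assuming the buffers hold a whole number of float8 elements. It only shows the access pattern; it says nothing about whether latency is actually hidden on a given device.

```opencl
// Hypothetical kernel (names are illustrative, not from any guide):
// each work-item reads one float8 and processes it as two float4 halves.
// Assumes `in` and `out` each hold one float8 per work-item.
__kernel void scale_pairs(__global const float8 *in,
                          __global float8 *out,
                          const float factor)
{
    const size_t gid = get_global_id(0);

    // One float8 (32-byte) read per work-item; how this maps onto
    // hardware memory transactions is device-dependent.
    const float8 v = in[gid];

    // .lo and .hi are the standard OpenCL C vector halves
    // (components 0-3 and 4-7), so no pointer cast is needed.
    const float4 a = v.lo * factor;
    const float4 b = v.hi * factor;

    // Recombine the halves and write them back as a single float8.
    out[gid] = (float8)(a, b);
}
```

Since `.lo`/`.hi` are part of OpenCL C itself, I would expect the same source to compile for a CPU device too; what I cannot tell is how the speed compares there.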
kbrafford,
The 'OpenCL Performance and Optimization' section of the OpenCL Programming Guide explains memory optimizations in detail. It should answer your question and give you an idea of how to access memory efficiently.
Nice PDF. Btw, why is the constant buffer limited to 16 KB? Are there 4 banks?
Also, couldn't the max_constant_size attribute be replaced in code by an explicit array size? That is, instead of
kernel void mykernel(global int* a,
__constant int* b __attribute__((max_constant_size (16384))) )
could one write
kernel void mykernel(global int* a,
__constant int b[16384] )
?