If I had a pointer arg to a constant buffer passed to my kernel i.e.
kernel void myKernel(constant my_struct* settingsForEachThread __attribute__((max_constant_size(16384)))) {
...and I accessed each item of this pointer as such...
settingsForEachThread[get_global_id(0)]
... would this suffice to fulfill section 4.6, "Constant Memory Optimization," of the ATI Stream SDK OpenCL Programming Guide, specifically clause #1, which states:
"Simple Direct-Addressing Patterns
Very high bandwidth can be attained when the compiler has available the
constant address at compile time and can embed the constant address into
the instruction. Each processing element can load up to 4x4-byte direct-
addressed constant values each cycle. Typically, these cases are limited to
simple non-array constants and function parameters. The GPU loads the
constants into a hardware cache at the beginning of the clause that uses the
constants. The cache is a tagged cache, typically each 8k blocks is shared
among four compute units. If the constant data is already present in the
constant cache, the load is serviced by the cache and does not require any
global memory bandwidth. The constant cache size for each device is given
in Appendix D, “Device Parameters”; it varies from 4k to 48k per GPU."
...assuming my buffer fits in the cache?
If so, would the load latency be close to one clock cycle on the first load of that address? That is, would the compiler be clever enough to preemptively load these array elements so that they are ready when needed (or at least closer to ready than a simple global load)?
If settingsForEachThread[get_global_id(0)] does indeed fulfill the requirements for simple direct-addressing patterns, can indexing math with literals applied to get_global_id()'s result also fulfill them, i.e.
settingsForEachThread[get_global_id(0) % 4] ?
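For concreteness, a sketch of what I mean by the modulo pattern (my_struct's fields and the output buffer are placeholders, not my real code); with % 4, every work-item reads one of only four structs, so the working set is tiny regardless of the buffer's declared size:

```c
// Hypothetical sketch: my_struct and the gain/bias math are placeholders.
typedef struct { float gain; float bias; } my_struct;

kernel void myKernelMod4(constant my_struct* settingsForEachThread
                             __attribute__((max_constant_size(16384))),
                         global float* out)
{
    size_t gid = get_global_id(0);
    // Only elements 0..3 are ever touched, whatever the global size is.
    const my_struct s = settingsForEachThread[gid % 4];
    out[gid] = out[gid] * s.gain + s.bias;
}
```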
...if either case is true, would I be able to make repeated reads of this cached value with latency competitive with LDS and GPRs, for the purpose of conserving GPR space, i.e.
operation(settingsForEachThread[get_global_id(0)]);
anotherOperation(settingsForEachThread[get_global_id(0)]);
...instead of
const my_struct thisThreadSettings = settingsForEachThread[get_global_id(0)];
operation(thisThreadSettings);
anotherOperation(thisThreadSettings);
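To make the comparison concrete, here is a hypothetical complete kernel for the private-copy variant; my_struct, operation, and anotherOperation are placeholders standing in for my real code:

```c
// Hypothetical sketch: my_struct and both helper functions are placeholders.
typedef struct { float gain; float bias; } my_struct;

float operation(my_struct s, float x)        { return x * s.gain; }
float anotherOperation(my_struct s, float x) { return x + s.bias; }

kernel void myKernel(constant my_struct* settingsForEachThread
                         __attribute__((max_constant_size(16384))),
                     global float* data)
{
    size_t gid = get_global_id(0);
    // One constant-memory read into a private (GPR-backed) copy...
    const my_struct thisThreadSettings = settingsForEachThread[gid];
    // ...which both operations then reuse, at the cost of GPR space.
    data[gid] = anotherOperation(thisThreadSettings,
                                 operation(thisThreadSettings, data[gid]));
}
```

The repeated-read variant would simply pass settingsForEachThread[gid] to each call instead of making the copy.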
Is there anything glaringly obvious that I could change to make this work (better)?
What happens when I have multiple constant buffers where each alone will fit in the cache but altogether exceed the cache size?
In case this detail is necessary: It would be on Redwood hardware.
Thanks for reading. Partial answers and (properly disclaimed) speculation also welcome.
Message was edited by: Chuck Ritola - fixed typos, added detail