
Archives Discussions

bananafish
Adept I

Does get_global_id(...) enable direct-addressing from a constant pointer?

If I had a pointer arg to a constant buffer passed to my kernel i.e.

kernel void myKernel( constant my_struct* settingsForEachThread __attribute__((max_constant_size(16384))) ) {

...and I accessed each item through this pointer like so...

settingsForEachThread[get_global_id(0)]

... would this suffice to fulfill section "4.6 Constant Memory Optimization" of the ATI Stream SDK OpenCL Programming Guide, specifically clause #1, which states:

"Simple Direct-Addressing Patterns

Very high bandwidth can be attained when the compiler has available the constant address at compile time and can embed the constant address into the instruction. Each processing element can load up to 4x4-byte direct-addressed constant values each cycle. Typically, these cases are limited to simple non-array constants and function parameters. The GPU loads the constants into a hardware cache at the beginning of the clause that uses the constants. The cache is a tagged cache, typically each 8k blocks is shared among four compute units. If the constant data is already present in the constant cache, the load is serviced by the cache and does not require any global memory bandwidth. The constant cache size for each device is given in Appendix D, “Device Parameters”; it varies from 4k to 48k per GPU."

...assuming my buffer fits in the cache?

If so, would the load latency be close to one clock cycle on the first load of that address? That is, would the compiler be clever enough to prefetch these array elements so that they are ready when needed (or at least closer to ready than a plain global load)?

If settingsForEachThread[get_global_id(0)] does indeed fulfill the requirements for Simple Direct-Addressing Patterns, would arithmetic with literals on get_global_id()'s result also qualify, e.g.

settingsForEachThread[get_global_id(0) % 4]   ?
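For concreteness, here is a minimal sketch of the three addressing patterns in question (my_struct's contents, the kernel name, and the literal index 3 are all placeholders of my own):

```c
// Hypothetical kernel -- my_struct and its field are placeholders.
typedef struct { float4 gain; } my_struct;

kernel void demo(constant my_struct* settings
                     __attribute__((max_constant_size(16384))),
                 global float4* out)
{
    size_t gid = get_global_id(0);

    float4 a = settings[3].gain;        // index known at compile time
    float4 b = settings[gid].gain;      // index varies per work-item
    float4 c = settings[gid % 4].gain;  // still varies per work-item

    out[gid] = a + b + c;
}
```

As I read the quoted clause, only the first load has an address the compiler could embed into the instruction; the other two depend on the work-item ID at runtime.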

If either case holds, would I be able to make repeated reads of this cached value with latency competitive with LDS and GPRs, in order to conserve GPR space, i.e.

operation(settingsForEachThread[get_global_id(0)]);

anotherOperation(settingsForEachThread[get_global_id(0)]);

...instead of

const my_struct thisThreadSettings = settingsForEachThread[get_global_id(0)];

operation(thisThreadSettings);

anotherOperation(thisThreadSettings);

Is there anything glaringly obvious that I could change to make this work (better)?

What happens when I have multiple constant buffers where each alone will fit in the cache but altogether exceed the cache size?

In case this detail is necessary: It would be on Redwood hardware.

Thanks for reading. Partial answers and (properly disclaimed) speculation also welcome.

Message was edited by: Chuck Ritola  - fixed typos, added detail

notzed
Challenger

Well, I'll speculate: I very much doubt it.

All threads execute the same instructions, and there's no way the compiler could embed a compile-time-fixed index into the load instruction and still access per-item data.

The easiest way to check would be to look at the assembly dumps from both cases and compare them (or hope someone does it for you).
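If memory serves (an assumption -- the exact mechanism varies by SDK release, so check your SDK's documentation), the Stream SDK can dump the compiled IL/ISA for every kernel it builds when an environment variable is set:

```shell
# Assumption: GPU_DUMP_DEVICE_KERNEL is honoured by this SDK release.
# A value of 3 dumps both the AMD IL and the ISA for each kernel built.
export GPU_DUMP_DEVICE_KERNEL=3
./yourOpenCLApp   # hypothetical host binary; dump files land in the cwd
# Then diff the generated .isa files from the two indexing variants.
```

The Stream KernelAnalyzer is another way to get at the ISA without writing a host program.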

As for your register-saving idea: I also doubt that. The compiler will usually convert such multiple uses into a single one. Of course, it may cost constant accesses differently and do something 'smarter'.
