1 Reply Latest reply on May 15, 2012 6:47 PM by notzed

    Does get_global_id(...) enable direct-addressing from a constant pointer?


      If I had a pointer arg to a constant buffer passed to my kernel i.e.


      kernel void myKernel( constant my_struct* settingsForEachThread __attribute__((max_constant_size(16384))) {


      ...and I accessed each item of this pointer as such...




      ... would this suffice to fulfill the "4.6 Constant Memory Optimization" of the ATI StreamSDK OpenCL Programming Guide, specifically clause #1 which states:

      "Simple Direct-Addressing Patterns

      Very high bandwidth can be attained when the compiler has available the

      constant address at compile time and can embed the constant address into

      the instruction. Each processing element can load up to 4x4-byte direct-

      addressed constant values each cycle. Typically, these cases are limited to

      simple non-array constants and function parameters. The GPU loads the

      constants into a hardware cache at the beginning of the clause that uses the

      constants. The cache is a tagged cache, typically each 8k blocks is shared

      among four compute units. If the constant data is already present in the

      constant cache, the load is serviced by the cache and does not require any

      global memory bandwidth. The constant cache size for each device is given

      in Appendix D, “Device Parameters”; it varies from 4k to 48k per GPU."


      ...assuming my buffer fits in the cache?


      If so, would the load latency be close to one clock cycle on the first load of that address? - as in - would the compiler be clever enough to perform a preemptive load of these array elements such that they are ready when needed (or at least closer-to-ready than a simple global load)?


      If the settingsForEachThread[get_global_id(0)] does indeed fulfill the requirements for Simple Direct-Addressing Patterns, can math with literals affecting get_global_id()'s result also fulfill this, i.e.


      settingsForEachThread[get_global_id(0)%4]   ??


      ...in either case being true, would I be able to make repeated requests of this cache value with latency competitive with LDS and GPRs, for the purpose of conserving GPR space, i.e.





      ..instead of

      const my_struct thisThreadSettings = settingsForEachThread[get_global_id(0)];




      Is there anything glaringly obvious that I could change to make this work (better)?


      What happens when I have multiple constant buffers where each alone will fit in the cache but altogether exceed the cache size?


      In case this detail is necessary: It would be on Redwood hardware.


      Thanks for reading. Partial answers and (properly disclaimed) speculation also welcome.


      Message was edited by: Chuck Ritola  - fixed typos, added detail

        • Re: Does get_global_id(...) enable direct-addressing from a constant pointer?

          Well i'll speculate: I very much doubt it.


          All threads execute the same instructions, and there's no chance they could put a compile-time fixed 'n' into foo[n] and still access per-item data.


          Easiest way to check would be to look at the assembly dump from both cases and compare (or hope someone does it for you).


          As for your register saving stuff: I also doubt that.  The compiler will usually convert such multiple uses into a single one.  Of course, it may cost constant access differently too and do something 'smarter'.