I have several doubts about the caches in ATI's OpenCL implementation. Could anybody clarify these for me, please?
1. Texture cache. Each time I sample a texture, I assume the returned value is cached. What is the cache size? 4 KB to 6 KB? How well does the cache perform if all work-items access the same data? Do I need to coalesce these accesses?
2. Local cache. Is local memory cached? Does it need coalescing? What is its size? 16 KB?
3. Constant cache. More of the same: what is its size? 64 KB? Does it need coalescing? How well does the cache perform if all work-items access the same data? How well if each work-item accesses unaligned data?
4. Global cache. Is global memory cached? If so, what is the cache size?
I think you should clarify all this with nice visual diagrams, as is done in the CUDA documents, with coloured blocks, arrows, etc., and include a table with the sizes, optimal alignments, etc.
One more question: is it possible inside a kernel to move data from global memory to the constant cache, or can this only be done with a clXXXXXX host function before the kernel is executed?