Archives Discussions

kbrafford · 06-04-2010

If you have a kernel that operates on a bunch of float4s and your GPU has a 256-bit data path, would it make sense to read the incoming data as float8s and then access each one as two float4s (via a pointer, perhaps)? Would that successfully hide the memory latency of one of the float4 accesses?

Assuming that works, what are the ramifications of compiling that same code for a CPU context? Will it still produce correct results and not suffer any performance degradation?
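
For the correctness half of that question, here is a minimal plain C sketch, not OpenCL C. The types `float4_t` and `float8_t` and the functions `scale4`, `scale_narrow` and `scale_wide` are hypothetical stand-ins invented for illustration; in OpenCL C itself a float8 can be loaded with `vload8` and its halves are available as `.lo` and `.hi`. The sketch only shows that one wide read split into two halves yields the same values as two narrow reads. It says nothing about latency or throughput on any particular device.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for OpenCL C's float4 and float8 vector types. */
typedef struct { float s[4]; } float4_t;
typedef struct { float s[8]; } float8_t;

/* Example per-float4 work: scale all four components by k. */
static float4_t scale4(float4_t v, float k) {
    for (int c = 0; c < 4; ++c) v.s[c] *= k;
    return v;
}

/* Narrow path: one float4-sized read per iteration. n counts floats. */
void scale_narrow(const float *in, float *out, size_t n, float k) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        float4_t v;
        memcpy(&v, in + i, sizeof v);
        v = scale4(v, k);
        memcpy(out + i, &v, sizeof v);
    }
}

/* Wide path: one float8-sized read, then the same work on each half. */
void scale_wide(const float *in, float *out, size_t n, float k) {
    assert(n % 8 == 0);
    for (size_t i = 0; i < n; i += 8) {
        float8_t v;
        float4_t lo, hi;
        memcpy(&v, in + i, sizeof v);     /* single wide read */
        memcpy(&lo, &v.s[0], sizeof lo);  /* low half, like .lo in OpenCL C */
        memcpy(&hi, &v.s[4], sizeof hi);  /* high half, like .hi in OpenCL C */
        lo = scale4(lo, k);
        hi = scale4(hi, k);
        memcpy(out + i, &lo, sizeof lo);
        memcpy(out + i + 4, &hi, sizeof hi);
    }
}
```

Note that the wide path needs the element count to be a multiple of 8 floats rather than 4, so a buffer with an odd number of float4s would need a separate tail step.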

omkaranathan · 06-04-2010

kbrafford,

The 'OpenCL Performance and Optimization' section of the OpenCL Programming Guide explains the memory optimizations in detail. That should answer your query and give you an idea of how to do efficient memory access.

ATI StreamSDK OpenCL Programming Guide

bubu · 06-04-2010

Nice PDF. Btw, why is the constant buffer limited to 16 KB? Are there 4 banks?

MicahVillmow · 06-04-2010

bubu,
That is a mistake that was not caught in time for the 2.1 release. In our next release the limit will be expanded to 64 KB, which is what the hardware supports natively.

bubu · 06-04-2010

Btw, can't the max_constant_size attribute be forced in code using this?

kernel void mykernel(global int* a,
__constant int* b __attribute__((max_constant_size (16384))) )

by

kernel void mykernel(global int* a,
__constant int b[16384] )

??

MicahVillmow · 06-04-2010

bubu,
That is something that will be available in our next release.

doing global reads as float8