Some performance-related questions about the current OpenCL implementation

Discussion created by Raistmer on Mar 25, 2010
Latest reply on Apr 14, 2010 by MicahVillmow
1) Is it possible to have a local (per-thread) array stored in registers in OpenCL?
This was not possible in Brook+.
That is, if I write inside a kernel:
float4 buf[32];
will these 32 elements be placed in registers, or will they be spilled to global memory?

2) If a write to a global memory buffer sits in a branch that is not taken (decided on a per-wavefront basis), will the write be skipped, or will zeros or junk be written anyway?
Also, is such a write skipped on a per-wavefront or a per-thread basis?

3) How many registers can be used per thread while still hiding global memory read latency more or less effectively? (Or, put differently, how many wavefronts per SIMD should be resident simultaneously to hide read latency?)
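
To make the trade-off in question 3 concrete, here is a rough sketch of the arithmetic linking per-thread register use to the number of wavefronts that fit on one SIMD. The register file size (16384 float4 registers per SIMD) and the wavefront size (64 threads) are assumptions for an Evergreen-class GPU, not figures confirmed in this thread, and the function name is made up for illustration. The sketch does not say how many wavefronts are enough to hide latency, since that depends on the kernel's ratio of ALU work to memory fetches.

```c
#include <stdio.h>

/* ASSUMPTION: figures for an Evergreen-class GPU; check them against the
 * documentation for the actual device before relying on the result. */
enum {
    ASSUMED_GPRS_PER_SIMD  = 16384, /* float4 (128-bit) registers per SIMD */
    ASSUMED_WAVEFRONT_SIZE = 64     /* threads per wavefront */
};

/* Hypothetical helper: upper bound on the number of wavefronts that can be
 * resident on one SIMD when each thread uses gprs_per_thread float4
 * registers. Other limits (local memory, scheduler caps) are ignored.
 * Returns 0 for a non-positive register count. */
int wavefronts_per_simd(int gprs_per_thread)
{
    if (gprs_per_thread <= 0) {
        return 0;
    }
    return ASSUMED_GPRS_PER_SIMD / (ASSUMED_WAVEFRONT_SIZE * gprs_per_thread);
}

/* Print the bound for a few register counts. */
void print_occupancy_table(void)
{
    const int counts[] = { 8, 16, 32, 64 };
    for (int i = 0; i < 4; ++i) {
        printf("%d GPRs/thread: at most %d wavefronts per SIMD\n",
               counts[i], wavefronts_per_simd(counts[i]));
    }
}
```

Under these assumptions, a kernel that keeps the whole float4 buf[32] from question 1 in registers needs at least 32 registers per thread, which leaves room for at most 8 wavefronts per SIMD.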