1) Is it possible to have a local (per-thread) array stored in registers in OpenCL?
It was not possible in Brook+.
That is, if I write inside a kernel:
float4 buf[32];
will these 32 elements be placed in registers, or will they be spilled to global memory?
2) If a write to a global memory buffer sits in a branch that is not taken (with the branch decided per wavefront), will the write be skipped, or will zeros or junk be written anyway?
Also, is such a write skipped on a per-wavefront or a per-thread basis?
3) How many registers can each thread use while still hiding global memory read latency reasonably well? (Or, put the other way round: how many wavefronts per SIMD must be in flight simultaneously to hide read latency?)
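
To make the trade-off in question 3 concrete, here is a rough sketch in plain C of how register usage would cap the number of resident wavefronts. The constants REGS_PER_SIMD and WAVEFRONT_SIZE and the helper max_wavefronts_per_simd are my own assumptions for illustration (16384 128-bit registers per SIMD and 64 threads per wavefront); they are not confirmed figures for any particular device and should be replaced with the real values for the target hardware.

```c
#include <assert.h>

/* Assumed hardware parameters (hypothetical; adjust for the actual device). */
#define REGS_PER_SIMD  16384  /* assumed number of 128-bit registers per SIMD */
#define WAVEFRONT_SIZE 64     /* assumed number of threads per wavefront */

/* Upper bound on resident wavefronts per SIMD imposed by register usage alone.
 * Other limits (local memory, scheduler caps) are deliberately ignored here. */
static int max_wavefronts_per_simd(int regs_per_thread)
{
    assert(regs_per_thread > 0);
    return REGS_PER_SIMD / (regs_per_thread * WAVEFRONT_SIZE);
}
```

Under these assumed numbers, a kernel holding float4 buf[32] entirely in registers would need at least 32 registers per thread, which would leave room for at most 8 wavefronts per SIMD. Whether that many wavefronts are enough to hide read latency is exactly what I am asking.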