Thanks for your quick response. Would using global memory be the same as letting it spill to private memory? I thought global memory is used when registers spill?
What happens when local memory spills?
Thanks again. "Device Memory" is what I should have said (brain froze).
Is there a recommended strategy for designing the kernel so that caching is more effective with global memory?
I had the same problem on an NVidia card:
using an array => private memory reported
using plain registers => zero private memory reported (no spill)
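To illustrate, here is a minimal sketch (hypothetical kernel, OpenCL C) of the two variants I compared — the indexed array typically gets placed in private (off-chip) memory, while the unrolled scalars can stay in registers:

```opencl
// Variant 1: indexed array -- compilers often place this in
// private memory, so the profiler reports private memory usage.
__kernel void acc_array(__global float *out) {
    float tmp[4];                       // indexed storage
    for (int i = 0; i < 4; ++i)
        tmp[i] = (float)(get_global_id(0) + i);
    out[get_global_id(0)] = tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

// Variant 2: plain scalars -- each value can live in a register,
// and zero private memory is reported.
__kernel void acc_regs(__global float *out) {
    float t0 = (float)(get_global_id(0) + 0);
    float t1 = (float)(get_global_id(0) + 1);
    float t2 = (float)(get_global_id(0) + 2);
    float t3 = (float)(get_global_id(0) + 3);
    out[get_global_id(0)] = t0 + t1 + t2 + t3;
}
```

Whether variant 1 actually spills depends on the compiler (some can promote small fixed-size arrays to registers), so treat this as illustrative rather than guaranteed behavior.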
CL_KERNEL_WORK_GROUP_SIZE allows automatic tuning, but I must not compile with the reqd_work_group_size attribute: it would force CL_KERNEL_WORK_GROUP_SIZE up to that value (provided local memory is not exhausted) and force spilling.
On NVidia, clGetDeviceInfo(...CL_DEVICE_REGISTERS_PER_BLOCK_NV...) gives the size of the register file (on AMD it is 64*256*(32 bits*4) AFAIK), but on both GPUs my understanding is that register addressing only allows 128 registers/thread (and a handful of them hold group_id, local_id, constant kernel args...).
The only portable way I found (tested on 5 models of NVidia cards and 1 model of AMD card) is: start from the group size given by CL_DEVICE_MAX_WORK_GROUP_SIZE, compile without the reqd_work_group_size attribute, and check CL_KERNEL_WORK_GROUP_SIZE. If it is below the tested group size, lower the tested size (depending on your code constraints, NOT straight down to the value returned by CL_KERNEL_WORK_GROUP_SIZE, otherwise you'll end up with a too-small group size) and repeat until CL_KERNEL_WORK_GROUP_SIZE returns >= your tested value.
It is tedious to program and slow to compile, so I suggest adding CL_DEVICE_REGISTERS_PER_BLOCK and something like CL_DEVICE_REGISTERS_PER_THREAD to forthcoming OpenCL specifications (just to get a good initial estimate of register availability to start with).