
NURBS
Journeyman III

CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!

Thanks for your quick response. Would using global memory be the same as letting it spill to private memory? I thought global memory is used when registers spill?

What happens when local memory spills?

MicahVillmow
Staff


NURBS,
You cannot spill local memory: it is allocated by the program, and if you allocate too much, compilation fails. Global memory and scratch are both device memory, but global memory can be cached while scratch memory is not, so global memory can be quite a bit faster.
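As a concrete way to see what the runtime reports, the two quantities discussed here can be queried with clGetKernelWorkGroupInfo. This is only a sketch of the query; it assumes `kernel` is an already-built cl_kernel and `device` is its cl_device_id:

```c
// Query what the compiler reserved for this kernel on this device
// (assumes `kernel` and `device` already exist).
cl_ulong private_bytes = 0, local_bytes = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_bytes), &private_bytes, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(local_bytes), &local_bytes, NULL);
// Non-zero CL_KERNEL_PRIVATE_MEM_SIZE indicates registers spilled to
// scratch (device memory). Local memory is fixed at allocation time:
// it either fits or the build fails -- it never spills.
```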
NURBS
Journeyman III


Thanks again. "Device memory" is what I should have said (brain froze).

Is there a recommended strategy for designing the kernel so that caching of global memory is more effective?

 

MicahVillmow
Staff


Read the memory section of our programming guide. It should have all of the information you need.

cantallo
Journeyman III


I had the same problem on an NVidia card:

using an array => private memory reported

using plain registers => zero private memory reported (no spill)

CL_KERNEL_WORK_GROUP_SIZE allows automatic tuning, but I must not compile with the reqd_work_group_size attribute, since that would raise CL_KERNEL_WORK_GROUP_SIZE to the requested value (provided local memory is not exhausted) and force spilling.

On NVidia, clGetDeviceInfo(... CL_DEVICE_REGISTERS_PER_BLOCK_NV ...) gives the size of the register file (on AMD it is 64*256*(32 bits*4) AFAIK), but on both GPUs I have understood that register addressing allows only 128 registers per thread (and a handful of them contain group_id, local_id, constant kernel args, ...).

The only portable way I found (tested on 5 models of NVidia cards and 1 model of AMD card) was to start from the group size given by CL_DEVICE_MAX_WORK_GROUP_SIZE, compile without the reqd_work_group_size attribute, and check CL_KERNEL_WORK_GROUP_SIZE. If it is below the tested group size, lower the tested size (by a step your code's constraints allow, NOT to the value returned by CL_KERNEL_WORK_GROUP_SIZE, otherwise you'll end up with a group size that is too small) and repeat until CL_KERNEL_WORK_GROUP_SIZE returns a value >= your tested value.

It is tedious to program and slow to compile, so I suggest that CL_DEVICE_REGISTERS_PER_BLOCK and something like CL_DEVICE_REGISTERS_PER_THREAD be required by forthcoming OpenCL specifications (just to have a good estimate of register availability to start from).
