
sonicx
Journeyman III

CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!

CL_KERNEL_PRIVATE_MEM_SIZE returns zero, always

Hello.

I have a kernel that uses a LOT of private memory. The amount it uses is not known until compile time (the array sizes are set by macros). To avoid register spilling, I try to find the optimal work-group size, and for that I need to know how much private memory I can use and how much the kernel needs. However, querying CL_KERNEL_PRIVATE_MEM_SIZE with getWorkGroupInfo after compiling my kernel never works; it just returns zero. getWorkGroupInfo itself seems to work for other queries (the returned values appear to be rounded up to the next register-size multiple).

I am not sure whether this is a bug or my own stupidity, because the same software shows the same problem on NVIDIA GPUs. I don't think my getWorkGroupInfo calls are wrong, since CL_KERNEL_COMPILE_WORK_GROUP_SIZE, for example, returns good values.

Has anybody ever gotten a real value for CL_KERNEL_PRIVATE_MEM_SIZE from getWorkGroupInfo, or does anyone know another way to avoid register spilling without hardcoding hand-counted values?

EDIT: Attached info

PS: Yes, I tried an explicit "__private", a plain "private", and omitting the qualifier.

Querying OpenCL... Searching for OpenCL platform... Found 1 platform(s): Plaform Profile: FULL_PROFILE Plaform Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Plaform Name: AMD Accelerated Parallel Processing Plaform Vendor: Advanced Micro Devices, Inc. Plaform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices Plaform Name: AMD Accelerated Parallel Processing Number of devices: 3 Device Type: CL_DEVICE_TYPE_GPU Device ID: 4098 Max compute units: 18 Max work items dimensions: 3 Max work items[0]: 256 Max work items[1]: 256 Max work items[2]: 256 Max work group size: 256 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 700Mhz Address bits: 32 Max memory allocation: 268435456 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 8192 Max image 2D height: 8192 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 1024 Alignment (bits) of base address: 32768 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 1073741824 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 0x7fd082579800 Name: Cypress Vendor: Advanced Micro Devices, Inc. 
Driver version: CAL 1.4.1385 Profile: FULL_PROFILE Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt Device Type: CL_DEVICE_TYPE_GPU Device ID: 4098 Max compute units: 18 Max work items dimensions: 3 Max work items[0]: 256 Max work items[1]: 256 Max work items[2]: 256 Max work group size: 256 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 700Mhz Address bits: 32 Max memory allocation: 268435456 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 8192 Max image 2D height: 8192 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 1024 Alignment (bits) of base address: 32768 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 1073741824 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 0x7fd082579800 Name: Cypress Vendor: Advanced Micro Devices, Inc. 
Driver version: CAL 1.4.1385 Profile: FULL_PROFILE Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt Device Type: CL_DEVICE_TYPE_CPU Device ID: 4098 Max compute units: 8 Max work items dimensions: 3 Max work items[0]: 1024 Max work items[1]: 1024 Max work items[2]: 1024 Max work group size: 1024 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 3423Mhz Address bits: 64 Max memory allocation: 2147483648 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 8192 Max image 2D height: 8192 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 4096 Alignment (bits) of base address: 1024 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: Yes Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: No Cache type: Read/Write Cache line size: 64 Cache size: 32768 Global memory size: 8377356288 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Global Local memory size: 32768 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: Yes Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 0x7fd082579800 Name: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz Vendor: GenuineIntel Driver version: 2.0 Profile: FULL_PROFILE 
Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_media_ops cl_amd_popcnt cl_amd_printf

14 Replies
genaganna
Journeyman III

Originally posted by: sonicx [...]

Thank you very much for reporting this issue. I am able to reproduce it. I have reported it to the developers, and a fix will be available in upcoming releases.


Thank you. I will try the new release.


Revive!

So I tried and I tried, to little avail. Currently I am using:

Name:                                          Cypress
  Vendor:                                        Advanced Micro Devices, Inc.
  Driver version:                                CAL 1.4.1546
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)


but I have tried every version I could get my hands on. Running kernel 3.1 seems to be a problem with the latest drivers.

However, can anybody tell me which version of the driver/SDK actually has working CL_KERNEL_PRIVATE_MEM_SIZE support?

The verbose compiler output (.il files or -cl-nv-verbose) seems to have the information I crave, so I guess OpenCL does "know" the amount of private bytes needed for a kernel. But parsing that information out of .il files or nv-verbose output is so hackish that I just can't put something like that in my app.

The OpenCL 1.1 refs seem to indicate that the local work size the implementation proposes for a kernel/device combination (CL_KERNEL_WORK_GROUP_SIZE) should take private memory consumption into account; it doesn't, however. The section reads:

"The OpenCL implementation uses the resource requirements of the kernel (register usage etc.) to determine what this work-group size should be."

But that may just be me not understanding it right. On the other hand, the refs are rather vague about the actual contents of CL_KERNEL_PRIVATE_MEM_SIZE:

"Returns the minimum amount of private memory, in bytes, used by each workitem in the kernel. This value may include any private memory needed by an implementation to execute the kernel, including that used by the language built-ins and variable declared inside the kernel with the __private qualifier."

Which to me sounds as if this value, even if it were set, would be pretty useless: the private memory consumption of a given kernel/device combination may still exceed it, which in turn means it could not be used to calculate a local work size that prevents out-of-memory errors.

After all this time I am wondering whether nobody else has this problem. It seems a common task to me: say you invert a matrix of user-specified size and want actual knowledge of the maximum size a user could specify without the app crashing with CL_OUT_OF_HOST_MEMORY (or spilling into uselessness).

Thanks for reading,

    sonicx


Private memory usage is unrelated to the input data size. We only report private memory when you use private arrays or there is register spilling; registers themselves do not qualify as private memory. If you want to decrease your private memory usage on a per-kernel basis, you should use reqd_work_group_size and specify at compile time what you will launch at; otherwise the compiler chooses the default.

Thanks for your reply.

I am in fact using private arrays to store matrices in, and the compiler tells me it will spill registers. I am aware that the overall input size has nothing to do with it; the local work size, however, I think does.

The size of that array is set by a constant, which is set by a compile-time kernel option. In this case it is a kriging-interpolator kernel: the user can set the number of neighbours to include in the interpolation, and that number defines the size of the matrices used in the kernel. Depending on which GPU the user has, the kernel will spill from a certain matrix size on. From that point on the whole thing is quite useless, because it gets really slow.

Now, my understanding is that I have a maximum amount of private memory X available. Depending on my local work size, each of the local work items has X/localWorkSize private memory (assuming I want them all to have an equal share). If my kernel is set up so that the private arrays each work item has at its disposal are larger than X/localWorkSize, the compiler gives me a warning that it will spill registers and be slower.

On my FirePro V7800 cards, for example, 11 neighbours will work and 12 will result in that spilling message. On other cards I have different limits. I want to know that limit so I can tell the user about it, so the slowness of that spilling can be avoided.

My plan was to set my matrix sizes, compile the kernel with the private arrays sized accordingly, and check how many bytes of private memory a single work item running that compiled kernel would take. Then I would check how many private bytes I can use on that card and basically just divide, to get the number of work items I could have at once. I would then use that info to build the localWorkSize: not thinking about how many work items I have to process in total, just how big I can set my localWorkSize and still have no register spilling.

I attached the part of the kernel where I set up my private arrays. max_neighbours is defined at compile time by the user, as said above.

I have a lot of kernels like that, and I would like to have a solid system which allows my users to just use any GPU with any of the kernels, without having to manually calculate which parameters would work with their setup and which would result in spilling, or, even worse, having to iterate through possible parameters until they have found the maximum their hardware can handle.

By now I think I somehow got the whole concept of localWorkSize wrong, but as I have nobody else to ask, here I am.

PS: Even with private arrays so big that register spilling will happen, I don't get results from CL_KERNEL_PRIVATE_MEM_SIZE.

PPS: Running 11.11/2.5.

__private float4 nearest[max_neighbours + 1];
for (unsigned short i = 0; i < max_neighbours + 1; i++)
    nearest[i] = (float4)(0, 0, 0, -100);
const unsigned short dim = (max_neighbours + 1);
__private float tmp[(max_neighbours + 1) * (max_neighbours + 1) * 2];


*sigh* So I made a tiny test kernel to play around and investigate my problem. Now I see that the local work-group size has no influence on when the spilling happens. I found the point where the test kernel spills, but whatever reqd_work_group_size I set (or set on the C++ side), it doesn't change. Sadly that doesn't solve my problem. I was under the impression that the local work-group size is the number of items processed in parallel, or at least corresponds to that. Wrong I was, it seems.

I have understood now that my thoughts were wrong in that the compiler tells me the spill will happen before I have set any work-group size at all (assuming I don't use reqd_work_group_size). So the work-group size can't prevent my spilling problem.

But I still don't understand how I would know how many private bytes per work item I can use before spilling happens.

#include "kernel_include.h"

//#define SIZE 483 // Will not spill
#define SIZE 484   // Will spill
// How to know the max SIZE for a GPU without brute-force trying?
// The specified wgs doesn't change a thing about the above limit, whether insanely high or low.
__kernel __attribute__((reqd_work_group_size(4, 4, 4)))
void test(__global float *attr, __global int *value, __write_only image2d_t img)
{
    INIT;
    __private float4 a[SIZE];
    // Do something so our array won't get optimized away
    for (int i = 0; i < SIZE; i++)
        a[i].x = i;
    atom_add(value, (int)a[1].x);
    return;
}


What is 'INIT'?

Basically, your private array is large enough that it takes up 121 registers at a size of 484 (the compiler is optimizing away the unused y/z/w components, so (484 * 4) / 16), and this is pushing you over the limit that a single wavefront can utilize without spilling, because registers are still required for address calculations.

The compiler can move a private array into registers IF there are registers to use, but a single wavefront is limited to ~124-128 registers depending on the chip; some chips might give you a few more and some a lot less, but usually it is in that range.

So while wgs determines how many registers you are allowed, your private array is exceeding that limit.

My algorithm requires a scratch buffer per work item. The max buffer size can be determined up front before each enqueue, and one enqueue is needed for each level of recursion. I know the registers will spill, and it is what it is. Should I let it spill, or should I explicitly store the scratch buffer somewhere? I assume performance is king for this discussion. Do arrays in private memory have to be fixed at compile time?

Thanks,


NURBS,
I would highly recommend re-designing your algorithm to use local/global memory instead of scratch.

Thanks for your quick response. Would using global memory be the same as letting it spill out of private memory? I thought global memory is used when registers spill?

What happens when local memory spills?


NURBS,
You cannot spill local memory, as it is something allocated by the program; if you allocate too much, compilation does not succeed. Global memory and scratch are both device memory, but global memory can be cached while scratch memory is not, so it could be quite a bit faster.

Thanks again. "Device memory" is what I should have said (brain froze).

Is there a recommended strategy for designing the kernel to make caching more effective with global memory?


Read the memory section of our programming guide. It should have all of the information you need there.


I had the same problem on an NVIDIA card:

using an array => private memory reported

using plain registers => zero private memory reported (no spill)

CL_KERNEL_WORK_GROUP_SIZE allows automatic tuning, but I must not compile with the reqd_work_group_size attribute, since that would have CL_KERNEL_WORK_GROUP_SIZE raised to this value (provided local memory is not exhausted) and spilling forced.

On NVIDIA, clGetDeviceInfo(... CL_DEVICE_REGISTERS_PER_BLOCK_NV ...) gives the size of the register file (on AMD it is 64*256*(32 bits * 4) AFAIK), but on both GPUs I have understood that the register addressing allows only 128 registers per thread (and a handful of them contain group_id, local_id, constant kernel args, ...).

The only portable way I found (tested on 5 models of NVIDIA cards and 1 model of AMD card) was to start from the group size given by CL_DEVICE_MAX_WORK_GROUP_SIZE, compile without the reqd_work_group_size attribute, and check CL_KERNEL_WORK_GROUP_SIZE. If it is below the tested group size, lower the tested size (depending on your code constraints, but NOT all the way down to the value returned by CL_KERNEL_WORK_GROUP_SIZE, otherwise you'll end up with a too-small group size) and go on until CL_KERNEL_WORK_GROUP_SIZE returns a value >= your tested value.

It is tedious to program and slow to compile, so I suggest that CL_DEVICE_REGISTERS_PER_BLOCK and something like CL_DEVICE_REGISTERS_PER_THREAD be required in forthcoming OpenCL specifications (just to have a good estimate of register availability to start with).
