14 Replies Latest reply on Apr 20, 2012 7:10 AM by cantallo

    CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!

    sonicx
      CL_KERNEL_PRIVATE_MEM_SIZE returns zero, always

      Hello.

      I have a kernel that uses a LOT of private memory. The amount it uses varies and is not known until compile time (array sizes are set by macros). In order to avoid register spilling I try to find the optimum work-group size. To do so I need to know how much memory I can and need to use. However, querying CL_KERNEL_PRIVATE_MEM_SIZE via getWorkGroupInfo after compiling my kernel never works; it just returns zero. Other getWorkGroupInfo queries seem to work (returned values appear to be rounded up towards the next register-size multiple).

      I am not sure if this is a bug or my stupidity, because the same software has the same problem on NVIDIA GPUs. I don't think my getWorkGroupInfo calls are wrong, because e.g. CL_KERNEL_COMPILE_WORK_GROUP_SIZE returns good values.

      Has anybody ever gotten a real value from getWorkGroupInfo, or does anyone know another way to avoid register spilling without hardcoding hand-counted values?

      EDIT: Attached info

      PS: Yes, I tried using an explicit "__private" or "private" qualifier, and omitting it.

      Querying OpenCL... Searching for OpenCL platform... Found 1 platform(s):

        Platform Profile:    FULL_PROFILE
        Platform Version:    OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
        Platform Name:       AMD Accelerated Parallel Processing
        Platform Vendor:     Advanced Micro Devices, Inc.
        Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
        Platform ID:         0x7fd082579800
        Number of devices:   3

      Device 1:
        Device Type:                   CL_DEVICE_TYPE_GPU
        Name:                          Cypress
        Vendor:                        Advanced Micro Devices, Inc.
        Device ID:                     4098
        Driver version:                CAL 1.4.1385
        Profile:                       FULL_PROFILE
        Version:                       OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
        Max compute units:             18
        Max work items dimensions:     3 (256 / 256 / 256)
        Max work group size:           256
        Preferred vector widths:       char 16, short 8, int 4, long 2, float 4, double 0
        Max clock frequency:           700 MHz
        Address bits:                  32
        Max memory allocation:         268435456
        Image support:                 Yes (128 read args, 8 write args; 2D 8192x8192, 3D 2048x2048x2048)
        Max samplers within kernel:    16
        Max size of kernel argument:   1024
        Alignment (bits) of base address:            32768
        Minimum alignment (bytes) for any datatype:  128
        Single precision FP:           Denorms No, Quiet NaNs Yes, Round to nearest even Yes,
                                       Round to zero Yes, Round to +ve and infinity Yes,
                                       IEEE754-2008 fused multiply-add Yes
        Cache type:                    None (line size 0, size 0)
        Global memory size:            1073741824
        Constant buffer size:          65536 (max 8 constant args)
        Local memory:                  Scratchpad, 32768 bytes
        Profiling timer resolution:    1
        Device endianness:             Little
        Available / compiler:          Yes / Yes
        Execution capabilities:        Execute OpenCL kernels Yes, execute native functions No
        Queue properties:              Out-of-Order No, Profiling Yes
        Extensions:                    cl_amd_fp64 cl_khr_global_int32_base_atomics
                                       cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
                                       cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes
                                       cl_khr_byte_addressable_store cl_khr_gl_sharing
                                       cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops
                                       cl_amd_popcnt

      Device 2:
        Identical to Device 1 (a second Cypress GPU).

      Device 3:
        Device Type:                   CL_DEVICE_TYPE_CPU
        Name:                          Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
        Vendor:                        GenuineIntel
        Device ID:                     4098
        Driver version:                2.0
        Profile:                       FULL_PROFILE
        Version:                       OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10)
        Max compute units:             8
        Max work items dimensions:     3 (1024 / 1024 / 1024)
        Max work group size:           1024
        Preferred vector widths:       char 16, short 8, int 4, long 2, float 4, double 0
        Max clock frequency:           3423 MHz
        Address bits:                  64
        Max memory allocation:         2147483648
        Image support:                 Yes (128 read args, 8 write args; 2D 8192x8192, 3D 2048x2048x2048)
        Max samplers within kernel:    16
        Max size of kernel argument:   4096
        Alignment (bits) of base address:            1024
        Minimum alignment (bytes) for any datatype:  128
        Single precision FP:           Denorms Yes, Quiet NaNs Yes, Round to nearest even Yes,
                                       Round to zero Yes, Round to +ve and infinity Yes,
                                       IEEE754-2008 fused multiply-add No
        Cache type:                    Read/Write (line size 64, size 32768)
        Global memory size:            8377356288
        Constant buffer size:          65536 (max 8 constant args)
        Local memory:                  Global, 32768 bytes
        Profiling timer resolution:    1
        Device endianness:             Little
        Available / compiler:          Yes / Yes
        Execution capabilities:        Execute OpenCL kernels Yes, execute native functions Yes
        Queue properties:              Out-of-Order No, Profiling Yes
        Extensions:                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics
                                       cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
                                       cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics
                                       cl_khr_int64_extended_atomics cl_khr_byte_addressable_store
                                       cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query
                                       cl_amd_vec3 cl_amd_media_ops cl_amd_popcnt cl_amd_printf

        • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
          genaganna


          Originally posted by: sonicx [...] Has anybody ever gotten a real value from getWorkGroupInfo, or does anyone know another way to avoid register spilling without hardcoding hand-counted values?

          Thank you very much for reporting this issue. I am able to reproduce it and have reported it to the developers. A fix will be available in an upcoming release.

            • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
              sonicx

              Thank you, I will try the new release.

                • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
                  sonicx

                  Revive!

                  So I tried and I tried, to little avail. Currently I am using:

                  Name:                                          Cypress
                    Vendor:                                        Advanced Micro Devices, Inc.
                    Driver version:                                CAL 1.4.1546
                    Profile:                                       FULL_PROFILE
                    Version:                                       OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)


                  but I have tried every version I could get my hands on. Running kernel 3.1 seems to be a problem with the latest drivers.

                  However, can anybody tell me which version of the driver/SDK actually has working CL_KERNEL_PRIVATE_MEM_SIZE support?

                  The verbose compiler output (.il files or -cl-nv-verbose) seems to have the information I crave, so I guess OpenCL does "know" the number of private bytes a kernel needs. But parsing that information out of .il files or nv-verbose output is so hackish that I just can't put something like that in my app.

                  The OpenCL 1.1 refs seem to indicate that the local work size the implementation proposes for a kernel/device combination (CL_KERNEL_WORK_GROUP_SIZE) should take private memory consumption into account; it doesn't, however. The section reads:

                  "The OpenCL implementation uses the resource requirements of the kernel (register usage etc.) to determine what this work-group size should be."

                  But that may just be me not having understood it right. On the other hand, the refs are rather vague about the actual contents of CL_KERNEL_PRIVATE_MEM_SIZE:

                  "Returns the minimum amount of private memory, in bytes, used by each workitem in the kernel. This value may include any private memory needed by an implementation to execute the kernel, including that used by the language built-ins and variable declared inside the kernel with the __private qualifier."

                  Which to me sounds as if this value, even if it were set, would be pretty useless: the private memory consumption of a given kernel/device combination may still exceed it, which in turn means it could not be used to calculate a local work size that would prevent out-of-memory errors.

                  After all this time I'm wondering if nobody else has this problem. It seems a common task to me to, say, invert a matrix of user-specified size and actually know the maximum size a user could specify without the app crashing with OUT_OF_HOST_MEMORY (or spilling into uselessness).

                  Thanks for reading,

                      sonicx

              • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
                MicahVillmow
                Private memory usage is unrelated to the input data size. We only report private memory when you use private arrays or there is register spilling; registers themselves do not count as private memory. If you want to decrease your private memory usage on a per-kernel basis, you should use the reqd_work_group_size attribute and specify at compile time what you will launch with; otherwise the compiler chooses the default.
                  • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
                    sonicx

                    Thanks for your reply.

                    I am in fact using private arrays to store matrices in, and the compiler tells me it will spill registers. I am aware that the overall input size has nothing to do with it; the local work size, however, I think does.

                    The size of that array is set by a constant, which is set by a compile-time kernel option. In this case it's a krige-interpolator kernel: the user can set the number of neighbours to include in the interpolation, and that number defines the size of the matrices used in the kernel. Depending on which GPU the user has, the kernel will spill from a certain matrix size on. From that point on the whole thing is quite useless, because it gets really slow.

                    Now, my understanding is that I have a maximum amount of private memory X available. Depending on my local work size, each of the local work items gets X/localWorkSize private memory (assuming I want them all to have an equal share). If my kernel is set up such that the private arrays each work item has at its disposal are larger than X/localWorkSize, the compiler warns me that it will spill registers and be slower.

                    On my FirePro 7800 cards, for example, 11 neighbours will work and 12 will produce that spilling message. On other cards I have different limits. I want to know that limit so I can tell the user about it, so the slowness of that spilling can be avoided.

                    My plan was to set my matrix sizes, compile the kernel with the private arrays sized accordingly, and check how many bytes of private memory a single work item running that compiled kernel would take. Then I would check how many private bytes I can use on that card and basically just divide, to get the number of work items I could run at once. I would then use that information to build the localWorkSize: not thinking about how many work items I have to process in total, just how big I can set my localWorkSize while still having no register spilling.
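                    That division can be sketched in plain C; the helper name and the sample numbers below are hypothetical, and in the real app the per-item byte count would have to come from the CL_KERNEL_PRIVATE_MEM_SIZE query this thread is about:

                    ```c
                    #include <stddef.h>

                    /* Hypothetical helper: given the total private memory available (X)
                     * and the bytes one work item's private arrays need, return the
                     * largest local work size that still gives every work item an equal
                     * share, clamped to the device's maximum work-group size. */
                    size_t max_local_size(size_t private_bytes_total,
                                          size_t private_bytes_per_item,
                                          size_t device_max_wg_size)
                    {
                        if (private_bytes_per_item == 0)      /* nothing to divide by */
                            return device_max_wg_size;
                        size_t fit = private_bytes_total / private_bytes_per_item;
                        return fit < device_max_wg_size ? fit : device_max_wg_size;
                    }
                    ```

                    For example, 262144 bytes of private memory and 2048 bytes per work item would cap the local work size at 128 on a device whose maximum is 256.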

                    I attached the part of the kernel where I set up my private arrays; max_neighbours is defined at compile time by the user, as said above.

                    I have a lot of kernels like that, and I would like a solid system that allows my users to just use any GPU with any of the kernels, without having to manually calculate which parameters would work with their setup and which would result in spilling, or, even worse, having to iterate through possible parameters until they have found the maximum their hardware can handle.

                    By now I think I somehow got the whole concept of localWorkSize wrong, but as I have nobody else to ask, here I am.

                    PS: Even with private arrays so big that register spilling will happen, I don't get results from CL_KERNEL_PRIVATE_MEM_SIZE.

                    PPS: Running driver 11.11 / SDK 2.5.

                    __private float4 nearest[max_neighbours + 1];
                    for (unsigned short i = 0; i < max_neighbours + 1; i++)
                        nearest[i] = (float4){0, 0, 0, -100};

                    const unsigned short dim = (max_neighbours + 1);
                    __private float tmp[(max_neighbours + 1) * (max_neighbours + 1) * 2];

                      • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
                        sonicx

                        *sigh* So I made a tiny test kernel to play around and investigate my problem. Now I see that the local work-group size has no influence on when the spilling happens. I found the point where the test kernel spills, but whatever reqd_work_group_size I set (or set on the C++ side), it doesn't change. Sadly that doesn't solve my problem. I was under the impression that the local work-group size is the number of items processed in parallel, or at least corresponds to that. Wrong I was, it seems.

                        I have understood now where my thinking was flawed: the compiler tells me the spill will happen before I have set any work-group size at all (assuming I don't use reqd_work_group_size), so the wgs can't prevent my spilling problem.

                        But I still don't understand how I would know how many private bytes per work item I can use before spilling happens.

                        #include "kernel_include.h"

                        //#define SIZE 483  // Will not spill
                        #define SIZE 484    // Will spill
                        // How to know the max SIZE for a GPU without brute-force trying?
                        // The specified wgs doesn't change a thing about the above limit,
                        // whether insanely high or low.

                        __kernel __attribute__((reqd_work_group_size(4, 4, 4)))
                        void test(__global float *attr, __global int *value, __write_only image2d_t img)
                        {
                            INIT;
                            __private float4 a[SIZE];

                            // Do something so our array won't get optimized away
                            for (int i = 0; i < SIZE; i++)
                                a[i].x = i;

                            atom_add(value, (int)a[1].x);
                            return;
                        }

                    • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
                      MicahVillmow
                      What is 'INIT'?

                      Basically, your private array is large enough that it takes up 121 registers at a size of 484 (the compiler optimizes away the yzw components, so (484 * 4) / 16 = 121), and this pushes you over the limit that a single wavefront can utilize without spilling, because registers are still required for address calculations.

                      The compiler can move a private array into registers IF there are registers to use, but a single wavefront is limited to ~124-128 registers depending on the chip; some chips might give you a few more and some a lot less, but usually it's in that range.

                      So while the wgs determines how many registers you are allowed, your private array is exceeding that limit.
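                      Micah's register count can be reproduced with a one-line estimate; reading the packing as "only the x components survive, 4 bytes each, into 16-byte registers, rounded up" is my interpretation of his numbers above:

                      ```c
                      #include <stddef.h>

                      /* Registers consumed by a __private float4 array whose yzw
                       * components the compiler has optimized away: SIZE x-floats
                       * (4 bytes each), packed into 16-byte (4 x 32-bit) registers. */
                      size_t packed_register_estimate(size_t size)
                      {
                          return (size * 4 + 15) / 16;   /* round up to whole registers */
                      }
                      ```

                      At SIZE 484 this gives 121 registers, close enough to the ~124-128 wavefront budget that the extra address-calculation registers push it into spilling.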
                        • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
                          NURBS

                          My algorithm requires a scratch buffer per work item. The max buffer size can be determined up front before each enqueue, and one enqueue is needed for each level of recursion. I know the registers will spill, and it is what it is. Should I let it spill, or should I explicitly store the scratch buffer somewhere? I assume performance is king for this discussion. Do arrays in private memory have to be fixed at compile time?

                          Thanks,

                        • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
                          MicahVillmow
                          NURBS,
                          I would highly recommend re-designing your algorithm to use local/global memory instead of scratch.
                          • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
                            MicahVillmow
                            NURBS,
                              You cannot spill local memory, as it is something that is allocated by the program; if you allocate too much of it, compilation simply fails. Global memory and scratch are both device memory, but global memory can be cached while scratch memory is not, so global could be quite a bit faster.
                            • CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
                              MicahVillmow
                              Read the memory section of our programming guide. It should have all of the information you need there.

                                • Re: CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!
                                  cantallo

                                   I had the same problem on an NVidia card:

                                   using an array => private memory reported

                                   using plain registers => zero private memory reported (no spill)

                                   CL_KERNEL_WORK_GROUP_SIZE allows automatic tuning, but I must not compile with the reqd_work_group_size attribute, since that would raise CL_KERNEL_WORK_GROUP_SIZE to this value (provided local memory is not exhausted) and force spilling.

                                   On NVidia, clGetDeviceInfo(...CL_DEVICE_REGISTERS_PER_BLOCK_NV...) gives the size of the register file (on AMD it is 64*256*(32 bits*4), AFAIK), but on both GPUs I have understood that register addressing allows only 128 registers per thread (and a handful of them contain group_id, local_id, constant kernel args...).

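                                   cantallo's AMD figure works out as follows; reading the 64 as the lanes of a wavefront and a register as four 32-bit components is my interpretation of the formula, not something the post spells out:

                                   ```c
                                   #include <stddef.h>

                                   /* 64 lanes * 256 registers * (32 bits * 4) per register */
                                   size_t amd_register_file_bytes(void)
                                   {
                                       return 64u * 256u * (4u * 4u);   /* 16 bytes per register */
                                   }

                                   /* With register addressing capped at 128 registers per
                                    * thread, the per-thread ceiling in bytes: */
                                   size_t per_thread_register_bytes(void)
                                   {
                                       return 128u * 16u;
                                   }
                                   ```

                                   That is 262144 bytes (256 KiB) for the file and 2048 bytes per thread, before subtracting the registers holding group_id, local_id and kernel arguments.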

                                   The only portable way I found (tested on 5 models of NVidia cards and 1 model of AMD card) was: start from the group size given by CL_DEVICE_MAX_WORK_GROUP_SIZE, compile without the reqd_work_group_size attribute, and check CL_KERNEL_WORK_GROUP_SIZE. If it is below the tested group size, lower the tested size (depending on your code's constraints, NOT necessarily straight to the value returned by CL_KERNEL_WORK_GROUP_SIZE, otherwise you'll end up with a too-small group size) and repeat until CL_KERNEL_WORK_GROUP_SIZE returns a value >= your tested value.
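                                   The trial-and-lower procedure above can be sketched with a hypothetical callback standing in for "rebuild the program and ask clGetKernelWorkGroupInfo for CL_KERNEL_WORK_GROUP_SIZE"; the function names and the step parameter are illustrative:

                                   ```c
                                   #include <stddef.h>

                                   /* Hypothetical stand-in: recompile the kernel, then query
                                    * CL_KERNEL_WORK_GROUP_SIZE for the device under test. */
                                   typedef size_t (*query_wgs_fn)(size_t tested_size);

                                   /* Start at CL_DEVICE_MAX_WORK_GROUP_SIZE and lower the tested
                                    * size in steps (NOT straight down to the returned value, which
                                    * may be too small) until the implementation accepts it. */
                                   size_t find_group_size(size_t device_max, query_wgs_fn query,
                                                          size_t step)
                                   {
                                       size_t tested = device_max;
                                       while (tested > step && query(tested) < tested)
                                           tested -= step;   /* lower and recompile */
                                       return tested;
                                   }
                                   ```

                                   Each loop iteration implies a full recompile, which is why the post calls the approach tedious and slow.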


                                   It is tedious to program and slow to compile, so I suggest adding CL_DEVICE_REGISTERS_PER_BLOCK and something like CL_DEVICE_REGISTERS_PER_THREAD as required queries in forthcoming OpenCL specifications (just to have a good estimate of register availability to start with).