sonicx

CL_KERNEL_PRIVATE_MEM_SIZE - bug or stupidity?!

Discussion created by sonicx on Aug 15, 2011
Latest reply on Apr 20, 2012 by cantallo
CL_KERNEL_PRIVATE_MEM_SIZE returns zero, always

Hello.

I have a kernel that uses a LOT of private memory. The amount it uses depends and is not known until compilation (array-size set by macros). In order to avoid register-spilling i try to find the optimum work-group-size. To do so i need to know how many memory i can and need to use. However using getWorkGroupInfo after i compiled my kernel never works. Just returns zero. getWorkGroupInfo seems to work (returned values seem to be rounded up towards next register-size-multiple).

I am not sure if this is a bug or my stupidity, because same software on NVIDIA-GPUs has same problems. Don't think my getWorkGroupInfo calls are wrong cause CL_KERNEL_COMPILE_WORK_GROUP_SIZE ie returns good values.

Has anybody ever gotten a real value from getWorkGroupInfo or knows another way to avoid register-spilling without hardcoding hand-counted values?

EDIT: Attached info

PS: Yes, i tried putting explicit "__private", "private" or omitting the qualifier.

Querying OpenCL... Searching for OpenCL platform... Found 1 platform(s): Plaform Profile: FULL_PROFILE Plaform Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Plaform Name: AMD Accelerated Parallel Processing Plaform Vendor: Advanced Micro Devices, Inc. Plaform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices Plaform Name: AMD Accelerated Parallel Processing Number of devices: 3 Device Type: CL_DEVICE_TYPE_GPU Device ID: 4098 Max compute units: 18 Max work items dimensions: 3 Max work items[0]: 256 Max work items[1]: 256 Max work items[2]: 256 Max work group size: 256 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 700Mhz Address bits: 32 Max memory allocation: 268435456 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 8192 Max image 2D height: 8192 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 1024 Alignment (bits) of base address: 32768 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 1073741824 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 0x7fd082579800 Name: Cypress Vendor: Advanced Micro Devices, Inc. Driver version: CAL 1.4.1385 Profile: FULL_PROFILE Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt Device Type: CL_DEVICE_TYPE_GPU Device ID: 4098 Max compute units: 18 Max work items dimensions: 3 Max work items[0]: 256 Max work items[1]: 256 Max work items[2]: 256 Max work group size: 256 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 700Mhz Address bits: 32 Max memory allocation: 268435456 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 8192 Max image 2D height: 8192 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 1024 Alignment (bits) of base address: 32768 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 1073741824 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 0x7fd082579800 Name: Cypress Vendor: Advanced Micro Devices, Inc. Driver version: CAL 1.4.1385 Profile: FULL_PROFILE Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt Device Type: CL_DEVICE_TYPE_CPU Device ID: 4098 Max compute units: 8 Max work items dimensions: 3 Max work items[0]: 1024 Max work items[1]: 1024 Max work items[2]: 1024 Max work group size: 1024 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 3423Mhz Address bits: 64 Max memory allocation: 2147483648 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 8192 Max image 2D height: 8192 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 4096 Alignment (bits) of base address: 1024 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: Yes Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: No Cache type: Read/Write Cache line size: 64 Cache size: 32768 Global memory size: 8377356288 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Global Local memory size: 32768 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: Yes Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 0x7fd082579800 Name: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz Vendor: GenuineIntel Driver version: 2.0 Profile: FULL_PROFILE Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_media_ops cl_amd_popcnt cl_amd_printf

Outcomes