Hi all,
I am trying to integrate the clAmdBlas library (in particular its GEMM routine) into OpenCV's OpenCL module.
The tune program looks like exactly what I need to improve performance. However, it crashes partway through its run. The command I use is:
bin32/clAmdBlasTune.exe --store-kernels --float --GEMM
Every time I run the program, it crashes partway through the GEMM tuning process. When I start it again, the reported percentage gets a little further, and then it crashes again.
I am on 64-bit Windows 7, but I used the 32-bit tune program because I need 32-bit libraries for the OpenCV project.
If I run the 64-bit tune program instead, a different problem arises: a clc kernel compilation error like this:
\Users\CARLZH~1\AppData\Local\Temp\OCLEDFF.tmp.cl", line 342: error: a value of type "float4" cannot be assigned to an entity of type "int"
pC[mad24(7u, ldc, 3u)] = tempC7;
^
errors detected in the compilation of "C:\Users\CARLZH~1\AppData\Local\Temp\O
DFF.tmp.cl".
ernal error: clc compiler invocation failed.
Here is my clinfo output. I am on 64-bit Windows 7 with AMD APP SDK 2.7 and clAmdBlas v1.8 beta installed; a second platform, provided by the Intel OpenCL SDK, is also present. My graphics card is an ATI Mobility Radeon HD 5650.
Number of platforms: 2
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.1
Platform Name: Intel(R) OpenCL
Platform Vendor: Intel(R) Corporation
Platform Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_immediate_execution cl_khr_gl_sharing cl_khr_icd
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 AMD-APP (938.1)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
Platform Name: Intel(R) OpenCL
Number of devices: 1
Device Type: CL_DEVICE_TYPE_CPU
Device ID: 32902
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 2670Mhz
Address bits: 64
Max memory allocation: 1574616064
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 128
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 128
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: No
Cache type: Read/Write
Cache line size: 64
Cache size: 262144
Global memory size: 6298464256
Constant buffer size: 131072
Max number of constant args: 128
Local memory type: Global
Local memory size: 32768
Kernel Preferred work group size multiple: 128
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 384
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00000000000683B0
Name: Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz
Vendor: Intel(R) Corporation
Device OpenCL C version: OpenCL C 1.1
Driver version: 1.1
Profile: FULL_PROFILE
Version: OpenCL 1.1 (Build 15293.6650)
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_immediate_execution cl_khr_gl_sharing
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4098
Board name: AMD Radeon HD 6500M/5600/5700 Series
Max compute units: 5
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 0
Max clock frequency: 450Mhz
Address bits: 32
Max memory allocation: 536870912
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 1073741824
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 000007FEDFF82A08
Name: Redwood
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: CAL 1.4.1741 (VM)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (938.1)
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
Device Type: CL_DEVICE_TYPE_CPU
Device ID: 4098
Board name:
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 0
Max clock frequency: 2660Mhz
Address bits: 64
Max memory allocation: 2147483648
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 32768
Global memory size: 6298464256
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Kernel Preferred work group size multiple: 1
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 384
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 000007FEDFF82A08
Name: Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz
Vendor: GenuineIntel
Device OpenCL C version: OpenCL C 1.2
Driver version: 2.0 (sse2)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (938.1)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
Thanks!
Peng
I ran clAmdBlasTune.exe on Tahiti. Unfortunately, I could not reproduce your problem. By the way, the commandline parameters passed to clAmdBlasTune.exe are case sensitive: --GEMM should be --gemm.
solver wrote:
I ran clAmdBlasTune.exe on Tahiti. Unfortunately, I could not reproduce your problem. By the way, the commandline parameters passed to clAmdBlasTune.exe are case sensitive: --GEMM should be --gemm.
Are you saying that this program is able to complete on 64-bit Windows but not on 32-bit Windows? (Did you test both?)
I have had several problems with the tune tool on Linux as well. It appears to be notoriously unreliable.
I read the thread you posted, but it does not seem to contain a solution to my problem. The engineer said they could complete the tuning on their machines. However, I have tried several machines with different architectures and specs, and none of them reached 100%.
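Since the original post notes that each restart gets a little further before crashing, one possible stopgap (not a fix) is to restart the tuner automatically until it exits cleanly. Below is a minimal Python sketch of that idea. It rests on assumptions nobody in this thread has confirmed: that the tuner really keeps its partial results between runs, and that a crash shows up as a nonzero exit code. The helper name run_until_success and its runner parameter are made up for illustration; the command line is the one from the first post, with the lowercase --gemm suggested above.

```python
import subprocess

# Command from this thread, using the lowercase --gemm flag.
TUNE_CMD = ["bin32/clAmdBlasTune.exe", "--store-kernels", "--float", "--gemm"]


def run_until_success(cmd, max_attempts=20, runner=None):
    """Re-run cmd until it exits with code 0 or the attempts run out.

    runner is a hypothetical hook: any callable that takes the command
    list and returns an exit code. It defaults to subprocess.call.
    Returns the 1-based number of the successful attempt, or None.
    """
    if runner is None:
        runner = subprocess.call
    for attempt in range(1, max_attempts + 1):
        code = runner(cmd)
        if code == 0:
            return attempt
        # Assumption: a crash is reported as a nonzero exit code.
        print("attempt %d exited with code %d, restarting" % (attempt, code))
    return None


# Example call (commented out so the sketch has no side effects):
# run_until_success(TUNE_CMD)
```

On Windows a crash dialog may keep the process alive until it is dismissed, so the loop might still need someone to click through it. If the tuner does not actually save partial progress, this will just crash max_attempts times in a row.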