pengx
Journeyman III

[clAmdBlas] Some problems using clAmdBlasTune tool

Hi all,

I am trying to integrate the clAmdBlas library (in particular its GEMM routines) into OpenCV's OpenCL module.

The tune program looks like exactly what I need to improve performance. However, it crashes partway through its run. The command I use is:

bin32/clAmdBlasTune.exe --store-kernels --float --GEMM

Every time I run the program, it crashes at some percentage of the GEMM tuning process. When I start it again, it gets a little further, but then crashes as before.
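Since each restart seems to get a little further, one possible workaround is to re-run the tuner in a loop until it finishes. The sketch below is only an illustration in Python: the helper name `run_until_done` is hypothetical, and it assumes that the tuner exits with status 0 when it completes and with a nonzero status when it crashes, which the thread does not confirm.

```python
import subprocess


def run_until_done(cmd, max_attempts=20, runner=subprocess.call):
    """Re-run cmd until it exits with status 0 or the attempts run out.

    Returns the number of attempts used, or None if no run succeeded.
    Assumption: exit status 0 means the tuning finished.
    """
    for attempt in range(1, max_attempts + 1):
        status = runner(cmd)  # blocks until the process exits
        print("attempt %d exited with status %s" % (attempt, status))
        if status == 0:
            return attempt
    return None
```

It would be called with the command as a list of arguments, for example the executable path followed by the three options shown above. Note that on Windows a crash dialog may block the loop until it is dismissed.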

I am running 64-bit Windows 7, but I used the 32-bit tune program because I need 32-bit libraries for the OpenCV project.

If I run the 64-bit tune program instead, a different problem arises: a clc kernel compilation error like this:

\Users\CARLZH~1\AppData\Local\Temp\OCLEDFF.tmp.cl", line 342: error: a

       value of type "float4" cannot be assigned to an entity of type "int"

   pC[mad24(7u, ldc, 3u)] = tempC7;

                          ^

errors detected in the compilation of "C:\Users\CARLZH~1\AppData\Local\Temp\O

DFF.tmp.cl".


ernal error: clc compiler invocation failed.

Here is my clinfo output. I am on 64-bit Windows 7, with AMD APP SDK 2.7 and clAmdBlas v1.8 beta installed; there is also a second platform provided by the Intel OpenCL SDK. My graphics card is an ATI Mobility Radeon HD 5650.

Number of platforms:                             2

  Platform Profile:                              FULL_PROFILE

  Platform Version:                              OpenCL 1.1

  Platform Name:                                 Intel(R) OpenCL

  Platform Vendor:                               Intel(R) Corporation

  Platform Extensions:                           cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_immediate_execution cl_khr_gl_sharing cl_khr_icd

  Platform Profile:                              FULL_PROFILE

  Platform Version:                              OpenCL 1.2 AMD-APP (938.1)

  Platform Name:                                 AMD Accelerated Parallel Processing

  Platform Vendor:                               Advanced Micro Devices, Inc.

  Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing

  Platform Name:                                 Intel(R) OpenCL

Number of devices:                               1

  Device Type:                                   CL_DEVICE_TYPE_CPU

  Device ID:                                     32902

  Max compute units:                             4

  Max work items dimensions:                     3

    Max work items[0]:                           1024

    Max work items[1]:                           1024

    Max work items[2]:                           1024

  Max work group size:                           1024

  Preferred vector width char:                   16

  Preferred vector width short:                  8

  Preferred vector width int:                    4

  Preferred vector width long:                   2

  Preferred vector width float:                  4

  Preferred vector width double:                 2

  Native vector width char:                      16

  Native vector width short:                     8

  Native vector width int:                       4

  Native vector width long:                      2

  Native vector width float:                     4

  Native vector width double:                    2

  Max clock frequency:                           2670Mhz

  Address bits:                                  64

  Max memory allocation:                         1574616064

  Image support:                                 Yes

  Max number of images read arguments:           128

  Max number of images write arguments:          128

  Max image 2D width:                            8192

  Max image 2D height:                           8192

  Max image 3D width:                            2048

  Max image 3D height:                           2048

  Max image 3D depth:                            2048

  Max samplers within kernel:                    128

  Max size of kernel argument:                   1024

  Alignment (bits) of base address:              1024

  Minimum alignment (bytes) for any datatype:    128

  Single precision floating point capability

    Denorms:                                     Yes

    Quiet NaNs:                                  Yes

    Round to nearest even:                       Yes

    Round to zero:                               No

    Round to +ve and infinity:                   No

    IEEE754-2008 fused multiply-add:             No

  Cache type:                                    Read/Write

  Cache line size:                               64

  Cache size:                                    262144

  Global memory size:                            6298464256

  Constant buffer size:                          131072

  Max number of constant args:                   128

  Local memory type:                             Global

  Local memory size:                             32768

  Kernel Preferred work group size multiple:     128

  Error correction support:                      0

  Unified memory for Host and Device:            1

  Profiling timer resolution:                    384

  Device endianess:                              Little

  Available:                                     Yes

  Compiler available:                            Yes

  Execution capabilities:

    Execute OpenCL kernels:                      Yes

    Execute native function:                     Yes

  Queue properties:

    Out-of-Order:                                Yes

    Profiling :                                  Yes

  Platform ID:                                   00000000000683B0

  Name:                                          Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz

  Vendor:                                        Intel(R) Corporation

  Device OpenCL C version:                       OpenCL C 1.1

  Driver version:                                1.1

  Profile:                                       FULL_PROFILE

  Version:                                       OpenCL 1.1 (Build 15293.6650)

  Extensions:                                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_immediate_execution cl_khr_gl_sharing

  Platform Name:                                 AMD Accelerated Parallel Processing

Number of devices:                               2

  Device Type:                                   CL_DEVICE_TYPE_GPU

  Device ID:                                     4098

  Board name:                                    AMD Radeon HD 6500M/5600/5700 Series

  Max compute units:                             5

  Max work items dimensions:                     3

    Max work items[0]:                           256

    Max work items[1]:                           256

    Max work items[2]:                           256

  Max work group size:                           256

  Preferred vector width char:                   16

  Preferred vector width short:                  8

  Preferred vector width int:                    4

  Preferred vector width long:                   2

  Preferred vector width float:                  4

  Preferred vector width double:                 0

  Native vector width char:                      16

  Native vector width short:                     8

  Native vector width int:                       4

  Native vector width long:                      2

  Native vector width float:                     4

  Native vector width double:                    0

  Max clock frequency:                           450Mhz

  Address bits:                                  32

  Max memory allocation:                         536870912

  Image support:                                 Yes

  Max number of images read arguments:           128

  Max number of images write arguments:          8

  Max image 2D width:                            8192

  Max image 2D height:                           8192

  Max image 3D width:                            2048

  Max image 3D height:                           2048

  Max image 3D depth:                            2048

  Max samplers within kernel:                    16

  Max size of kernel argument:                   1024

  Alignment (bits) of base address:              2048

  Minimum alignment (bytes) for any datatype:    128

  Single precision floating point capability

    Denorms:                                     No

    Quiet NaNs:                                  Yes

    Round to nearest even:                       Yes

    Round to zero:                               Yes

    Round to +ve and infinity:                   Yes

    IEEE754-2008 fused multiply-add:             Yes

  Cache type:                                    None

  Cache line size:                               0

  Cache size:                                    0

  Global memory size:                            1073741824

  Constant buffer size:                          65536

  Max number of constant args:                   8

  Local memory type:                             Scratchpad

  Local memory size:                             32768

  Kernel Preferred work group size multiple:     64

  Error correction support:                      0

  Unified memory for Host and Device:            0

  Profiling timer resolution:                    1

  Device endianess:                              Little

  Available:                                     Yes

  Compiler available:                            Yes

  Execution capabilities:

    Execute OpenCL kernels:                      Yes

    Execute native function:                     No

  Queue properties:

    Out-of-Order:                                No

    Profiling :                                  Yes

  Platform ID:                                   000007FEDFF82A08

  Name:                                          Redwood

  Vendor:                                        Advanced Micro Devices, Inc.

  Device OpenCL C version:                       OpenCL C 1.2

  Driver version:                                CAL 1.4.1741 (VM)

  Profile:                                       FULL_PROFILE

  Version:                                       OpenCL 1.2 AMD-APP (938.1)

  Extensions:                                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing

  Device Type:                                   CL_DEVICE_TYPE_CPU

  Device ID:                                     4098

  Board name:

  Max compute units:                             4

  Max work items dimensions:                     3

    Max work items[0]:                           1024

    Max work items[1]:                           1024

    Max work items[2]:                           1024

  Max work group size:                           1024

  Preferred vector width char:                   16

  Preferred vector width short:                  8

  Preferred vector width int:                    4

  Preferred vector width long:                   2

  Preferred vector width float:                  4

  Preferred vector width double:                 0

  Native vector width char:                      16

  Native vector width short:                     8

  Native vector width int:                       4

  Native vector width long:                      2

  Native vector width float:                     4

  Native vector width double:                    0

  Max clock frequency:                           2660Mhz

  Address bits:                                  64

  Max memory allocation:                         2147483648

  Image support:                                 Yes

  Max number of images read arguments:           128

  Max number of images write arguments:          8

  Max image 2D width:                            8192

  Max image 2D height:                           8192

  Max image 3D width:                            2048

  Max image 3D height:                           2048

  Max image 3D depth:                            2048

  Max samplers within kernel:                    16

  Max size of kernel argument:                   4096

  Alignment (bits) of base address:              1024

  Minimum alignment (bytes) for any datatype:    128

  Single precision floating point capability

    Denorms:                                     Yes

    Quiet NaNs:                                  Yes

    Round to nearest even:                       Yes

    Round to zero:                               Yes

    Round to +ve and infinity:                   Yes

    IEEE754-2008 fused multiply-add:             Yes

  Cache type:                                    Read/Write

  Cache line size:                               64

  Cache size:                                    32768

  Global memory size:                            6298464256

  Constant buffer size:                          65536

  Max number of constant args:                   8

  Local memory type:                             Global

  Local memory size:                             32768

  Kernel Preferred work group size multiple:     1

  Error correction support:                      0

  Unified memory for Host and Device:            1

  Profiling timer resolution:                    384

  Device endianess:                              Little

  Available:                                     Yes

  Compiler available:                            Yes

  Execution capabilities:

    Execute OpenCL kernels:                      Yes

    Execute native function:                     Yes

  Queue properties:

    Out-of-Order:                                No

    Profiling :                                  Yes

  Platform ID:                                   000007FEDFF82A08

  Name:                                          Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz

  Vendor:                                        GenuineIntel

  Device OpenCL C version:                       OpenCL C 1.2

  Driver version:                                2.0 (sse2)

  Profile:                                       FULL_PROFILE

  Version:                                       OpenCL 1.2 AMD-APP (938.1)

  Extensions:                                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
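For anyone comparing dumps like the one above across machines, a small parser can reduce the output to one dictionary per device. This is a rough sketch in Python, not part of clinfo or clAmdBlas: the function name `parse_clinfo` is hypothetical, it assumes the "key: value" layout shown above, and it skips console-wrapped continuation lines that contain no colon.

```python
def parse_clinfo(text):
    """Collect 'key: value' pairs for each device in clinfo-style text.

    A new device record starts at every 'Device Type' line. Lines without
    a colon (section headings, wrapped continuations) are ignored.
    """
    devices = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key in ("Platform Name", "Number of devices"):
            current = None  # platform-level line, not part of a device
            continue
        if key == "Device Type":
            current = {}
            devices.append(current)
        if current is not None:
            current[key] = value
    return devices
```

With the dump above saved to a text file, something like `[d.get("Name") for d in parse_clinfo(text)]` should list the device names, and the same dictionaries can be checked for fields such as "Address bits" or "Max work group size".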

Thanks!

Peng

0 Likes
3 Replies
solver
Adept I

I ran clAmdBlasTune.exe on Tahiti. Unfortunately, I could not reproduce your problem. By the way, the commandline parameters passed to clAmdBlasTune.exe are case sensitive: --GEMM should be --gemm.
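If the tuner is launched from a script, the case issue can be avoided by lower-casing the options before the call. The snippet below is a minimal Python sketch; the helper name `normalize_flags` is hypothetical, and it assumes that every `--` option of the tool is spelled in lower case, which is only confirmed here for `--gemm`.

```python
def normalize_flags(args):
    """Lower-case every argument that starts with '--'.

    Other arguments (such as the executable path) are left untouched.
    Assumption: all '--' options of the tool are lower-case.
    """
    return [a.lower() if a.startswith("--") else a for a in args]
```

Applied to the command from the first post, this turns `--GEMM` into `--gemm` and leaves `--store-kernels` and `--float` as they are.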

0 Likes

solver wrote:

I ran clAmdBlasTune.exe on Tahiti. Unfortunately, I could not reproduce your problem. By the way, the commandline parameters passed to clAmdBlasTune.exe are case sensitive: --GEMM should be --gemm.

Are you saying that this program is able to complete on 64-bit Windows but not on 32-bit Windows? (Did you test both?)

I have also had several problems with the tune tool on Linux. It appears to be notoriously unreliable.

http://devgurus.amd.com/message/1281490

0 Likes

I read the thread you linked, but it does not seem to offer a solution to my problem. The engineer said they could complete the tuning on their machines. However, I have tried several machines with different architectures and specs, and none of them could reach 100%.

0 Likes