3 Replies Latest reply on Aug 17, 2012 10:04 PM by pengx

    [clAmdBlas] Some problems using clAmdBlasTune tool

    pengx

      Hi all,

       

      I am trying to integrate the clAmdBlas library (in particular the GEMM routines) into OpenCV's OpenCL module.

      The tune program looks like exactly what I need to improve performance. However, it crashes partway through its run. The command I execute is:

       

      bin32/clAmdBlasTune.exe --store-kernels --float --GEMM

       

      Every time I run the program, it crashes partway through the GEMM tuning process. When I start it again, the reported percentage gets a bit further, but it then crashes as before.

      I am on 64-bit Windows 7; however, I used the 32-bit tune program because I need 32-bit libraries for the OpenCV project.

      Alternatively, if I run the 64-bit tune program, a different problem arises: a clc kernel compilation error like this:
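
      Since each restart seems to get a little further, one stopgap is to rerun the tuner in a loop until it finishes. The following is only a rough sketch: `run_until_success` and `max_attempts` are names made up for illustration, and it assumes that a crash shows up as a non-zero exit code and that the tuner really does resume from saved progress, neither of which I have confirmed.

```python
import subprocess
import sys


def run_until_success(cmd, max_attempts=20):
    """Rerun cmd until it exits with code 0; return the exit codes seen.

    Hypothetical helper: assumes a crash is reported as a non-zero exit code.
    """
    codes = []
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        codes.append(result.returncode)
        print(f"attempt {attempt}: exit code {result.returncode}", file=sys.stderr)
        if result.returncode == 0:
            break
    return codes


# Command line copied from the post; uncomment to run the tuner itself.
# run_until_success(["bin32/clAmdBlasTune.exe", "--store-kernels", "--float", "--GEMM"])
```

      This does not address the crash itself; it only automates the manual restarts, and the printed exit codes may help when reporting the problem.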

      \Users\CARLZH~1\AppData\Local\Temp\OCLEDFF.tmp.cl", line 342: error: a

             value of type "float4" cannot be assigned to an entity of type "int"

         pC[mad24(7u, ldc, 3u)] = tempC7;

                                ^

       

       

      errors detected in the compilation of "C:\Users\CARLZH~1\AppData\Local\Temp\O

      DFF.tmp.cl".


      ernal error: clc compiler invocation failed.

       

      Here is my clinfo output. I am on 64-bit Windows 7, with AMD APP SDK 2.7 and clAmdBlas v1.8 beta installed; there is also a second platform provided by the Intel OpenCL SDK. My graphics card is an ATI Mobility Radeon HD 5650.

       


      Number of platforms:                             2

        Platform Profile:                              FULL_PROFILE

        Platform Version:                              OpenCL 1.1

        Platform Name:                                 Intel(R) OpenCL

        Platform Vendor:                               Intel(R) Corporation

        Platform Extensions:                           cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_immediate_execution cl_khr_gl_sharing cl_khr_icd

        Platform Profile:                              FULL_PROFILE

        Platform Version:                              OpenCL 1.2 AMD-APP (938.1)

        Platform Name:                                 AMD Accelerated Parallel Processing

        Platform Vendor:                               Advanced Micro Devices, Inc.

        Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing

       

       

       

       

        Platform Name:                                 Intel(R) OpenCL

      Number of devices:                               1

        Device Type:                                   CL_DEVICE_TYPE_CPU

        Device ID:                                     32902

        Max compute units:                             4

        Max work items dimensions:                     3

          Max work items[0]:                           1024

          Max work items[1]:                           1024

          Max work items[2]:                           1024

        Max work group size:                           1024

        Preferred vector width char:                   16

        Preferred vector width short:                  8

        Preferred vector width int:                    4

        Preferred vector width long:                   2

        Preferred vector width float:                  4

        Preferred vector width double:                 2

        Native vector width char:                      16

        Native vector width short:                     8

        Native vector width int:                       4

        Native vector width long:                      2

        Native vector width float:                     4

        Native vector width double:                    2

        Max clock frequency:                           2670Mhz

        Address bits:                                  64

        Max memory allocation:                         1574616064

        Image support:                                 Yes

        Max number of images read arguments:           128

        Max number of images write arguments:          128

        Max image 2D width:                            8192

        Max image 2D height:                           8192

        Max image 3D width:                            2048

        Max image 3D height:                           2048

        Max image 3D depth:                            2048

        Max samplers within kernel:                    128

        Max size of kernel argument:                   1024

        Alignment (bits) of base address:              1024

        Minimum alignment (bytes) for any datatype:    128

        Single precision floating point capability

          Denorms:                                     Yes

          Quiet NaNs:                                  Yes

          Round to nearest even:                       Yes

          Round to zero:                               No

          Round to +ve and infinity:                   No

          IEEE754-2008 fused multiply-add:             No

        Cache type:                                    Read/Write

        Cache line size:                               64

        Cache size:                                    262144

        Global memory size:                            6298464256

        Constant buffer size:                          131072

        Max number of constant args:                   128

        Local memory type:                             Global

        Local memory size:                             32768

        Kernel Preferred work group size multiple:     128

        Error correction support:                      0

        Unified memory for Host and Device:            1

        Profiling timer resolution:                    384

        Device endianess:                              Little

        Available:                                     Yes

        Compiler available:                            Yes

        Execution capabilities:

          Execute OpenCL kernels:                      Yes

          Execute native function:                     Yes

        Queue properties:

          Out-of-Order:                                Yes

          Profiling :                                  Yes

        Platform ID:                                   00000000000683B0

        Name:                                          Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz

        Vendor:                                        Intel(R) Corporation

        Device OpenCL C version:                       OpenCL C 1.1

        Driver version:                                1.1

        Profile:                                       FULL_PROFILE

        Version:                                       OpenCL 1.1 (Build 15293.6650)

        Extensions:                                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_immediate_execution cl_khr_gl_sharing

       

       

       

       

        Platform Name:                                 AMD Accelerated Parallel Processing

      Number of devices:                               2

        Device Type:                                   CL_DEVICE_TYPE_GPU

        Device ID:                                     4098

        Board name:                                    AMD Radeon HD 6500M/5600/5700 Series

        Max compute units:                             5

        Max work items dimensions:                     3

          Max work items[0]:                           256

          Max work items[1]:                           256

          Max work items[2]:                           256

        Max work group size:                           256

        Preferred vector width char:                   16

        Preferred vector width short:                  8

        Preferred vector width int:                    4

        Preferred vector width long:                   2

        Preferred vector width float:                  4

        Preferred vector width double:                 0

        Native vector width char:                      16

        Native vector width short:                     8

        Native vector width int:                       4

        Native vector width long:                      2

        Native vector width float:                     4

        Native vector width double:                    0

        Max clock frequency:                           450Mhz

        Address bits:                                  32

        Max memory allocation:                         536870912

        Image support:                                 Yes

        Max number of images read arguments:           128

        Max number of images write arguments:          8

        Max image 2D width:                            8192

        Max image 2D height:                           8192

        Max image 3D width:                            2048

        Max image 3D height:                           2048

        Max image 3D depth:                            2048

        Max samplers within kernel:                    16

        Max size of kernel argument:                   1024

        Alignment (bits) of base address:              2048

        Minimum alignment (bytes) for any datatype:    128

        Single precision floating point capability

          Denorms:                                     No

          Quiet NaNs:                                  Yes

          Round to nearest even:                       Yes

          Round to zero:                               Yes

          Round to +ve and infinity:                   Yes

          IEEE754-2008 fused multiply-add:             Yes

        Cache type:                                    None

        Cache line size:                               0

        Cache size:                                    0

        Global memory size:                            1073741824

        Constant buffer size:                          65536

        Max number of constant args:                   8

        Local memory type:                             Scratchpad

        Local memory size:                             32768

        Kernel Preferred work group size multiple:     64

        Error correction support:                      0

        Unified memory for Host and Device:            0

        Profiling timer resolution:                    1

        Device endianess:                              Little

        Available:                                     Yes

        Compiler available:                            Yes

        Execution capabilities:

          Execute OpenCL kernels:                      Yes

          Execute native function:                     No

        Queue properties:

          Out-of-Order:                                No

          Profiling :                                  Yes

        Platform ID:                                   000007FEDFF82A08

        Name:                                          Redwood

        Vendor:                                        Advanced Micro Devices, Inc.

        Device OpenCL C version:                       OpenCL C 1.2

        Driver version:                                CAL 1.4.1741 (VM)

        Profile:                                       FULL_PROFILE

        Version:                                       OpenCL 1.2 AMD-APP (938.1)

        Extensions:                                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing

       

       

       

       

        Device Type:                                   CL_DEVICE_TYPE_CPU

        Device ID:                                     4098

        Board name:

        Max compute units:                             4

        Max work items dimensions:                     3

          Max work items[0]:                           1024

          Max work items[1]:                           1024

          Max work items[2]:                           1024

        Max work group size:                           1024

        Preferred vector width char:                   16

        Preferred vector width short:                  8

        Preferred vector width int:                    4

        Preferred vector width long:                   2

        Preferred vector width float:                  4

        Preferred vector width double:                 0

        Native vector width char:                      16

        Native vector width short:                     8

        Native vector width int:                       4

        Native vector width long:                      2

        Native vector width float:                     4

        Native vector width double:                    0

        Max clock frequency:                           2660Mhz

        Address bits:                                  64

        Max memory allocation:                         2147483648

        Image support:                                 Yes

        Max number of images read arguments:           128

        Max number of images write arguments:          8

        Max image 2D width:                            8192

        Max image 2D height:                           8192

        Max image 3D width:                            2048

        Max image 3D height:                           2048

        Max image 3D depth:                            2048

        Max samplers within kernel:                    16

        Max size of kernel argument:                   4096

        Alignment (bits) of base address:              1024

        Minimum alignment (bytes) for any datatype:    128

        Single precision floating point capability

          Denorms:                                     Yes

          Quiet NaNs:                                  Yes

          Round to nearest even:                       Yes

          Round to zero:                               Yes

          Round to +ve and infinity:                   Yes

          IEEE754-2008 fused multiply-add:             Yes

        Cache type:                                    Read/Write

        Cache line size:                               64

        Cache size:                                    32768

        Global memory size:                            6298464256

        Constant buffer size:                          65536

        Max number of constant args:                   8

        Local memory type:                             Global

        Local memory size:                             32768

        Kernel Preferred work group size multiple:     1

        Error correction support:                      0

        Unified memory for Host and Device:            1

        Profiling timer resolution:                    384

        Device endianess:                              Little

        Available:                                     Yes

        Compiler available:                            Yes

        Execution capabilities:

          Execute OpenCL kernels:                      Yes

          Execute native function:                     Yes

        Queue properties:

          Out-of-Order:                                No

          Profiling :                                  Yes

        Platform ID:                                   000007FEDFF82A08

        Name:                                          Intel(R) Core(TM) i5 CPU M 480 @ 2.67GHz

        Vendor:                                        GenuineIntel

        Device OpenCL C version:                       OpenCL C 1.2

        Driver version:                                2.0 (sse2)

        Profile:                                       FULL_PROFILE

        Version:                                       OpenCL 1.2 AMD-APP (938.1)

        Extensions:                                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
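
      The listing above is long, so here is a small sketch that boils saved clinfo text down to one line per device, which makes it easier to see that this machine exposes three devices across the two platforms. `list_devices` is a hypothetical helper, not part of clinfo or clAmdBlas, and it assumes the "Key: value" layout shown above with each device's "Device Type" line appearing before its "Name" line.

```python
import re


def list_devices(clinfo_text):
    """Return (device type, device name) pairs found in clinfo-style text.

    Hypothetical helper: pairs each "Device Type" line with the next "Name" line.
    """
    devices = []
    current_type = None
    for line in clinfo_text.splitlines():
        # Match "  Key:    value"; lines without a value are skipped.
        match = re.match(r"\s*([^:]+?):\s+(.*\S)\s*$", line)
        if not match:
            continue
        key, value = match.group(1), match.group(2)
        if key == "Device Type":
            current_type = value
        elif key == "Name" and current_type is not None:
            devices.append((current_type, value))
            current_type = None
    return devices
```

      Lines such as "Platform Name" are ignored because only the exact key "Name" is treated as a device name.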

      Thanks!

      Peng