OpenCL

jinchuantang · ‎05-29-2023

Dear Dipak,

Recently, I have been producing tuning results on AMD/NVIDIA/Intel/Apple GPUs for the CLBlast library (known for its fast GEMM) so that the community can get the best performance without further tuning. However, with the latest released drivers (dated 2023/5/24), the 5700G APU (16 GB shared memory) and the RX 580 8GB 2048SP GPU freeze in some tuning cases. Tuning here means that CLBlast compiles its GEMM and other BLAS level 1/2/3/x kernels with different parameters and measures the performance of each configuration in terms of GFLOPS and so forth.
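
To make the GFLOPS figure concrete, here is a minimal sketch of how a GEMM timing is conventionally converted into GFLOPS. It assumes the usual operation counts of 2*m*n*k for real GEMM and 8*m*n*k for complex GEMM. The function name and the matrix sizes are hypothetical, and whether the CLBlast tuner reports its numbers in exactly this way is an assumption.

```python
def gemm_gflops(m, n, k, time_ms, complex_values=False):
    """Convert one GEMM timing into GFLOPS (illustrative helper, not CLBlast code)."""
    # A real multiply-add counts as 2 floating-point operations.
    # A complex multiply-add counts as 8 (4 multiplies + 4 additions).
    flops_per_multiply_add = 8 if complex_values else 2
    total_flops = flops_per_multiply_add * m * n * k
    seconds = time_ms * 1e-3
    return total_flops / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical example: a 1024 x 1024 x 1024 single-precision GEMM in 10 ms.
    print(round(gemm_gflops(1024, 1024, 1024, 10.0), 2))
```

Because the time is in the denominator, a configuration that reports an unrealistically short time also reports an unrealistically high GFLOPS value.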

On the 5700G APU, the following command freezes while testing different parameters and cannot continue through all the parameter cases (3232 means complex single-precision numbers):

clblast_tuner_xgemm.exe -precision 3232

It also freezes and then produces some erratic results for the final cases when executing (6464 means complex double-precision numbers):

clblast_tuner_xgemm.exe -precision 6464

On the RX 580 GPU, the following command freezes and cannot continue to produce results for all the parameter settings:

clblast_tuner_xgemm.exe -precision 6464

Regardless, I have still reported the results to CLBlast so that it can use the available tuning results in a future release:

New tuning results · Issue #1 · CNugteren/CLBlast (github.com)

In the meantime, to reproduce the results:

1. Go to "github.com/CNugteren/CLBlast/releases/tag/1.6.0" and download CLBlast-1.6.0-windows-x64.7z

2. Copy clblast.dll from the lib directory to the bin directory

3. Execute the "clblast_tuner_xgemm.exe -precision 3232" command in the command line (not PowerShell).

In addition, the following are all the test cases, in case you would like to run all of them (you can create a .bat file and put the commands inside):

Best wishes,

Jinchuan Tang

bat file:

clblast_tuner_copy_fast.exe -precision 32
clblast_tuner_copy_fast.exe -precision 64
clblast_tuner_copy_fast.exe -precision 3232
clblast_tuner_copy_fast.exe -precision 6464
clblast_tuner_copy_fast.exe -precision 16
clblast_tuner_copy_pad.exe -precision 32
clblast_tuner_copy_pad.exe -precision 64
clblast_tuner_copy_pad.exe -precision 3232
clblast_tuner_copy_pad.exe -precision 6464
clblast_tuner_copy_pad.exe -precision 16
clblast_tuner_transpose_fast.exe -precision 32
clblast_tuner_transpose_fast.exe -precision 64
clblast_tuner_transpose_fast.exe -precision 3232
clblast_tuner_transpose_fast.exe -precision 6464
clblast_tuner_transpose_fast.exe -precision 16
clblast_tuner_transpose_pad.exe -precision 32
clblast_tuner_transpose_pad.exe -precision 64
clblast_tuner_transpose_pad.exe -precision 3232
clblast_tuner_transpose_pad.exe -precision 6464
clblast_tuner_transpose_pad.exe -precision 16
clblast_tuner_xaxpy.exe -precision 32
clblast_tuner_xaxpy.exe -precision 64
clblast_tuner_xaxpy.exe -precision 3232
clblast_tuner_xaxpy.exe -precision 6464
clblast_tuner_xaxpy.exe -precision 16
clblast_tuner_xdot.exe -precision 32
clblast_tuner_xdot.exe -precision 64
clblast_tuner_xdot.exe -precision 3232
clblast_tuner_xdot.exe -precision 6464
clblast_tuner_xdot.exe -precision 16
clblast_tuner_xger.exe -precision 32
clblast_tuner_xger.exe -precision 64
clblast_tuner_xger.exe -precision 3232
clblast_tuner_xger.exe -precision 6464
clblast_tuner_xger.exe -precision 16
clblast_tuner_xgemm.exe -precision 32
clblast_tuner_xgemm.exe -precision 64
clblast_tuner_xgemm.exe -precision 3232
clblast_tuner_xgemm.exe -precision 6464
clblast_tuner_xgemm.exe -precision 16
clblast_tuner_xgemm_direct.exe -precision 32
clblast_tuner_xgemm_direct.exe -precision 64
clblast_tuner_xgemm_direct.exe -precision 3232
clblast_tuner_xgemm_direct.exe -precision 6464
clblast_tuner_xgemm_direct.exe -precision 16
clblast_tuner_xgemv.exe -precision 32
clblast_tuner_xgemv.exe -precision 64
clblast_tuner_xgemv.exe -precision 3232
clblast_tuner_xgemv.exe -precision 6464
clblast_tuner_xgemv.exe -precision 16
clblast_tuner_invert.exe -precision 32
clblast_tuner_invert.exe -precision 64
clblast_tuner_invert.exe -precision 3232
clblast_tuner_invert.exe -precision 6464
clblast_tuner_invert.exe -precision 16
clblast_tuner_routine_xgemm.exe -precision 32
clblast_tuner_routine_xgemm.exe -precision 64
clblast_tuner_routine_xgemm.exe -precision 3232
clblast_tuner_routine_xgemm.exe -precision 6464
clblast_tuner_routine_xgemm.exe -precision 16
clblast_tuner_routine_xtrsv.exe -precision 32
clblast_tuner_routine_xtrsv.exe -precision 64
clblast_tuner_routine_xtrsv.exe -precision 3232
clblast_tuner_routine_xtrsv.exe -precision 6464
clblast_tuner_routine_xtrsv.exe -precision 16
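
The list above is every tuner combined with every precision code, so it can also be generated rather than typed by hand. The following is a minimal sketch under that assumption; the function names and the output file name `run_all_tuners.bat` are hypothetical, while the tuner names and precision codes are taken from the list above.

```python
# Tuner executables (without the "clblast_tuner_" prefix) in the order listed above.
TUNERS = [
    "copy_fast", "copy_pad", "transpose_fast", "transpose_pad",
    "xaxpy", "xdot", "xger", "xgemm", "xgemm_direct", "xgemv",
    "invert", "routine_xgemm", "routine_xtrsv",
]

# Precision codes in the order listed above
# (3232 = complex single, 6464 = complex double).
PRECISIONS = [32, 64, 3232, 6464, 16]


def tuner_commands(tuners=TUNERS, precisions=PRECISIONS):
    """Return one command line per (tuner, precision) pair."""
    return [
        f"clblast_tuner_{tuner}.exe -precision {precision}"
        for tuner in tuners
        for precision in precisions
    ]


def write_bat(path="run_all_tuners.bat"):
    """Write the commands to a batch file with Windows line endings."""
    with open(path, "w", newline="\r\n") as handle:
        handle.write("\n".join(tuner_commands()) + "\n")
```

Calling `write_bat()` from the bin directory writes the batch file there. To rerun only the failing cases, pass a shorter list, for example `tuner_commands(["xgemm"], [3232, 6464])`.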

jinchuantang · ‎05-29-2023

In case the clinfo output is needed: I am currently using the 5700G APU, so I have taken out the RX 580 GPU:

PS C:\Users\lenovo> clinfo
Number of platforms: 3
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3444.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 D3D12 Implementation
Platform Name: OpenCLOn12
Platform Vendor: Microsoft
Platform Extensions: cl_khr_icd cl_khr_extended_versioning cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_il_program cl_khr_3d_image_writes cl_khr_gl_sharing cl_khr_gl_event
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 3.0 WINDOWS
Platform Name: Intel(R) OpenCL
Platform Vendor: Intel(R) Corporation
Platform Extensions: cl_khr_spirv_linkonce_odr cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_khr_il_program cl_intel_unified_shared_memory_preview cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_spirv_subgroups cl_intel_required_subgroup_size cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon(TM) Graphics
Device Topology: PCI[ B#8, D#0, F#0 ]
Max compute units: 8
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 2000Mhz
Address bits: 64
Max memory allocation: 21664353484
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 42667343872
Constant buffer size: 21664353484
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 189517004
Max global variable size: 19497917952
Max global variable preferred total size: 42667343872
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00007FF819CE8000
Name: gfx90c
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3444.0 (PAL,HSAIL)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (3444.0)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_liquid_flash cl_amd_copy_buffer_p2p cl_amd_planar_yuv

Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon(TM) Graphics
Device Topology: PCI[ B#8, D#0, F#0 ]
Max compute units: 8
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 2000Mhz
Address bits: 64
Max memory allocation: 21664353484
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 42667343872
Constant buffer size: 21664353484
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 189517004
Max global variable size: 19497917952
Max global variable preferred total size: 42667343872
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00007FF819CE8000
Name: gfx90c
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3444.0 (PAL,HSAIL)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (3444.0)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_liquid_flash cl_amd_copy_buffer_p2p cl_amd_planar_yuv

Platform Name: OpenCLOn12
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Max compute units: 1
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 12Mhz
Address bits: 64
Max memory allocation: 1073741824
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 1024
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 34252650496
Constant buffer size: 65536
Max number of constant args: 15
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 80
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 000001F96587BB10
Name: AMD Radeon(TM) Graphics
Vendor: Microsoft
Device OpenCL C version: OpenCL C 1.2
Driver version: 1.1.0
Profile: FULL_PROFILE
Version: OpenCL 1.2 D3D12 Implementation
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_il_program cl_khr_3d_image_writes cl_khr_gl_sharing cl_khr_gl_event

Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1414h
Max compute units: 1
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 12Mhz
Address bits: 64
Max memory allocation: 1073741824
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 1024
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 34252650496
Constant buffer size: 65536
Max number of constant args: 15
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 80
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 000001F96587BB10
Name: Microsoft Basic Render Driver
Vendor: Microsoft
Device OpenCL C version: OpenCL C 1.2
Driver version: 1.1.0
Profile: FULL_PROFILE
Version: OpenCL 1.2 D3D12 Implementation
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_il_program cl_khr_3d_image_writes cl_khr_gl_sharing cl_khr_gl_event

Platform Name: Intel(R) OpenCL
Number of devices: 1
Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 8086h
Max compute units: 16
Max work items dimensions: 3
Max work items[0]: 8192
Max work items[1]: 8192
Max work items[2]: 8192
Max work group size: 8192
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 32
Native vector width short: 16
Native vector width int: 8
Native vector width long: 4
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 0Mhz
Address bits: 64
Max memory allocation: 34252650496
Image support: Yes
Max number of images read arguments: 480
Max number of images write arguments: 480
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 480
Max size of kernel argument: 3840
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: No
Cache type: Read/Write
Cache line size: 64
Cache size: 524288
Global memory size: 68505300992
Constant buffer size: 131072
Max number of constant args: 480
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16383
Max pipe packet size: 1024
Max global variable size: 65536
Max global variable preferred total size: 65536
Max read/write image args: 480
Max on device events: 4294967295
Queue on device max size: 4294967295
Max on device queues: 4294967295
Queue on device preferred size: 4294967295
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: Yes
Atomics: Yes
Preferred platform atomic alignment: 64
Preferred global atomic alignment: 64
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 128
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 100
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 000001F96785E7B8
Name: AMD Ryzen 7 5700G with Radeon Graphics
Vendor: Intel(R) Corporation
Device OpenCL C version: OpenCL C 3.0
Driver version: 2022.13.3.0.16_160000
Profile: FULL_PROFILE
Version: OpenCL 3.0 (Build 0)
Extensions: cl_khr_spirv_linkonce_odr cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_khr_il_program cl_intel_unified_shared_memory_preview cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_spirv_subgroups cl_intel_required_subgroup_size cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer

dipak · ‎05-30-2023

Hi @jinchuantang ,

Thanks for reporting this issue.

>>On the 5700G APU, the following command freezes while testing different parameters and cannot continue through all the parameter cases (3232 means complex single-precision numbers):

>>clblast_tuner_xgemm.exe -precision 3232

It would be helpful if you could point to the related source and kernel files in the source code available here: https://github.com/CNugteren/CLBlast/releases/tag/1.6.0

Thanks.

jinchuantang · ‎05-30-2023

Dear Dipak,

Many thanks!

tuning code:
src\tuning\kernels\xgemm.cpp
src\tuning\kernels\xgemm.hpp

xgemm code:
src\routines\level3\xgemm.cpp
src\routines\level3\xgemm.hpp

Corresponding kernels, as included in the Xgemm constructor in src\routines\level3\xgemm.cpp:

// Constructor: forwards to base class constructor
template <typename T>
Xgemm<T>::Xgemm(Queue &queue, EventPointer event, const std::string &name):
    Routine(queue, event, name,
            {"Copy","Pad","Transpose","Padtranspose","Xgemm","XgemmDirect","GemmRoutine"},
            PrecisionValue<T>(), {}, {
    #include "../../kernels/level3/level3.opencl"
    #include "../../kernels/level3/copy_fast.opencl"
    #include "../../kernels/level3/copy_pad.opencl"
    #include "../../kernels/level3/transpose_fast.opencl"
    #include "../../kernels/level3/transpose_pad.opencl"
    #include "../../kernels/level3/convert_symmetric.opencl"
    #include "../../kernels/level3/convert_triangular.opencl"
    #include "../../kernels/level3/convert_hermitian.opencl"
    , // separated in multiple parts to prevent C1091 in MSVC 2013
    #include "../../kernels/level3/xgemm_direct_part1.opencl"
    #include "../../kernels/level3/xgemm_direct_part2.opencl"
    #include "../../kernels/level3/xgemm_direct_part3.opencl"
    , // separated in multiple parts to prevent C1091 in MSVC 2013
    #include "../../kernels/level3/xgemm_part1.opencl"
    #include "../../kernels/level3/xgemm_part2.opencl"
    , // separated in multiple parts to prevent C1091 in MSVC 2013
    #include "../../kernels/level3/xgemm_part3.opencl"
    #include "../../kernels/level3/xgemm_part4.opencl"
    }) {
}

dipak · ‎05-30-2023

Hi @jinchuantang ,

Thanks for the above information. I will look into it.

Thanks.

jinchuantang · ‎05-31-2023

Dear Dipak,

Thank you very much!

In the meantime, I would like to call on all readers here who have AMD GPUs to help run the tuners for CLBlast, since CLBlast is widely used from Python and many other applications to speed up AI workloads and more. For Windows users, the details are here: https://github.com/CNugteren/CLBlast/issues/1#issuecomment-1570253475 .

Best wishes,

Jinchuan Tang

dipak · ‎06-01-2023

Hi @jinchuantang ,

Just to let you know, I have filed an internal bug ticket for this issue.

Thanks.

dipak · ‎07-28-2023

Hi @jinchuantang ,

The OpenCL team tried to reproduce the issue for the following cases on a Cezanne APU. However, they did not observe any system crash or freeze. Instead, they observed a TDR (Timeout Detection and Recovery) in both cases.

clblast_tuner_xgemm.exe -precision 3232
clblast_tuner_xgemm.exe -precision 6464

I have attached the JSON files generated during the test. Could you please check the files and let us know whether this is the expected output?

Another point: from the clinfo output above, it looks like the setup has multiple platforms. As the OpenCL team has asked, could you please try the following step and share your observations?
- Disable or remove any non-AMD platform and run on the AMD platform alone with the latest Adrenalin driver (please make sure to remove any default entry that points to a non-AMD platform so that the AMD driver can be loaded).

Thanks.

jinchuantang · ‎07-29-2023

Dear Dipak,

Starting from line 205 of clblast_xgemm_12_3232.json in json-3232.zip (use VS Code to open the JSON file), I observe that the running time of a setting drops suddenly from 1705.457 to 0.054, and many later entries stay at the same level as 0.054. Therefore, I believe this is where the TDR happened. Because the setting with the shortest running time is taken as the best one, this causes us to falsely identify one of these settings as the best for matrix multiplication. Likewise, the same TDR behaviour appears from line 1725 of clblast_xgemm_11_6464.json. This may explain why there is no file called clblast_xgemm_12_6464.json: the OpenCL driver had already stopped working after recovering from the TDR.
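
A sudden drop like this can also be flagged automatically. Below is a minimal sketch of one possible heuristic: an entry is treated as suspect when its time is far below the fastest time seen so far among trusted entries. The function names and the `drop_ratio` threshold are hypothetical, and the JSON layout (a top-level "results" list whose entries carry a "time" field) is an assumption about the tuner output, so please adjust it to the actual files.

```python
import json


def find_suspect_timings(times, drop_ratio=0.01):
    """Return indices of entries that are implausibly fast.

    An entry is suspect when it is below drop_ratio times the fastest
    trusted entry seen so far. Suspect entries are never trusted, so a
    run of bogus values after a driver reset is flagged as a whole.
    """
    suspects = []
    fastest_trusted = None
    for index, value in enumerate(times):
        if fastest_trusted is not None and value < drop_ratio * fastest_trusted:
            suspects.append(index)
        elif fastest_trusted is None or value < fastest_trusted:
            fastest_trusted = value
    return suspects


def times_from_tuner_json(text):
    """Extract the timings, assuming {"results": [{"time": ...}, ...]}."""
    return [entry["time"] for entry in json.loads(text)["results"]]


if __name__ == "__main__":
    # Synthetic data modelled on the drop from 1705.457 to 0.054.
    sample = [1650.0, 1705.457, 0.054, 0.054, 0.055]
    print(find_suspect_timings(sample))
```

The heuristic trusts the first entry, so it cannot detect a run whose very first measurement is already bogus.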

With the best of intentions, I hope the OpenCL team will help solve this problem, since we have never encountered it on 32-bit Intel or Apple Silicon devices or on Nvidia GPUs (you can find the numerous Intel/Apple/AMD/Nvidia devices that the community and I have reported to CLBlast on its GitHub).

Best wishes,
Jinchuan

jinchuantang · ‎07-29-2023

For the second question, I am still running the tuner with the old 3444 OpenCL driver as well as the latest one.

jinchuantang · ‎07-29-2023

I tried both 3444 and 3570 (OpenCL 2.1 AMD-APP (3570.0)); they both could pass all the settings without triggering TDR.

jinchuantang · ‎07-29-2023

Sorry for the typo. Neither of them could pass all the settings without triggering TDR.

dipak · ‎07-31-2023

Hi @jinchuantang ,

Thanks for checking the output files and sharing your observation.

>>Tried both 3444 and 3570 OpenCL 2.1 AMD-APP (3570.0) , they both could not pass all the settings without triggering TDR.

Just to confirm, did you get this observation when using AMD platform alone?

If not, it would be helpful if you could try the step below, as suggested by the OpenCL team.

Another point: from the clinfo output above, it looks like the setup has multiple platforms. As the OpenCL team has asked, could you please try the following step and share your observations?
- Disable or remove any non-AMD platform and run on the AMD platform alone with the latest Adrenalin driver (please make sure to remove any default entry that points to a non-AMD platform so that the AMD driver can be loaded).

Thanks.

jinchuantang · ‎07-31-2023

Dear Dipak, I strictly followed the advice: only the AMD platform was present, and I double-checked this with clinfo.
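
For anyone who wants to repeat this check, it can be scripted on saved clinfo output. This is a minimal sketch that assumes the "Platform Name: ..." line format shown in the clinfo listings in this thread; the function names are hypothetical.

```python
AMD_PLATFORM = "AMD Accelerated Parallel Processing"


def platform_names(clinfo_text):
    """Return the distinct platform names found in clinfo output, in order."""
    names = []
    for line in clinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "Platform Name":
            name = value.strip()
            # clinfo prints each platform name more than once; keep one copy.
            if name not in names:
                names.append(name)
    return names


def is_amd_only(clinfo_text):
    """True when the AMD platform is the only platform listed."""
    return platform_names(clinfo_text) == [AMD_PLATFORM]


if __name__ == "__main__":
    sample = (
        "Number of platforms: 1\n"
        "Platform Name: AMD Accelerated Parallel Processing\n"
        "Platform Vendor: Advanced Micro Devices, Inc.\n"
    )
    print(platform_names(sample), is_amd_only(sample))
```

The text can be captured with `clinfo > clinfo.txt` and then read from that file.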

dipak · ‎07-31-2023

Hi @jinchuantang ,

Thanks for confirming it. I will share your observation with the OpenCL team.

Could you please provide the clinfo output for AMD platform only?

Thanks.

jinchuantang · ‎07-31-2023

Microsoft Windows [Version 10.0.22621.2070]
(c) Microsoft Corporation. All rights reserved.

C:\Users\lenovo>clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3570.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon(TM) Graphics
Device Topology: PCI[ B#8, D#0, F#0 ]
Max compute units: 8
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 2000Mhz
Address bits: 64
Max memory allocation: 21493225881
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 25823019008
Constant buffer size: 21493225881
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 18389401
Max global variable size: 19343903232
Max global variable preferred total size: 25823019008
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00007FF801A46490
Name: gfx90c
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3570.0 (PAL,HSAIL)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (3570.0)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_copy_buffer_p2p cl_amd_planar_yuv

Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon(TM) Graphics
Device Topology: PCI[ B#8, D#0, F#0 ]
Max compute units: 8
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 2000Mhz
Address bits: 64
Max memory allocation: 21493225881
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 25823019008
Constant buffer size: 21493225881
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 18389401
Max global variable size: 19343903232
Max global variable preferred total size: 25823019008
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00007FF801A46490
Name: gfx90c
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3570.0 (PAL,HSAIL)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (3570.0)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_copy_buffer_p2p cl_amd_planar_yuv

C:\Users\lenovo>

dipak · ‎07-31-2023

Thanks for sharing the clinfo output.

dipak · ‎08-02-2023

The OpenCL team has informed me that they did not observe any issue in either case when TDR was disabled. The issue has been forwarded to the base driver team for further investigation.

Meanwhile, could you please run those cases after disabling TDR and share your observation?

Thanks.

jinchuantang · ‎08-07-2023

Dear Dipak,

With the runs that have just finished, I can confirm that disabling TDR by setting TdrLevel = 0 in the Windows registry (via regedit) works for the 5700G APU. The tuner went through all the GEMM tuning cases for 3232 and 6464.
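For readers who want to reproduce this setting, the following is a minimal sketch that builds the text of a `.reg` file for the TdrLevel value. It assumes the key location documented by Microsoft for TDR settings (`HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers`) and the documented levels 0 to 3, where 0 turns timeout detection off and 3 is the default recovery behaviour. The helper name `tdr_reg_file` is hypothetical, and the script only produces text; it does not touch the registry itself.

```python
# Registry key that holds the TDR settings, per Microsoft's TDR documentation.
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_reg_file(level):
    """Return the text of a .reg file that sets TdrLevel to `level`.

    Levels: 0 = detection off, 1 = bugcheck on timeout,
    2 = recover to VGA, 3 = recover on timeout (default).
    """
    if level not in (0, 1, 2, 3):
        raise ValueError("TdrLevel must be 0, 1, 2 or 3")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + TDR_KEY + "]",
        '"TdrLevel"=dword:%08x' % level,
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the file that disables TDR, as used for the tuning runs above.
    print(tdr_reg_file(0))
```

Saving the output to a `.reg` file, importing it with administrator rights, and rebooting should apply the value. With detection off, a kernel that really hangs can freeze the display, so it is probably best to restore the default with `tdr_reg_file(3)` once the tuning runs are done.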

Best wishes,

Jinchuan

dipak · ‎08-08-2023

Hi @jinchuantang,

Thanks for sharing your observation.

As I said in my previous post, the issue has been forwarded to the base driver team for further investigation. I will let you know once I get any update on this.

Thanks.

Driver freezes and produces wrong results while using the CLBlast tuner