OpenCL

jinchuantang
Adept II

Whitelist request and A potential OpenCL driver problem with AMD 5700G APUs

Dear ROCm team and other OpenCL experts,

I need your help with using your APUs with OpenCL. Today I set up a new platform based on AMD's 5700G APU, intending to use its OpenCL functionality with Octave and Octave-ocl (https://sourceforge.net/projects/octave-ocl/files/). Octave is well known as an open-source alternative to Matlab, and Octave-ocl provides gpuArray-like functionality in Octave on OpenCL devices, whereas Matlab's gpuArray supports CUDA only. I have been a minor contributor to Octave-ocl through testing and coding, and I maintain my own fork (https://sourceforge.net/u/tangjinchuan/octave-ocl-gzu/ci/default/tree/). However, the problem I face with AMD's APU is beyond what I have time to solve on my own.

The problem is that any time I try to use AMD's platform driver (for example, OpenCL 2.1 AMD-APP (3302.6)), the Octave program will crash.

You can reproduce this by installing Octave 6.3 (choose an installation directory without any spaces in its path, otherwise there will be problems) and then installing the Octave-ocl tar.gz file with pkg install.

For example:

>>pkg install ocl-1.1.1.tar.gz

% this will install the ocl pkg in octave

>>pkg load ocl

% this will load ocl pkg

>>gpuArray(1,5)

% this will generate a GPU array with 1 row and 5 columns of ones, but it makes the Octave program crash.

My experience:
I had no problems with ocl on Nvidia GPUs (e.g. RTX 3080) or on Intel CPUs/GPUs, but the 5700G crashes. (I also remember that, a very long time ago, I borrowed an older AMD APU and it crashed the program too.) For more info on different drivers, please see my page: https://sourceforge.net/u/tangjinchuan/wiki/browse_pages/

I had no problem with the 5700G's CPU cores (yes, the CPU part, even though AMD dropped Windows OpenCL support for CPUs) when I choose Intel's OpenCL runtime to run OpenCL tasks.

I had no problem with the 5700G's GPU when I choose Microsoft's OpenCL on DX12 runtime to run 32-bit OpenCL tasks (unfortunately, it does not support 64-bit).

To select the OpenCL driver, we can use >> ocl_context ("device_selection", 'GPU0')

GPUn represents the nth GPU driver, while ocl_context ("device_selection", 'CPU') asks to run on the CPU. Also you can use

I tried to track down the problem from the Windows cmd, which gives the following results:

Microsoft Windows [Version 10.0.19043.1202]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Owner>octave
'octave' is not recognized as an internal or external command,
operable program or batch file.

C:\Users\Owner>cd C:\Octave-6.3.0\mingw64\bin

C:\Octave-6.3.0\mingw64\bin>octave-cli-6.3.0.exe
GNU Octave, version 6.3.0
Copyright (C) 2021 The Octave Project Developers.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. For details, type 'warranty'.

Octave was configured for "x86_64-w64-mingw32".

Additional information about Octave is available at https://www.octave.org.

Please contribute if you find this software useful.
For more information, visit https://www.octave.org/get-involved.html

Read https://www.octave.org/bugs.html to learn how to submit bug reports.
For information about changes from previous versions, type 'news'.

octave:1> pkd load ocl
error: 'pkd' undefined near line 1, column 1
octave:2> pkg load ocl
octave:3> ocl_ones(1,5)
error: Invalid record (Producer: 'LLVM3.9.0svn' Reader: 'LLVM 3.9.0svn')

C:\Octave-6.3.0\mingw64\bin>octave-cli-6.3.0.exe

Also, here is my platform info as reported by Octave-ocl:

>> ocl_context('get_resources')
ans =

scalar structure containing the fields:

platforms =
{
[1,1] =

scalar structure containing the fields:

platform_index = 0
name = AMD Accelerated Parallel Processing
version = OpenCL 2.1 AMD-APP (3302.6)
profile = FULL_PROFILE
vendor = Advanced Micro Devices, Inc.
extensions = cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices

[2,1] =

scalar structure containing the fields:

platform_index = 1
name = OpenCLOn12
version = OpenCL 1.2 D3D12 Implementation
profile = FULL_PROFILE
vendor = Microsoft
extensions = cl_khr_icd

[3,1] =

scalar structure containing the fields:

platform_index = 2
name = Intel(R) OpenCL
version = OpenCL 2.1 WINDOWS
profile = FULL_PROFILE
vendor = Intel(R) Corporation
extensions = cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_khr_il_program cl_intel_unified_shared_memory_preview cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_spirv_subgroups cl_intel_required_subgroup_size cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer

}

devices =
{
[1,1] =
{
[1,1] =

scalar structure containing the fields:

platform_index = 0
device_index = 0
name = gfx90c
vendor = Advanced Micro Devices, Inc.
type = 4
version =

scalar structure containing the fields:

driver: 1x18 sq_string
device: 1x27 sq_string
opencl_c: 1x13 sq_string
profile: 1x12 sq_string
vendorid: 1x1 scalar

compute =

scalar structure containing the fields:

units: 1x1 scalar
max_dimension: 1x1 scalar
max_workgroup_size: 1x1 scalar
max_workitems_size: 1x3 matrix
clock_frequency: 1x1 scalar

mem =

scalar structure containing the fields:

global: 1x1 scalar struct
local: 1x1 scalar struct
const: 1x1 scalar struct
param: 1x1 scalar struct
address_bits: 1x1 scalar
align: 1x1 scalar struct
little_endian: 1x1 scalar
host_unified: 1x1 scalar
vector_width: 1x1 scalar struct

caps =

scalar structure containing the fields:

device_available: 1x1 scalar
compiler_available: 1x1 scalar
queue_props: 1x1 scalar
execution: 1x1 scalar
profile_timer_res: 1x1 scalar
error_correction: 1x1 scalar
half: 1x1 scalar struct
single: 1x1 scalar struct
double: 1x1 scalar struct
images: 1x1 scalar struct
extensions: 1x683 sq_string


}

[2,1] =
{
[1,1] =

scalar structure containing the fields:

platform_index = 1
device_index = 0
name = AMD Radeon(TM) Graphics
vendor = Microsoft
type = 4
version =

scalar structure containing the fields:

driver: 1x5 sq_string
device: 1x31 sq_string
opencl_c: 1x13 sq_string
profile: 1x12 sq_string
vendorid: 1x1 scalar

compute =

scalar structure containing the fields:

units: 1x1 scalar
max_dimension: 1x1 scalar
max_workgroup_size: 1x1 scalar
max_workitems_size: 1x3 matrix
clock_frequency: 1x1 scalar

mem =

scalar structure containing the fields:

global: 1x1 scalar struct
local: 1x1 scalar struct
const: 1x1 scalar struct
param: 1x1 scalar struct
address_bits: 1x1 scalar
align: 1x1 scalar struct
little_endian: 1x1 scalar
host_unified: 1x1 scalar
vector_width: 1x1 scalar struct

caps =

scalar structure containing the fields:

device_available: 1x1 scalar
compiler_available: 1x1 scalar
queue_props: 1x1 scalar
execution: 1x1 scalar
profile_timer_res: 1x1 scalar
error_correction: 1x1 scalar
half: 1x1 scalar struct
single: 1x1 scalar struct
double: 1x1 scalar struct
images: 1x1 scalar struct
extensions: 1x168 sq_string


[2,1] =

scalar structure containing the fields:

platform_index = 1
device_index = 1
name = Microsoft Basic Render Driver
vendor = Microsoft
type = 4
version =

scalar structure containing the fields:

driver: 1x5 sq_string
device: 1x31 sq_string
opencl_c: 1x13 sq_string
profile: 1x12 sq_string
vendorid: 1x1 scalar

compute =

scalar structure containing the fields:

units: 1x1 scalar
max_dimension: 1x1 scalar
max_workgroup_size: 1x1 scalar
max_workitems_size: 1x3 matrix
clock_frequency: 1x1 scalar

mem =

scalar structure containing the fields:

global: 1x1 scalar struct
local: 1x1 scalar struct
const: 1x1 scalar struct
param: 1x1 scalar struct
address_bits: 1x1 scalar
align: 1x1 scalar struct
little_endian: 1x1 scalar
host_unified: 1x1 scalar
vector_width: 1x1 scalar struct

caps =

scalar structure containing the fields:

device_available: 1x1 scalar
compiler_available: 1x1 scalar
queue_props: 1x1 scalar
execution: 1x1 scalar
profile_timer_res: 1x1 scalar
error_correction: 1x1 scalar
half: 1x1 scalar struct
single: 1x1 scalar struct
double: 1x1 scalar struct
images: 1x1 scalar struct
extensions: 1x168 sq_string


}

[3,1] =
{
[1,1] =

scalar structure containing the fields:

platform_index = 2
device_index = 0
name = AMD Ryzen 7 5700G with Radeon Graphics
vendor = Intel(R) Corporation
type = 2
version =

scalar structure containing the fields:

driver: 1x21 sq_string
device: 1x20 sq_string
opencl_c: 1x13 sq_string
profile: 1x12 sq_string
vendorid: 1x1 scalar

compute =

scalar structure containing the fields:

units: 1x1 scalar
max_dimension: 1x1 scalar
max_workgroup_size: 1x1 scalar
max_workitems_size: 1x3 matrix
clock_frequency: 1x1 scalar

mem =

scalar structure containing the fields:

global: 1x1 scalar struct
local: 1x1 scalar struct
const: 1x1 scalar struct
param: 1x1 scalar struct
address_bits: 1x1 scalar
align: 1x1 scalar struct
little_endian: 1x1 scalar
host_unified: 1x1 scalar
vector_width: 1x1 scalar struct

caps =

scalar structure containing the fields:

device_available: 1x1 scalar
compiler_available: 1x1 scalar
queue_props: 1x1 scalar
execution: 1x1 scalar
profile_timer_res: 1x1 scalar
error_correction: 1x1 scalar
half: 1x1 scalar struct
single: 1x1 scalar struct
double: 1x1 scalar struct
images: 1x1 scalar struct
extensions: 1x587 sq_string


}

}

summary =
{
[1,1] =

scalar structure containing the fields:

type = GPU
fp64 = 1
version = 2
platform_index = 0
device_index = 0
name = gfx90c

[2,1] =

scalar structure containing the fields:

type = CPU
fp64 = 1
version = 2
platform_index = 2
device_index = 0
name = AMD Ryzen 7 5700G with Radeon Graphics

[3,1] =

scalar structure containing the fields:

type = GPU
fp64 = 0
version = 1
platform_index = 1
device_index = 0
name = AMD Radeon(TM) Graphics

[4,1] =

scalar structure containing the fields:

type = GPU
fp64 = 0
version = 1
platform_index = 1
device_index = 1
name = Microsoft Basic Render Driver

}


>>


Best wishes,

Jinchuan

0 Likes
23 Replies
jinchuantang
Adept II

My last post did not fully show how I located the errors.

Here are the steps:

Since the Octave GUI disappears when executing any OpenCL-related statement such as gpuArray(1,5), I wanted to see what Octave returned before it crashed. Hence, I used the Octave CLI in the Windows 10 cmd.

1. The file should be at "C:\Octave-6.3.0\mingw64\bin\octave-cli-6.3.0.exe" or a similar location.

2. Type octave-cli-6.3.0.exe at the C:\Octave-6.3.0\mingw64\bin> prompt in the Windows cmd.

3. Now we can type any Octave command.

4. >> pkg load ocl

5. Line 4 loads the ocl pkg.

6. >>gpuArray(1,5)

7. At this point the cmd returns an LLVM-related error with AMD's OpenCL driver.

0 Likes

If you are using ROCm then you can also post at Github ROCm Forum from here: https://github.com/RadeonOpenCompute/ROCm/issues

0 Likes

Thanks! I have submitted it to the ROCm GitHub. By the way, to Dr. Lisa Su and the sales team: this year my team ordered 100k British pounds' worth of equipment, including servers and workstations. I ordered only 5 Huawei B515 PCs equipped with the AMD 4600G plus one 5700G CPU. For the rest, I had to order Intel CPUs and Nvidia 3090/3080 Ti GPUs because of the bugs (like this one) and missing features (such as CUDA, which is important to Matlab and other neural-network jobs; we strive to replace CUDA in Octave) that my colleagues and I have faced with AMD's CPUs/GPUs. I wish AMD had more success stories in the scientific community rather than in games only, and I hope to say "AMD yes" more often in the future!

0 Likes

Hi @jinchuantang ,

Thank you for reporting the issue. I have whitelisted you and moved the post to the OpenCL forum. 

We will look into this and get back to you. Please share the driver details and clinfo output.

 

Thanks.

0 Likes

As I can see, the ROCm support team has already replied here:  A potential OpenCL driver problem with AMD 5700G APUs

0 Likes

Thanks! I guess that is the answer for ROCm. Does AMD use ROCm as its OpenCL driver on Windows? AMD clearly states OpenCL 2.0 support for this APU in the Radeon hardware details. If the same answer applies to Windows, I guess the options for me and my team are quite clear now. Gosh! I am glad I mostly made the right choice with Nvidia cards and Intel CPUs with integrated graphics; maybe Intel Arc will be a good choice for us in the future.

0 Likes

As per the driver page below, the latest AMD graphics driver supports this APU on Windows. If you are using this latest driver, please provide the clinfo output.

https://www.amd.com/en/support/apu/amd-ryzen-processors/amd-ryzen-7-5000-g-series-desktop-processors...

Do you observe this crash issue with any OpenCL program or is it specific to "Octave" application only?

0 Likes

Dear Dipak,

Thank you very much for still chasing this! It seems to be specific to Octave only. It feels like the driver is rejecting any command from Octave. The driver is the latest, 21.10.2. I tried LuxMark as before, and it works on both the 5700G and the old Vega APU I used in the past. So the problem somehow comes from Octave-ocl, which triggers a failure in AMD's driver but not in Intel's or Nvidia's. If you believe there is something that needs careful handling with AMD's driver, please let me know, and I can modify Octave-ocl's OpenCL code accordingly. By the way, Octave-ocl targets OpenCL version 1.1 in order to support more devices, and I believe OpenCL 3.0 accepts anything written for versions before 1.2 without any change (which is the case for Intel's NEO and Nvidia's OpenCL 3.0 drivers).

 

Microsoft Windows [Version 10.0.22000.258]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Owner>clinfo
Number of platforms: 3
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3302.6)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 D3D12 Implementation
Platform Name: OpenCLOn12
Platform Vendor: Microsoft
Platform Extensions: cl_khr_icd
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 WINDOWS
Platform Name: Intel(R) OpenCL
Platform Vendor: Intel(R) Corporation
Platform Extensions: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_khr_il_program cl_intel_unified_shared_memory_preview cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_spirv_subgroups cl_intel_required_subgroup_size cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer


Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon(TM) Graphics
Device Topology: PCI[ B#48, D#0, F#0 ]
Max compute units: 8
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 2000Mhz
Address bits: 64
Max memory allocation: 16221025075
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 36263428096
Constant buffer size: 16221025075
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 3336123187
Max global variable size: 14598922496
Max global variable preferred total size: 36263428096
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00007FF9603E3490
Name: gfx90c
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3302.6 (PAL,HSAIL)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (3302.6)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_khr_gl_depth_images cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_liquid_flash cl_amd_copy_buffer_p2p cl_amd_planar_yuv


Platform Name: OpenCLOn12
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Max compute units: 1
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 12Mhz
Address bits: 64
Max memory allocation: 1073741824
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 1024
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 25714018304
Constant buffer size: 65536
Max number of constant args: 15
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 80
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0000020D6F523D20
Name: AMD Radeon(TM) Graphics
Vendor: Microsoft
Device OpenCL C version: OpenCL C 1.2
Driver version: 1.0.0
Profile: FULL_PROFILE
Version: OpenCL 1.2 D3D12 Implementation
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store


Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1414h
Max compute units: 1
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 12Mhz
Address bits: 64
Max memory allocation: 1073741824
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 1024
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 25714018304
Constant buffer size: 65536
Max number of constant args: 15
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 80
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0000020D6F523D20
Name: Microsoft Basic Render Driver
Vendor: Microsoft
Device OpenCL C version: OpenCL C 1.2
Driver version: 1.0.0
Profile: FULL_PROFILE
Version: OpenCL 1.2 D3D12 Implementation
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store


Platform Name: Intel(R) OpenCL
Number of devices: 1
Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 8086h
Max compute units: 16
Max work items dimensions: 3
Max work items[0]: 8192
Max work items[1]: 8192
Max work items[2]: 8192
Max work group size: 8192
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 32
Native vector width short: 16
Native vector width int: 8
Native vector width long: 4
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 0Mhz
Address bits: 64
Max memory allocation: 12857009152
Image support: Yes
Max number of images read arguments: 480
Max number of images write arguments: 480
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 480
Max size of kernel argument: 3840
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: No
Cache type: Read/Write
Cache line size: 64
Cache size: 524288
Global memory size: 51428036608
Constant buffer size: 131072
Max number of constant args: 480
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16383
Max pipe packet size: 1024
Max global variable size: 65536
Max global variable preferred total size: 65536
Max read/write image args: 480
Max on device events: 4294967295
Queue on device max size: 4294967295
Max on device queues: 4294967295
Queue on device preferred size: 4294967295
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: Yes
Atomics: Yes
Preferred platform atomic alignment: 64
Preferred global atomic alignment: 64
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 128
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 100
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0000020D6D458AF8
Name: AMD Ryzen 7 5700G with Radeon Graphics
Vendor: Intel(R) Corporation
Device OpenCL C version: OpenCL C 2.0
Driver version: 2021.12.6.0.19_160000
Profile: FULL_PROFILE
Version: OpenCL 2.1 (Build 0)
Extensions: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_khr_il_program cl_intel_unified_shared_memory_preview cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_spirv_subgroups cl_intel_required_subgroup_size cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer

 

C:\Users\Owner>

0 Likes

Please find the clinfo output here, since the web system rejected my message twice: https://sourceforge.net/u/tangjinchuan/wiki/Can%20I%20use%20AMD%27s%20APUs%20in%20Windows%3F/

I tried LuxMark, and there is no problem.

0 Likes

Hi @jinchuantang 

Thank you for providing the clinfo output.

I was just trying to reproduce the issue as a standalone OpenCL program using the below kernel code and observed a similar LLVM error.

https://sourceforge.net/p/octave-ocl/code/ci/default/tree/src/ocl_array_prog.cc#l582

[Note: Running the Octave program with the OCL compiler dump option shows the compiler options "-DTYPE=double -DFLOATINGPOINT", so I set these options in the test program while building the kernel.]

As per my finding, the below code section seems to be causing the LLVM error.

__kernel void \n\
ocl_max \n\
(__global TYPE *data_dst1, \n\
__global IDX_T *data_dst2, \n\
...)
{
    ....
    if (data_dst2 != data_dst1) \n  <------ this line
        data_dst2 [i] = (IDX_T) (km); \n\
}

After modifying that comparison statement as shown below, the test-program ran fine without any error. [Note, there are multiple statements like this in the above kernel file]

if (data_dst2 != (__global IDX_T*)data_dst1) 

Could you please try the above modification in the octave ocl  kernel file to see if it works on your setup?
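To illustrate the cast outside OpenCL, here is a minimal plain C sketch (not the Octave-ocl kernel itself, and it runs on the CPU). It assumes `TYPE` is `double`, as in the `-DTYPE=double` build option mentioned above, and uses `long` as a stand-in for `IDX_T`, whose real definition is not shown in this thread. `max_with_index` is a hypothetical analogue of `ocl_max`, written only to show that both sides of the comparison have the same pointer type once the cast is added.

```c
#include <assert.h>
#include <stddef.h>

typedef double TYPE;  /* assumption: mirrors the -DTYPE=double build option */
typedef long IDX_T;   /* assumption: stand-in; the real IDX_T is not shown in the thread */

/* Hypothetical CPU analogue of ocl_max: stores the maximum of src[0..n-1]
 * in dst1[0] and, unless dst2 aliases dst1, its zero-based position in dst2[0]. */
void max_with_index(const TYPE *src, size_t n, TYPE *dst1, IDX_T *dst2)
{
    assert(n > 0);  /* an empty input has no maximum */

    size_t km = 0;
    for (size_t k = 1; k < n; k++) {
        if (src[k] > src[km])
            km = k;
    }
    dst1[0] = src[km];

    /* Cast one operand so both sides have type IDX_T *, as in the fix above;
     * comparing TYPE * with IDX_T * directly mixes incompatible pointer types. */
    if (dst2 != (IDX_T *)dst1)
        dst2[0] = (IDX_T)km;
}
```

In this sketch, calling the function with two separate buffers fills both outputs, while passing the value buffer itself (cast to `IDX_T *`) as `dst2` skips the index write and leaves the maximum untouched.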

 

Thanks.

Dear Dipak,

That's great news! I will do it now, even though it is 23:00 here. I remember that other compilers (Intel and Nvidia) generate warnings for those lines, and the original kernel implementations are indeed badly styled there. I will let you know the results ASAP.

 

0 Likes

Dear Dipak,

I have replaced all the corresponding lines in the kernels, and it worked!!! The installable pkg file can be found at the end of this page: https://sourceforge.net/u/tangjinchuan/wiki/Can%20I%20use%20AMD%27s%20APUs%20in%20Windows%3F/

I can't believe it: I once informed the original Octave-ocl author, Matt, so that he could fix code that caused problems with Intel's runtime (https://sourceforge.net/p/octave-ocl/code/ci/9b94e8c1545e6494fd2524e4e06a692d295e7f35/log/), and now you have helped us fix a bug so that it can run on AMD's devices (at least APUs, since I do not have any AMD discrete GPUs). I tried to reach Matt about this a very long time ago; he did not reply to me on it, although he fixed some other problems I had reported previously.

A moment ago, I also ran the ocl_tests command after pkg load ocl in Octave. It did not pass the tests fully, though, and I guess that could be a problem as important as this one. I will try to reach Matt to have a look at his test examples and relax some tolerances if that is appropriate for different devices. By the way, I do have a plan (maybe not fully realizable due to a busy schedule) to port AMD's "well-done" clFFT as well as Intel's improved FFT implementations for specific scenarios to Octave-ocl. Maybe this could be done by one of my students, provided someone is bold enough to choose this topic for his/her Bachelor dissertation.

Thank you very much! 

Best wishes,

Jinchuan Tang 

 

It's good to hear that the Octave-Ocl kernel is working now. Thanks for confirming it.

Thanks.

 

Dear Dipak,

I have run all the commands in ocl_tests offered by Octave-ocl. The driver passed all tests on: ocl_constant, ocl_lib, ocl_context, constructor, utility function, operator, dimension-wise (math), mapping (math) function, ocl program data type.

However, it fails randomly on index related tests.

I have uploaded the test file at the end of the page https://sourceforge.net/u/tangjinchuan/wiki/Can%20I%20use%20AMD%27s%20APUs%20in%20Windows%3F/ for your reference.

There is no problem if I run this part with Microsoft's OpenCL on Dx12 runtime.

Thank you very much!

Best wishes,

Jinchuan

0 Likes

Hi @jinchuantang ,

"However, it fails randomly on index related tests."

Did you observe any opencl compilation/runtime error while running the above test?

As I am not familiar with the Octave ocl package, it would be helpful if you can please point to the related kernel/runtime code which can be used to reproduce the issue. 

Thanks.

ocl_index_op:
https://sourceforge.net/p/octave-ocl/code/ci/default/tree/src/ocl_ov_matrix.cc#l304
index function:
https://sourceforge.net/p/octave-ocl/code/ci/default/tree/src/ocl_array.cc#l594
ocl_index kernel:
https://sourceforge.net/p/octave-ocl/code/ci/default/tree/src/ocl_array_prog.cc#l238

As I understand it, "ocl_index_op" sorts out the different indexing scenarios, while the "index" function calls the "ocl_index" kernel.

Thank you for sharing the above links.

To investigate an OpenCL driver-related issue, we need a minimal test-case that reproduces it. From the above links, it looks like the OpenCL host code (i.e. the "index" function, where the "ocl_index" kernel is enqueued) depends on other user-defined structures, so I'm not sure how to isolate the related host code to create a test-case for this issue. Could you please provide wrapper host code to verify the issue?

Thanks.

Dear Dipak,

After spending this Saturday on it, I have successfully located the error. It is a platform-related problem regarding memory coherency when executing multiple kernels one after another. When a gpuArray is used to index another gpuArray, all the indices must be reduced by one, because C/OpenCL is zero-based while Octave is one-based. The code here: https://sourceforge.net/p/octave-ocl/code/ci/default/tree/src/ocl_ov_matrix.cc#l324

issues a kernel to do the subtraction by one, but the line executed immediately afterwards somehow sees either the correct new array (with the subtraction) or the wrong original array (without it). Reproducing such a problem with independent host code is very hard. Somehow, Intel's and Nvidia's runtimes avoided the problem in a short round of tests (they may have the same problem, but limited tests could not show it). I guess more work is needed to check the coherency of the results across all platforms in Octave-ocl.

I have now created new kernels for the case above that combine the subtraction and the indexing in a single step to avoid this problem. With this change it works fully, and AMD's 5700G APU passes all ocl_tests for the first time. I will announce an Octave-ocl package for AMD APUs, Intel APUs, and Nvidia GPUs on my SourceForge wiki page later (https://sourceforge.net/u/tangjinchuan/wiki/Can%20I%20use%20AMD%27s%20APUs%20in%20Windows%3F/). Before that, more testing is needed this weekend on different platforms.
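
To make the two approaches concrete, here is a plain-Python sketch. It is only a host-side model under stated assumptions, not the actual Octave-ocl kernels: the names subtract_one, gather and gather_fused are hypothetical, and ordinary lists stand in for OpenCL buffers. It shows the separate subtraction pass followed by indexing, the fused variant that subtracts one while indexing, and the shifted result one would get if the indexing step read the original one-based indices.

```python
def subtract_one(one_based):
    """Model of the separate subtraction kernel: 1-based -> 0-based indices."""
    return [i - 1 for i in one_based]


def gather(data, zero_based):
    """Model of the indexing kernel: out[k] = data[idx[k]] for 0-based idx."""
    return [data[i] for i in zero_based]


def gather_fused(data, one_based):
    """Model of the fused kernel: subtract one and index in a single pass."""
    return [data[i - 1] for i in one_based]


data = [10.0, 20.0, 30.0, 40.0]
idx = [3, 1, 2]  # Octave-style (1-based) indices

# Two passes: the second pass must see the output of the first.
two_pass = gather(data, subtract_one(idx))

# One pass: there is no intermediate index array to go stale.
fused = gather_fused(data, idx)

# Failure mode: the indexing step reads the original, unshifted indices.
stale = gather(data, idx)

print(two_pass, fused, stale)
```

Both correct paths give [30.0, 10.0, 20.0], while the stale path gives [40.0, 20.0, 30.0], i.e. every element is taken one position too far to the right.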

Thank you very much for all the support!

Best wishes,

Jinchuan Tang

It's nice to hear that the issue has been resolved. I really appreciate your effort to identify the memory coherency problem and fix it. 

Thanks.

Dear Dipak,

I would like to raise another doubt about the coherency problem. As you know, I added a workaround that performs the index subtraction (MATLAB-based index to C-language-based index) and the subsequent operation in one kernel instead of two; as a result, Octave-ocl can work fully on AMD devices. As we know, from the programmer's point of view kernels are assumed to execute in the order of submission, while the driver may optimize by executing them in parallel as long as the two kernels do not affect each other's results.

At that time, I suspected that Intel/Nvidia might have the same potential problem, but this did not happen, which implies that they both handle the order of execution correctly. Could it be that something is wrong in AMD's driver, in that the compiler did not detect that the two kernels are related and somehow wrongly optimized them to execute in parallel (maybe I am completely wrong)? This doubt arose as soon as I discovered that Apple's M1 machine (I don't know whether its GPU is from Imagination Technologies, an in-house product, or something else) can also execute the index-related tests correctly without the workaround.

I became uncomfortable about this as soon as I discovered that only AMD behaves differently with this code, so I thought it would be better for you to know. Any suggestions would be highly appreciated.

P.S. Octave-ocl's author Matt has been asking me about my modification to support AMD devices. I think it would be more prudent to discuss this doubt with you before taking any further steps towards adding code to the main branch to support AMD devices.

Best wishes,

Jinchuan

Hi Jinchuan,

As we know, from the programmer's point of view kernels are assumed to execute in the order of submission, while the driver may optimize by executing them in parallel as long as the two kernels do not affect each other's results.

Yes, if the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. So, if the application enqueues kernel A before kernel B, then

1) the application can assume that kernel A finishes first and then kernel B is executed

2) if the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A. 

[clCreateCommandQueue.html]
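
The in-order guarantee in points 1) and 2) can be pictured with a small toy model. This is a plain-Python sketch, not OpenCL: the class InOrderQueue and the functions kernel_a and kernel_b are hypothetical names, lists stand in for memory objects, and for simplicity the toy defers all work until finish() is called, whereas a real runtime may start executing earlier.

```python
from collections import deque


class InOrderQueue:
    """Toy stand-in for an in-order command queue (not a real OpenCL API)."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, command, *args):
        # Record the command and return immediately, without waiting for it.
        self._pending.append((command, args))

    def finish(self):
        # Synchronization point: run every pending command to completion,
        # strictly in the order in which it was submitted.
        while self._pending:
            command, args = self._pending.popleft()
            command(*args)


def kernel_a(idx):
    # Producer: convert 1-based indices to 0-based, in place.
    for k in range(len(idx)):
        idx[k] -= 1


def kernel_b(data, idx, out):
    # Consumer: gather from data using the indices written by kernel_a.
    for k in range(len(idx)):
        out[k] = data[idx[k]]


data = [10.0, 20.0, 30.0, 40.0]
idx = [3, 1, 2]
out = [0.0] * len(idx)

queue = InOrderQueue()
queue.enqueue(kernel_a, idx)             # kernel A is submitted first
queue.enqueue(kernel_b, data, idx, out)  # kernel B consumes A's output
before_finish = list(out)                # in this toy, nothing has run yet
queue.finish()                           # the host may read out after this

print(before_finish, idx, out)
```

In this model kernel B always sees the indices already shifted by kernel A, and the host only relies on out after the blocking finish() call.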

In OpenCL, command synchronization is defined in terms of distinct synchronization points (e.g. completion of a command, command-queue barrier, blocking commands, clFinish etc.). The synchronization points are very important for the memory consistency and portability of an OpenCL application. As section 3.2.4 (Execution Model: Synchronization) of the OpenCL specification says:

"A synchronization point between a pair of commands (A and B) assures that results of command A happens-before command B is launched. This requires that any updates to memory from command A complete and are made available to other commands before the synchronization point completes. Likewise, this requires that command B waits until after the synchronization point before loading values from global memory. "

 

Could it be that something is wrong in AMD's driver, in that the compiler did not detect that the two kernels are related and somehow wrongly optimized them to execute in parallel (maybe I am completely wrong)?

Without investigating the issue, it would be difficult to say whether this is an AMD OpenCL driver bug or not. We need a minimal test-case that manifests the issue. Last time I was unable to isolate the related host-code to create a test-case. If you can provide a repro, I can help you to report the issue to the OpenCL team.

 

Thanks.

Dear Dipak,

Many thanks for your kind response. Today I tried to create a minimal test code to pin down the problem, and I was amazed to find that there is no problem at all with AMD's driver. Going back to the coherency part of the code in Octave-ocl, I found a bug related to a static constructor. It had only been reported by a compiler in the openSUSE environment, and at that time I did not realize it could lead to the subtraction problem. So now I can use the two kernels to do the job separately without the coherency problem. Judging from this case, along with the previous kernel problem that you helped me solve, I have to say you have a very high-quality OpenCL implementation.

P.S. I have helped Intel to locate a bug in their OpenCL driver for HD Graphics in "Re: Driver problem with OpenCL kernel - Intel Communities".

Best wishes,

Jinchuan

Hi Jinchuan,

Thank you for your kind words. It's good to know that you have found the root cause of the issue and fixed it. I hope it will help others to run the Octave-ocl package on AMD GPUs. I really appreciate your effort.

Thanks.