OpenCL

shri
Adept I

cl::Program::build hung

I have an application that runs for a fixed number of iterations. In each iteration it creates a new platform, initializes the context and devices, builds the OpenCL kernels, runs them, and then releases the devices, context, and platform. In some random iteration it fails to build the kernels and hangs forever. This happens only when I have two GPUs; with one GPU I don't see the problem. The same piece of code works fine when I run it on a W9100 GPU but fails (in the way described above) on two WX 9100 GPUs.
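Since the hang appears only in some random iteration, one way to record which iteration it is (and to stop the process there so a debugger can be attached) is to run the build call under a time limit. The following is a minimal sketch, not part of the original application: `run_with_watchdog` and `BuildStatus` are hypothetical names, and `work` stands in for whatever wraps `cl::Program::build`. It cannot cancel a hung build; it only reports that the deadline passed.

```cpp
#include <chrono>
#include <exception>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Outcome of one guarded call (hypothetical helper type).
enum class BuildStatus { Finished, TimedOut };

// Runs `work` on a detached helper thread and waits at most `timeout` for it.
// `work` is a stand-in for the real call, e.g. a lambda wrapping
// cl::Program::build. A timed-out call is NOT cancelled: the helper thread
// keeps running (or stays hung), so the caller should log the iteration and
// stop rather than continue using the same OpenCL objects.
inline BuildStatus run_with_watchdog(std::function<void()> work,
                                     std::chrono::milliseconds timeout) {
    // The promise is shared so it outlives this function if the call hangs.
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();

    std::thread([work = std::move(work), done]() {
        try {
            work();
            done->set_value();
        } catch (...) {
            // Forward any exception from `work` to the waiting thread.
            done->set_exception(std::current_exception());
        }
    }).detach();

    if (finished.wait_for(timeout) == std::future_status::timeout) {
        return BuildStatus::TimedOut;
    }
    finished.get();  // rethrows an exception thrown by `work`, if any
    return BuildStatus::Finished;
}
```

For example, the build step could be wrapped as `run_with_watchdog([&] { program.build(devices, options); }, std::chrono::seconds(120))`, where `program`, `devices`, and `options` are the application's own objects and the 120-second limit is an arbitrary choice. On `BuildStatus::TimedOut` the application can log the iteration number and abort, leaving the process in a state where a backtrace of all threads can be collected.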

Any help would be appreciated. Below is my environment. 

 

Output of clinfo:

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Vega 10 XT [Radeon PRO WX 9100]
Device Topology: PCI[ B#61, D#0, F#0 ]
Max compute units: 64
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1500Mhz
Address bits: 64
Max memory allocation: 14588628168
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 8192
Max samplers within kernel: 26721
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 17163091968
Constant buffer size: 14588628168
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 1703726280
Max global variable size: 14588628168
Max global variable preferred total size: 17163091968
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7fd3e2148db0
Name: gfx900:xnack-
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3261.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program


Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Vega 10 XT [Radeon PRO WX 9100]
Device Topology: PCI[ B#179, D#0, F#0 ]
Max compute units: 64
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1500Mhz
Address bits: 64
Max memory allocation: 14588628168
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 8192
Max samplers within kernel: 26721
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 17163091968
Constant buffer size: 14588628168
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 1703726280
Max global variable size: 14588628168
Max global variable preferred total size: 17163091968
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7fd3e2148db0
Name: gfx900:xnack-
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3261.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

--------------------

Below is the backtrace from gdb, taken when my application hung:

[Switching to thread 15 (Thread 0x7fb32e7f4700 (LWP 20928))]
#0 0x00007fb341fcbc07 in sched_yield () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fb341fcbc07 in sched_yield () from /lib64/libc.so.6
#1 0x00007fb33ab6992d in rocr::AMD::AqlQueue::ExecutePM4(unsigned int*, unsigned long) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1
#2 0x00007fb33ab5bd01 in rocr::AMD::GpuAgent::InvalidateCodeCaches() () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1
#3 0x00007fb33ab6bfbf in rocr::amd::LoaderContext::SegmentAlloc(amdgpu_hsa_elf_segment_t, hsa_agent_s, unsigned long, unsigned long, bool) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1
#4 0x00007fb33abb3cbf in rocr::amd::hsa::loader::ExecutableImpl::LoadSegmentsV2(hsa_agent_s, rocr::amd::hsa::code::AmdHsaCode const*) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1
#5 0x00007fb33abb43de in rocr::amd::hsa::loader::ExecutableImpl::LoadSegments(hsa_agent_s, rocr::amd::hsa::code::AmdHsaCode const*, unsigned int) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1
#6 0x00007fb33abb7ce7 in rocr::amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, unsigned long, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, hsa_loaded_code_object_s*) ()
from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1
#7 0x00007fb33abb8543 in rocr::amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, hsa_loaded_code_object_s*) ()
from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1
#8 0x00007fb33ab87879 in rocr::HSA::hsa_executable_load_agent_code_object(hsa_executable_s, hsa_agent_s, hsa_code_object_reader_s, char const*, hsa_loaded_code_object_s*) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1
#9 0x00007fb3484459a8 in roc::LightningProgram::setKernels(amd::option::Options*, void*, unsigned long, int, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /opt/amdgpu-pro/lib64/libamdocl64.so
#10 0x00007fb34843e520 in device::Program::linkImplLC(amd::option::Options*) () from /opt/amdgpu-pro/lib64/libamdocl64.so
#11 0x00007fb34843ef4d in device::Program::build(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /opt/amdgpu-pro/lib64/libamdocl64.so
#12 0x00007fb3483e8b11 in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool, bool) () from /opt/amdgpu-pro/lib64/libamdocl64.so
#13 0x00007fb3483b86c3 in clBuildProgram () from /opt/amdgpu-pro/lib64/libamdocl64.so
#14 0x0000000000882529 in cl::Program::build (data=0x0, notifyFptr=0x0,
options=0x7faec81df590 "-cl-mad-enable -cl-std=CL2.0 -D NX=140 -D NY=140 -D NZ=89 -D NZ_ALIGNED=92 -D NU=357 -D NT=27 -D NT_MINUS1_DIV2=13 -D NV=1981 -D NV_ALIGNED=2116 -D NV_LOOP=576 -D NALPHA=22 -D N_UDB=360 -D ZBUFFER_SIZ"...,
devices=std::vector of length 1, capacity 1 = {...}, this=<optimized out>, this=<optimized out>, this=<optimized out>) at....

0 Likes
14 Replies
dipak
Big Boss

Hi @shri ,

Thank you for reporting it. I have whitelisted you and moved the post to the OpenCL forum.

Please provide a minimal test-case that reproduces the above issue. Also, please share the setup details like OS, driver version etc.

Thanks.

 

0 Likes

Sorry, I cannot provide steps to reproduce, as this code is internal to our organization.

OS is SLES 15 SP3

The installed driver packages are listed below:

amdgpu-dkms-5.11.5.32-1310811.noarch

opencl-rocr-amdgpu-pro-21.20-1310811.x86_64

amdgpu-core-21.20-1310811.noarch

amdgpu-pro-core-21.20-1310811.noarch

comgr-amdgpu-pro-2.1.0-1310811.x86_64

hsa-runtime-rocr-amdgpu-1.3.0-1310811.x86_64

amdgpu-pro-versionlist-21.20-1310811.noarch

libdrm-amdgpu-common-1.0.0-1310811.noarch

ocl-icd-amdgpu-pro-21.20-1310811.x86_64

clinfo-amdgpu-pro-21.20-1310811.x86_64

amdgpu-pro-rocr-opencl-21.20-1310811.x86_64

amdgpu-versionlist-21.20-1310811.noarch

libdrm-amdgpu-2.4.100-1310811.x86_64

hsakmt-roct-amdgpu-1.0.9-1310811.x86_64

hip-rocr-amdgpu-pro-21.20-1310811.x86_64

amdgpu-dkms-firmware-5.11.5.32-1310811.noarch

opencl-orca-amdgpu-pro-icd-21.20-1310811.x86_64

---------------------------------------------

Below are the traces of the two threads at the point where it hangs:

-------------- thread 1 -------------

#0  0x00007ff564b90c07 in sched_yield () from /lib64/libc.so.6

#1  0x00007ff5583ca92d in rocr::AMD::AqlQueue::ExecutePM4(unsigned int*, unsigned long) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1

#2  0x00007ff5583bcd01 in rocr::AMD::GpuAgent::InvalidateCodeCaches() () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1

#3  0x00007ff5583ccfbf in rocr::amd::LoaderContext::SegmentAlloc(amdgpu_hsa_elf_segment_t, hsa_agent_s, unsigned long, unsigned long, bool) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1

#4  0x00007ff558414cbf in rocr::amd::hsa::loader::ExecutableImpl::LoadSegmentsV2(hsa_agent_s, rocr::amd::hsa::code::AmdHsaCode const*) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1

#5  0x00007ff5584153de in rocr::amd::hsa::loader::ExecutableImpl::LoadSegments(hsa_agent_s, rocr::amd::hsa::code::AmdHsaCode const*, unsigned int) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1

#6  0x00007ff558418ce7 in rocr::amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, unsigned long, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, hsa_loaded_code_object_s*) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1

------------ thread 2 -------------

(gdb) bt

#0  0x00007ff564ba0807 in ioctl () from /lib64/libc.so.6

#1  0x00007ff5490271d8 in kmtIoctl (fd=41, request=request@entry=3222817548, arg=arg@entry=0x7ff557f51ad0) at /home/foreman/rpmbuild/SOURCES/libhsakmt/src/libhsakmt.c:13

#2  0x00007ff54901fd3b in hsaKmtWaitOnMultipleEvents (Events=0x7ff557f51bf0, NumEvents=3, WaitOnAll=<optimized out>, Milliseconds=<optimized out>) at /home/foreman/rpmbuild/SOURCES/libhsakmt/src/events.c:312

#3  0x00007ff55840cb85 in rocr::core::Signal::WaitAny(unsigned int, hsa_signal_s const*, hsa_signal_condition_t const*, long const*, unsigned long, hsa_wait_state_t, long*) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1

#4  0x00007ff5583efe9e in rocr::AMD::hsa_amd_signal_wait_any(unsigned int, hsa_signal_s*, hsa_signal_condition_t*, long*, unsigned long, hsa_wait_state_t, long*) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1

#5  0x00007ff558404520 in rocr::core::Runtime::AsyncEventsLoop(void*) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1

#6  0x00007ff5583aff37 in rocr::os::ThreadTrampoline(void*) () from /opt/amdgpu-pro/lib64/libhsa-runtime64.so.1

#7  0x00007ff564e8794a in start_thread () from /lib64/libpthread.so.0

#8  0x00007ff564baad0f in clone () from /lib64/libc.so.6

----------------------------------------------------

 

Any help would be appreciated.

 

 

0 Likes

I understand your concern. However, since the hang occurs randomly, it would be difficult for us to investigate the issue without a reproducible test case. If you don't want to post it here, you can send the repro (code/binary) to us privately via email or PM. I can share my official email ID if needed.

Another point: judging from the driver information above, this appears to be an older driver based on 21.20. The latest driver, 21.Q4 (base driver 21.40), is available here: https://www.amd.com/en/support/kb/release-notes/rn-pro-lin-21-q4

Please try this driver and see whether the issue persists.

Thanks.

 

0 Likes

Thank you for the reply.

I installed the new driver ("amdgpu-*-21.50") yesterday, but without success. If you share your email address, I can send some more information.

0 Likes

Thanks for the information. I have sent you a private message. Please check your community inbox.

0 Likes

We did try the new driver (21.50.2); it did not help, and we see the same behavior.

0 Likes

Can you send me the logs, including the OS distribution, kernel version, platform, and system BIOS version?
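The details requested here can be gathered in one step on a Linux host. The sketch below is only an illustration: the function names are hypothetical, and it assumes the usual locations `/etc/os-release` and `/sys/class/dmi/id/` are present and readable, falling back to "unknown" when they are not.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <sys/utsname.h>

// Returns the first line of a text file, or `fallback` if it cannot be read.
inline std::string first_line_or(const std::string& path,
                                 const std::string& fallback) {
    std::ifstream in(path);
    std::string line;
    if (in && std::getline(in, line) && !line.empty()) return line;
    return fallback;
}

// Looks up KEY in an os-release style file (KEY="value") and returns the
// value with surrounding double quotes removed, or "unknown" if not found.
inline std::string os_release_value(const std::string& key,
                                    const std::string& path = "/etc/os-release") {
    std::ifstream in(path);
    std::string line;
    const std::string prefix = key + "=";
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) != 0) continue;
        std::string value = line.substr(prefix.size());
        if (value.size() >= 2 && value.front() == '"' && value.back() == '"') {
            value = value.substr(1, value.size() - 2);
        }
        return value;
    }
    return "unknown";
}

// Builds a short text report: kernel, OS distribution, platform, BIOS version.
// The sysfs DMI paths are an assumption about the host, not a guarantee.
inline std::string system_report() {
    std::ostringstream out;
    struct utsname u {};
    if (uname(&u) == 0) {
        out << "Kernel: " << u.sysname << ' ' << u.release << '\n';
    } else {
        out << "Kernel: unknown\n";
    }
    out << "OS: " << os_release_value("PRETTY_NAME") << '\n';
    out << "Platform: "
        << first_line_or("/sys/class/dmi/id/product_name", "unknown") << '\n';
    out << "BIOS version: "
        << first_line_or("/sys/class/dmi/id/bios_version", "unknown") << '\n';
    return out.str();
}
```

Printing `system_report()` once at application start puts these details next to the build logs, so they are already on hand when a hang is reported.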

0 Likes

OS is SLES 15 SP3
Kernel is 5.3.18-59.24-default
Platform is Dell PowerEdge T640 (built by Dell)
BIOS version is 1.4.8 (this is from mid-2018; is it plausible that the BIOS revision affects this behavior?)

0 Likes

Updating the BIOS didn't help either. We see the same hang with the updated driver and BIOS.

0 Likes

We need to repro the issue. Can you assist?

0 Likes

Please let us know how we can assist. Unfortunately, our attempts to reproduce the issue in a private environment have not been successful, and we may not be able to share the production code.

On a separate topic: we used the AGT tool to get GPU diagnostics on the W9100 and WX 9100 GPUs, but with the latest driver the AGT tool intermittently fails to report for some W9100 GPUs. We have requested an updated AGT tool but haven't heard back from AMD on this yet.

0 Likes

Which version of the AGT tool did you try?

0 Likes