cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

sk7041
Journeyman III

AMD OpenCL - compiler segmentation fault

Hello All,

I have recently been testing my OpenCL code on an AMD HD7970 GPU, and some of my kernels are causing the compiler to crash with a segmentation fault at clBuildProgram() . I would like to mention that the kernels compile and run fine on any NVIDIA device, and on CPU with AMD SDK.

Information about my system:

Description:  Debian GNU/Linux 6.0.1 (squeeze)

Arch: x86_64

CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4200+

GPU: AMD HD7970

AMD OpenCL SDK: 2.8

Driver: AMD Catalyst proprietary driver 12.10

Here are the back traces given by GDB for the 2 kernels that produce the segmentation fault. I would like to mention that these are ~400 lines long kernels, with many nested 'for' loops, and requiring a fairly large amount of private memory.

back trace for kernel 1:

Program received signal SIGSEGV, Segmentation fault.

0x00007ffff49a1b88 in SCRegSpill::CreateReload(SCInst*, int, SCInst*, SCBlock*, bitset*, bitset*, int) () from /usr/lib/libamdocl64.so

(gdb) bt

#0  0x00007ffff49a1b88 in SCRegSpill::CreateReload(SCInst*, int, SCInst*, SCBlock*, bitset*, bitset*, int) () from /usr/lib/libamdocl64.so

#1  0x00007ffff49b2533 in SCRegSpill::Spill() () from /usr/lib/libamdocl64.so

#2  0x00007ffff49b5160 in SCRegAlloc::Allocate(bool) () from /usr/lib/libamdocl64.so

#3  0x00007ffff49b54af in SCRegAlloc::AllocateRegisters() () from /usr/lib/libamdocl64.so

#4  0x00007ffff45b0b5f in CompilerBase::GenerateCodeUsingNewIR(void*, bool) () from /usr/lib/libamdocl64.so

#5  0x00007ffff45b6764 in Compiler::Compile(ILProgram*) () from /usr/lib/libamdocl64.so

#6  0x00007ffff45b6ee0 in Compiler::CompileShader(unsigned char*, unsigned char*, unsigned int const*, CompilerExternal*) ()

   from /usr/lib/libamdocl64.so

#7  0x00007ffff45b3227 in CompilerExternal::CompileShader(_SC_SRCSHADER const*, _SC_HWSHADER*) () from /usr/lib/libamdocl64.so

#8  0x00007ffff49cffc2 in scWrapCompileBinarySI(void*, unsigned int, void**, unsigned int*, unsigned int, unsigned int, scWrapOptionEnum*)

    () from /usr/lib/libamdocl64.so

#9  0x00007ffff458df6b in amuCompCompile () from /usr/lib/libamdocl64.so

#10 0x00007ffff458ecee in ddiCompile () from /usr/lib/libamdocl64.so

#11 0x00007ffff44cb91e in gpu::NullKernel::create(stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, void const*, unsigned long) ()

   from /usr/lib/libamdocl64.so

#12 0x00007ffff44d05d3 in gpu::Kernel::create(stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, void const*, unsigned long) ()

   from /usr/lib/libamdocl64.so

#13 0x00007ffff44df058 in gpu::Program::createKernel(stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, gpu::Kernel::InitData const*, stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, bool*, void const*, unsigned long) ()

   from /usr/lib/libamdocl64.so

#14 0x00007ffff44de2ca in gpu::NullProgram::linkImpl(amd::option::Options*) () from /usr/lib/libamdocl64.so

#15 0x00007ffff4479055 in device::Program::build(stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, char const*, amd::option::Options*) () from /usr/lib/libamdocl64.so

#16 0x00007ffff4489030 in amd::Program::build(stlp_std::vector<amd::Device*, stlp_std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool) () from /usr/lib/libamdocl64.so

#17 0x00007ffff4466ff3 in clBuildProgram () from /usr/lib/libamdocl64.so

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

back trace for kernel 2:

Program received signal SIGSEGV, Segmentation fault.

0x00007ffff4936199 in SC_SCCGCM::GetEarly(SCInst*) () from /usr/lib/libamdocl64.so

(gdb) bt

#0  0x00007ffff4936199 in SC_SCCGCM::GetEarly(SCInst*) () from /usr/lib/libamdocl64.so

#1  0x00007ffff49363a4 in SC_SCCGCM::ComputeEarlyPosition(SCInst*, FuncRegion*) () from /usr/lib/libamdocl64.so

#2  0x00007ffff49c5848 in SC_SCCGVN::GVNSCCInst(SCInst*, SC_SCCVN*) () from /usr/lib/libamdocl64.so

#3  0x00007ffff49c7704 in SCCVNBase<SCInst, SC_CurrentValue>::VNSCCInst(SCInst*) () from /usr/lib/libamdocl64.so

#4  0x00007ffff49c6ff5 in SC_SCCBLK::VNSCCItem(int) () from /usr/lib/libamdocl64.so

#5  0x00007ffff49c7a97 in void SCCVNBase<SCInst, SC_CurrentValue>::ProcessSCC<SC_SCCBLK>(SC_SCCBLK*, int) () from /usr/lib/libamdocl64.so

#6  0x00007ffff4938a5f in SCC_BASE<SCBlock>::SCC(SCBlock*) () from /usr/lib/libamdocl64.so

#7  0x00007ffff49c69ad in SC_SCCBLK::Traversal() () from /usr/lib/libamdocl64.so

#8  0x00007ffff45b07d3 in CompilerBase::GenerateCodeUsingNewIR(void*, bool) () from /usr/lib/libamdocl64.so

#9  0x00007ffff45b6764 in Compiler::Compile(ILProgram*) () from /usr/lib/libamdocl64.so

#10 0x00007ffff45b6ee0 in Compiler::CompileShader(unsigned char*, unsigned char*, unsigned int const*, CompilerExternal*) ()

   from /usr/lib/libamdocl64.so

#11 0x00007ffff45b3227 in CompilerExternal::CompileShader(_SC_SRCSHADER const*, _SC_HWSHADER*) () from /usr/lib/libamdocl64.so

#12 0x00007ffff49cffc2 in scWrapCompileBinarySI(void*, unsigned int, void**, unsigned int*, unsigned int, unsigned int, scWrapOptionEnum*)

    () from /usr/lib/libamdocl64.so

#13 0x00007ffff458df6b in amuCompCompile () from /usr/lib/libamdocl64.so

#14 0x00007ffff458ecee in ddiCompile () from /usr/lib/libamdocl64.so

#15 0x00007ffff44cb91e in gpu::NullKernel::create(stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, void const*, unsigned long) ()

   from /usr/lib/libamdocl64.so

#16 0x00007ffff44d05d3 in gpu::Kernel::create(stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, void const*, unsigned long) ()

   from /usr/lib/libamdocl64.so

#17 0x00007ffff44df058 in gpu::Program::createKernel(stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, gpu::Kernel::InitData const*, stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, bool*, void const*, unsigned long) ()

   from /usr/lib/libamdocl64.so

#18 0x00007ffff44de2ca in gpu::NullProgram::linkImpl(amd::option::Options*) () from /usr/lib/libamdocl64.so

#19 0x00007ffff4479055 in device::Program::build(stlp_std::basic_string<char, stlp_std::char_traits<char>, stlp_std::allocator<char> > const&, char const*, amd::option::Options*) () from /usr/lib/libamdocl64.so

#20 0x00007ffff4489030 in amd::Program::build(stlp_std::vector<amd::Device*, stlp_std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool) () from /usr/lib/libamdocl64.so

#21 0x00007ffff4466ff3 in clBuildProgram () from /usr/lib/libamdocl64.so

The kernel code is proprietary code so I cannot post it on this forum, but I accept sending it to the AMD compiler dev team if need be. Please get in touch if you would like me to do so.

Regards,

Simon

0 Likes
11 Replies
himanshu_gautam
Grandmaster

13.1 is the latest driver. Can you try with the latest driver?

0 Likes
himanshu_gautam
Grandmaster

Hi,

I am not sure if there is a private channel for sending bugs to AMD.

There used to be ticket mechanism earlier, but is no longer present. http://developer.amd.com/support/

Anyways, it is very helpful for us here to confirm & fix a issue, if it is easily reproducible. I would suggest you to do some homework and try coming up with a simple testcase that can be shared here.

Also Try CodeXL and Stream Kernel analyer. You might find some help with your kernel.

0 Likes

Hi Himanshu,

Thank you for your suggestions. Using the latest stable driver (13.1) makes no difference, I still get the same seg faults. I also noted that kernels that compile can take up to 15 seconds to compile, and reading/writing buffers of ~100MB can take up to 20 seconds. This is a ridiculous overhead, considering the computation takes about 20 seconds to complete.

Regards,

Simon

0 Likes

20 seconds for 100MB?

Even if this is via PCIe, It should not be greater than 20 milliseconds.

So this is 1000x slower.

Can you run the bufferbandwidth SDK sample and post the results?

0 Likes

Hi Himanshu,

Thank you for your help, I am sure it is not a hardware problem. Here is the output from bufferbandwidth:

Platform found : Advanced Micro Devices, Inc.

Device  0            Tahiti

Build:               DEBUG

GPU work items:      32768

Buffer size:         33554432

CPU workers:         1

Timing loops:        20

Repeats:             1

Kernel loops:        20

inputBuffer:         CL_MEM_READ_ONLY

outputBuffer:        CL_MEM_WRITE_ONLY

Host baseline (naive):

Timer resolution     1690.84 ns

Page fault           4445.19 ns

CPU read             3.29 GB/s

memcpy()             2.66 GB/s

memset(,1,)          3.69 GB/s

memset(,0,)          3.69 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

      clEnqueueMapBuffer(WRITE):  0.012605 s [     2.66 GB/s ]

                       memset():  0.019332 s       1.74 GB/s

      clEnqueueUnmapMemObject():  0.010468 s [     3.21 GB/s ]

2. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.004686 s     143.22 GB/s

                 verification ok

3. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.005745 s     116.82 GB/s

4. Host mapped read of outputBuffer

       clEnqueueMapBuffer(READ):  0.011147 s [     3.01 GB/s ]

                       CPU read:  0.023514 s       1.43 GB/s

                 verification ok

      clEnqueueUnmapMemObject():  0.000027 s [  1229.44 GB/s ]

Passed!

And here is the output from my program (buffers are written before any computation has started on the device):

Device [0]

    Device ID: 0xb714df0

    Device name: Tahiti

    Device vendor: Advanced Micro Devices, Inc.

    Device compute units: 32 - Clock freq: 925MHz

    Device global mem: 2048MB - Device type: 4

Wrote Buffer of Size: 768 bytes (0MB) in 2.88797 seconds

Wrote Buffer of Size: 13285376 bytes (12MB) in 0.062138 seconds

Wrote Buffer of Size: 127744 bytes (0MB) in 0.054394 seconds

Wrote Buffer of Size: 510976000 bytes (487MB) in 9.27033 seconds

Wrote Buffer of Size: 9216 bytes (0MB) in 0.000461 seconds

Wrote Buffer of Size: 172709888 bytes (164MB) in 0.358325 seconds

Wrote Buffer of Size: 159424512 bytes (152MB) in 45.1698 seconds

Wrote Buffer of Size: 9216 bytes (0MB) in 0.260514 seconds

Wrote Buffer of Size: 1532928 bytes (1MB) in 1.24125 seconds

Wrote Buffer of Size: 79712256 bytes (76MB) in 5.10956 seconds

Notice the 152MB took 45 seconds to write. Sometimes the previous buffer (164MB) takes up to 20 seconds. Not sure why this time it did it in 0.3. In any case, all these numbers are above the PCIe speed you mentioned. It must be a driver issue...(this was tested with 13.2 beta)

Many thanks,

Simon

0 Likes

Hi Simon,

As per Buffer Bandwidth Sample you are getting ~3GBps of read and write speed. Can you check the sample and try to figure out, what you might be doing wrong.

Reaching optimal data transfer speeds is an art in itself . But in any case it should not be that slow. Do check your timers are reliable, and you are not timing something extra with data transfer.

0 Likes

Hi Himanshu,

Thank you for suggesting implementing optimal data transfer. I can assure you that my timers are reliable as shown by running the same executable on 6 other devices from different vendors, and getting data transfer time of at most 0.2 seconds. This is a driver issue..

Regards,

Simon

0 Likes

Hi Simon,

Can you post some code here (in ZIP) that can showcase this long time taken for data transfer. I will try to reproduce and suggest changes that might help you improve the transfer speed.

0 Likes

Hi Himanshu,

It seems like that problem was fixed by reducing the amount of private memory requested in a kernel. One specific kernel that uses a lot of private memory arrays is compiled before any memory transfers are made from the host to the device and it seems that reducing the size of these arrays has fixed the problem of slow host-to-device memory transfers, but I am not too sure why. Does the OpenCL context allocate device private memory for a kernel as it is compiled, or at runtime? If allocation happens at compile time then this could have been the cause of my problems.

Anyway all works well now, so to sum up:

1) Update to newest drivers to avoid compiler segmentation fault.

2) Beware of how  much private memory your kernels are using!

Many thanks for your help.

Regards,

Simon

0 Likes

Hi Himanshu,

I would like to mention that using the latest beta driver (13.2 beta) has fixed the segmentation faults. However, the kernels still take a long time to compile, and memory read/writes are still very lengthy...

Regards,

Simon

0 Likes

Hi Simon,

Thanks for keeping us in loop. It is good to hear that 13.2 fixed the crashes.

Checkout Table 4.2 in AMD OpenCL Programming Guide for getting optimal data transfer though. It might be helpful.

0 Likes