CaptainN
Journeyman III

cl::Program::build() crashes for x86. Fine for GPU.

OpenCL run-time build for x86 crashes with floating point division error.

The project contains a collection of OpenCL kernels stored as const char * arrays. Each const char * holds one or more kernels as a C string. At runtime the strings are collected into cl::Program::Sources and compiled into a program with cl::Program::build(). If the target device is the GPU, everything works fine.

If the target device is the CPU, Program::build() crashes somewhere inside the OpenCL runtime. In an attempt to narrow down the problem, I reduced the program to a single kernel, which compiles for the GPU but crashes when the target device is the CPU. If I take this kernel into a test environment such as the HelloOpenCl sample, it compiles for the CPU fine! Kernel Analyzer 1.8 also generates x86 assembly with no problem.

Program::build() crashes with the following message: Unhandled exception at 0x… in project.exe: 0xC000008E: Floating-point division by zero.

The stack trace starts at mydll!cl::Program::build(… and descends into amdocl.dll, where it eventually crashes (no symbol names, of course, as I don't have the PDBs).

System: HD5870, Catalyst 11.5, SDK 2.4.

CPU: Intel Xeon (I don't think it is relevant here).

Could you please advise how I can resolve the problem?

P.S. I have tried trimming the kernel that crashes on the CPU (it is not a big kernel), and at some point it compiles, but by then it is useless to me. Again, it crashes only when the target device is the CPU. With the GPU as the target everything is OK.

0 Likes
6 Replies
rick_weber
Adept II

Try using this environment variable and see what happens:

AMD_OCL_BUILD_OPTIONS_APPEND="-g -O0"

If it still crashes, then the problem isn't in the optimization.

Your problem is very odd, since dividing by zero is well defined in floating point (you get inf).
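For reference, a minimal C++ sketch of that behaviour, assuming the default floating-point environment in which exceptions are masked. The names DivisionResult and divide_one_by are made up for illustration. With exceptions masked, 1.0f / 0.0f yields inf and only raises a status flag. As far as I know, a trap such as 0xC000008E can only happen if the divide-by-zero exception has been unmasked somewhere in the process.

```cpp
#include <cfenv>
#include <cmath>

// Illustrative result type: the quotient plus whether the IEEE 754
// divide-by-zero status flag was raised by the division.
struct DivisionResult {
    float value;
    bool divide_by_zero_flag;
};

// Computes 1.0f / denominator at run time. The volatile variables stop the
// compiler from folding the division away or moving it past the flag check.
DivisionResult divide_one_by(float denominator) {
    std::feclearexcept(FE_ALL_EXCEPT);   // start from clean status flags
    volatile float numerator = 1.0f;
    volatile float divisor = denominator;
    volatile float quotient = numerator / divisor;
    const bool flagged = std::fetestexcept(FE_DIVBYZERO) != 0;
    return DivisionResult{quotient, flagged};
}
```

Calling divide_one_by(0.0f) in a process that has not unmasked any exceptions should return an infinite value with the flag set rather than crash.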

0 Likes
CaptainN
Journeyman III

rick.weber !!!

This is it! Passing the same options directly to cl::Program::build(), as in program.build(devices, "-g -O0"), also makes the problem disappear.

AMD, please let me know how I can help to get the problem located and fixed in the next release. I can generate a crash dump if that will help.
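In case it helps anyone else, here is a rough sketch of how the workaround could be limited to CPU builds so that GPU builds stay optimized. The helper names build_options_for and build_for, and the target_is_cpu parameter, are my own inventions for illustration. The template only assumes a program type with a build(devices, const char*) member, which is how cl::Program::build() is called above.

```cpp
#include <string>

// Options from this thread: "-g -O0" disables optimization and adds debug
// info. Hypothetical helper: apply them only when building for the CPU.
inline std::string build_options_for(bool target_is_cpu) {
    return target_is_cpu ? "-g -O0" : "";
}

// Works with any type that exposes build(devices, const char*), for example
// cl::Program from the OpenCL C++ bindings. Returns whatever build() returns.
template <typename Program, typename Devices>
auto build_for(Program& program, const Devices& devices, bool target_is_cpu)
    -> decltype(program.build(devices, "")) {
    const std::string options = build_options_for(target_is_cpu);
    return program.build(devices, options.c_str());
}
```

With the real bindings this would be called as build_for(program, devices, true) for a CPU device list. Whether "-O0" is acceptable for production CPU kernels is a separate question.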

Respect.

0 Likes

Sorry for the late reply.

Can you post a test case showing this issue here? You can also file a ticket.

0 Likes

CaptainN,

Did this problem ever get reported to, or resolved by, AMD?

I think I've just hit the same problem [Windows 7 x64, AMD APP SDK 2.4, Firepro v8800]

Unhandled exception at 0x0f5d18bb in XXX.exe: 0xC000008E: Floating-point division by zero.

0F5D18B1  fdivr       dword ptr [esi+4]
0F5D18B4  lea         eax,[esp+118h]
0F5D18BB  fstp        dword ptr [esi+4]

Register ST0 has the value 0 so I suspect the FDIVR instruction is the cause.

I explicitly trap floating-point division by zero in our code:

    unsigned int flags = _controlfp(0, 0);    // get current control word
    flags &= ~(_EM_OVERFLOW | _EM_ZERODIVIDE);    // enable required exceptions
    _controlfp(flags, _MCW_EM);    // set control word
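One workaround that might be worth trying is to mask the exceptions again only for the duration of the build call. This is a sketch under that assumption, not a confirmed fix, and I have not verified it against amdocl.dll. ScopedFpExceptionMask and with_fp_exceptions_masked are hypothetical names. The MSVC branch uses _controlfp_s and _clearfp, the glibc branch uses fedisableexcept and feenableexcept, and on other platforms the guard does nothing.

```cpp
#include <cfenv>
#include <utility>
#if defined(_MSC_VER)
#include <float.h>
#endif

// Hypothetical RAII guard (not part of any SDK): masks all floating-point
// exceptions while alive, then clears pending flags and restores the old mask.
class ScopedFpExceptionMask {
public:
    ScopedFpExceptionMask() {
#if defined(_MSC_VER)
        _controlfp_s(&saved_, 0, 0);              // read the current control word
        unsigned int unused = 0;
        _controlfp_s(&unused, _MCW_EM, _MCW_EM);  // set every exception mask bit
#elif defined(__GLIBC__)
        saved_ = fedisableexcept(FE_ALL_EXCEPT);  // returns the previously enabled set
#endif
    }
    ~ScopedFpExceptionMask() {
#if defined(_MSC_VER)
        _clearfp();                               // drop flags raised while masked
        unsigned int unused = 0;
        _controlfp_s(&unused, saved_, _MCW_EM);   // restore the saved mask bits
#elif defined(__GLIBC__)
        std::feclearexcept(FE_ALL_EXCEPT);        // drop flags raised while masked
        if (saved_ > 0) feenableexcept(saved_);   // re-enable what was enabled before
#endif
    }
    ScopedFpExceptionMask(const ScopedFpExceptionMask&) = delete;
    ScopedFpExceptionMask& operator=(const ScopedFpExceptionMask&) = delete;

private:
#if defined(_MSC_VER)
    unsigned int saved_ = 0;
#else
    int saved_ = 0;
#endif
};

// Runs f() with floating-point exceptions masked and returns its result.
template <typename F>
auto with_fp_exceptions_masked(F&& f) -> decltype(std::forward<F>(f)()) {
    ScopedFpExceptionMask guard;
    return std::forward<F>(f)();
}
```

The build call would then be wrapped as with_fp_exceptions_masked([&] { return program.build(devices); }). Clearing the status flags before restoring the mask matters on x87, where a flag left set can trap as soon as the exception is unmasked again.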

Like you I've tried to narrow down my kernel but without any great success.

I eventually narrowed it down to the point where adding this line to my kernel

distances[point_index] = distance;

would cause the floating point division by zero exception to be generated.

[where: __global float* distances, size_t point_index, float distance]

Also like you, it only fails when the device is CL_DEVICE_TYPE_CPU, and if I try to compile my kernel in a simple test program it compiles correctly, even for CL_DEVICE_TYPE_CPU.

Steve.

0 Likes

steveyoungs,

It is difficult to tell why the problem might be happening.

Please post a test case and your system information: CPU, GPU, SDK, driver, and OS.

0 Likes

System information is easy:

Intel i7 930

Windows 7 x64

AMD APP SDK 2.4

Firepro v8800

driver v8.85

[An NVIDIA GPU and driver and the Intel OpenCL CPU drivers are also installed.]

Although since the problem affected compiling the kernel source to a CPU device, I suspect the GPU and driver are not significant.

As I alluded to in my earlier post, I'm unable to make a simple test case yet - I only get the error when I compile the kernel in our full application. Even then, seemingly insignificant changes to the kernel source can make the problem go away. When I compile the exact same kernel in a simple test program it compiles correctly.

If I don't hear back from CaptainN, or it is not already logged, I'll open a ticket. Do you know the best way to open a ticket?

0 Likes