
doug65536

OpenCL compiler hangs for long time and eventual AV in kernel cl::Program::build

Here are the details gathered by my program:

platformProfile=FULL_PROFILE, err=0

platformVendor=Advanced Micro Devices, Inc., err=0

platformName=AMD Accelerated Parallel Processing, err=0

platformVer=OpenCL 1.1 AMD-APP (898.1), err=0

platformExt=cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing, err=0

OpenCL version=1.1

Windows 7 Ultimate 64 bit.

I just updated my drivers today. The issue happened in the last driver too.

I've been trying many different ways to get correct results on AMD GPU OpenCL.

My program works fine on:

- NVidia GPU

- Intel CPU (OpenCL)

- AMD CPU OpenCL

...but fails on AMD GPU OpenCL (HD5770). "Fails" means that it either produces all zeros as the result, or access-violates (if the kernel uses fma instead of mad).

The kernel source code and host program source are attached. I've also attached a mini-dump of the process at the AV.
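For context, the host-side build call that hangs and eventually AVs looks roughly like this (a minimal sketch using the cl.hpp C++ wrapper with exceptions enabled; the attached host source is the real code and may differ):

    #define __CL_ENABLE_EXCEPTIONS
    #include <CL/cl.hpp>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Sketch only: build the program and dump the build log on failure.
    void buildKernel(cl::Context &context,
                     std::vector<cl::Device> &devices,
                     const std::string &src)
    {
        cl::Program::Sources sources;
        sources.push_back(std::make_pair(src.c_str(), src.size()));
        cl::Program program(context, sources);
        try {
            // This is the call that hangs for a long time and then AVs.
            program.build(devices);
        } catch (cl::Error &e) {
            // When the compiler crashes the process we never get here;
            // otherwise this prints the build log for the first device.
            std::cerr << "build failed (" << e.err() << "):\n"
                      << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0])
                      << std::endl;
            throw;
        }
    }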

If you replace the fma calls with mad calls, it doesn't crash the compiler, but in that case all the results are zeros. I've tried using constant memory and pointers, and I've tried copying to local memory and doing the compute from there. In all of those cases, one of the four implementations did not work correctly. The current version generates fully inlined code.
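For reference, the fma-to-mad swap mentioned above amounts to this (illustrative lines with made-up names, not the attached kernel):

    // Hypothetical accumulation step:
    t = fma(pixels[i].s0, dctLookup[j], t);   // eventually AVs the compiler on the HD5770
    t = mad(pixels[i].s0, dctLookup[j], t);   // compiles, but results come back all zeros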

I know there are faster ways to do DCTs. This is an experimental program to try out different OpenCL techniques, to experiment with multi-GPU and overlapping reads/writes/executes, and to experiment with out-of-order queues. I got blocked before implementing much of this, trying to work around issues I encountered on one platform or another.

Please investigate the instability and bad code generation on AMD GPU.

Thanks!


Note that the HD5770 doesn't have native FMA support, so that function will be emulated, and it takes many instructions to emulate FMA. There are macros, FP_FAST_FMAF and FP_FAST_FMA, which tell you whether fma is fast for float and double, respectively. See the OpenCL specification; it has a description like:

The FP_FAST_FMAF macro indicates whether the fma function is fast compared with direct code for single precision floating-point. If defined, the FP_FAST_FMAF macro shall indicate that the fma function generally executes about as fast as, or faster than, a multiply and an add of float operands.
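So a kernel can guard the fma path on that macro and fall back otherwise, for example (sketch only, not taken from the attached kernel):

    #ifdef FP_FAST_FMAF
        t = fma(a, b, t);      // hardware fma is at least as fast as mul+add
    #else
        t = mad(a, b, t);      // or plain a * b + t on hardware like the HD5770
    #endif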

Since we have to inline all those fma function calls, it's going to be a *very* large kernel, so that's probably why it's crashing.


On top of that, you're doing fma of float8 values and there are 4096 fma calls.  That is a *lot* of inlined fma functions.


Does that make it ok for the OpenCL compiler to crash my process? Please say you are still treating this as a bug.


No, I wasn't saying it's okay, just indicating that this kernel is a lot larger than it looks. Even with scalars, fma is expensive when the H/W doesn't support it.

That said, it successfully compiled for me but it took a very long time!  Looks like a forthcoming release will fix the problem.


Excellent! I really appreciate you taking the time to do the compile.


Actually, the fma calls are not float8; all the input and output arguments are scalar if you look closely. Only the first parameter takes a scalar value from the pixels[] vector.

Since you raised concern about the inlining of fma, I changed it to just do "t += pixels[A].sB * dctLookup[C]". Now it doesn't crash, every error code returned indicates success, and it produces all zeros as a result.

Again, this works fine on NVidia GPU, Intel CPU, and AMD CPU OpenCL, but on the AMD GPU it either crashes the process or gets an all-zero result.

Are you accepting this as a bug to be fixed or are you dismissing it?

Thanks


Ok, so you didn't like the inlining. Here is the same kernel using a lookup table in local memory, and a loop - no inlining.
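For anyone skimming without the attachment, the shape of it is roughly this (a simplified sketch; the sizes and indexing are illustrative, the attached kernel is the real thing):

    __kernel void dct_pass(__global const float *pixels,
                           __global const float *dctLookup,
                           __global float *result)
    {
        __local float lut[64];

        // Each work-group copies the lookup table into local memory once.
        for (int i = get_local_id(0); i < 64; i += get_local_size(0))
            lut[i] = dctLookup[i];
        barrier(CLK_LOCAL_MEM_FENCE);

        int gid = get_global_id(0);
        float t = 0.0f;
        for (int k = 0; k < 8; ++k)             // plain loop, nothing inlined
            t += pixels[gid * 8 + k] * lut[(gid % 8) * 8 + k];
        result[gid] = t;
    }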

Same thing: it works on NVidia/GPU, Intel/CPU, and AMD/CPU, and gets an (incorrect) all-zero result on AMD/GPU.


Thanks for reporting this. There is a known issue where extremely large kernels will not work on our implementation. This is a design flaw of our intermediate language and will not get fixed anytime soon. I will look into this issue to see if this is the same problem.

Great. You may want to look at the attachments from my NVidia/GPU, Intel/CPU, and AMD/CPU runs.

Expected results:

-26 -3 -6 2 2 -1 0 0 0 -2 -4 1 1 0 0 0

No pressure, I'm not expecting you to reply with a solution. Just trying to give you guys (and gals) a good repro to investigate.

Thanks.

