OpenCL

nibal

Hi,

I am using OpenCL 3.0 on Ubuntu 20.04. getDevices reports only one platform, my CPU, as explained in another post.

When I compile and run the same program, time2freq.c, with the clEnqueueNDRangeKernel call commented out, time2freq runs fine to completion. With the call restored, it segfaults after ~200 of 16000 runs. gdb shows this backtrace:

Thread 3 "time2freq" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff14bf700 (LWP 51626)]
0x00007ffff436f4bd in ?? () from /opt/AMDAPPSDK-3.0/lib/x86_64/libamdocl64.so
(gdb) where
#0 0x00007ffff436f4bd in ?? () from /opt/AMDAPPSDK-3.0/lib/x86_64/libamdocl64.so
#1 0x00007ffff436f5a5 in ?? () from /opt/AMDAPPSDK-3.0/lib/x86_64/libamdocl64.so
#2 0x00007ffff42fd01f in ?? () from /opt/AMDAPPSDK-3.0/lib/x86_64/libamdocl64.so
#3 0x00007ffff436ab4c in ?? () from /opt/AMDAPPSDK-3.0/lib/x86_64/libamdocl64.so
#4 0x00007ffff7c38609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007ffff7b5d133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

Sometimes it also reports "Corrupted size vs. previous size"

The offending code is this:

global[0] = FFT_SZ/8;
global[1] = len/FFT_SZ;
local[0] = global[0];
local[1] = 1;
/* Run kernel */
if ((err = clEnqueueNDRangeKernel(cq, fft_kernel, 2, NULL, global, local, 0, NULL, &ndr)))
{
    error(log, "clEnqueueNDRangeKernel(%s) failed (%s)\n", 0, FL, LN, FN, clError(err));
    return(FAIL);
}
if ((err = waitForEventAndRelease(&ndr)))
{
    error(log, "waitForEventAndRelease(ndr) failed (%s)\n", 0, FL, LN, FN, clError(err));
    return(FAIL);
}

I can provide full demo sources if needed.

Unfortunately, I cannot use valgrind with the kernel: execution terminates at the kernel linking step, which fails. However, I can run it if I comment out the kernel build steps and the kernel itself. In that case valgrind reports nothing suspicious and there is no segfault.

TIA

Nikos

nibal

Issue resolved.

A short read (len = 440) at the end of the signal resulted in global[1] = len/FFT_SZ = 0. That is an illegal value for global[1].
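
A minimal sketch of a guard for this case, so that a short trailing read never reaches clEnqueueNDRangeKernel. It assumes FFT_SZ is 1024 (the post does not give the actual value, only that 440/FFT_SZ is 0), and fft_ndrange is a hypothetical helper name, not part of time2freq.c:

```c
#include <assert.h>
#include <stddef.h>

#define FFT_SZ 1024 /* assumed value; the real one is not given in the post */

/*
 * Fill the 2-D NDRange sizes the same way the original code does.
 * Returns 1 if the sizes are usable, 0 if len holds no complete FFT block
 * (in which case global[1] would be 0 and the enqueue must be skipped).
 */
int fft_ndrange(size_t len, size_t global[2], size_t local[2])
{
    size_t blocks = len / FFT_SZ; /* integer division: 440 / 1024 == 0 */

    assert(global != NULL && local != NULL);

    if (blocks == 0)
        return 0; /* short read: nothing to enqueue */

    global[0] = FFT_SZ / 8;
    global[1] = blocks;
    local[0] = global[0];
    local[1] = 1;
    return 1;
}
```

The caller would enqueue the kernel only when fft_ndrange returns 1, and handle the leftover samples separately (for example by zero-padding them to a full block, if that suits the signal processing).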

OpenCL 1.2 responded with a segfault. I could not debug it since libOpenCL.so was closed source.

Since then I compiled rocm-5.2.0 from source, which also gave me the sources for libOpenCL.so. Using gdb, I was able to track down the problem.

Additionally, I moved to OpenCL 2.0, which just reports the offending global[1] as an error (CL_INVALID_GLOBAL_WORK_SIZE) and doesn't segfault :)

I understand the meaning of global[0] and local[0]. I am still missing the purpose of global[1] and local[1]. Any insights are welcome :)

TIA

Nikos
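
On the global[1] and local[1] question: in the OpenCL execution model, each dimension d of the NDRange has global[d] work-items, split into work-groups of local[d] work-items. With the sizes from the first post (global[1] = len/FFT_SZ, local[1] = 1), dimension 1 would give one work-group per FFT block, so the kernel could use get_global_id(1) as the block index. That reading is an inference from the posted code, not something confirmed in this thread. A host-side sketch of the arithmetic, with hypothetical helper names, assuming no global offset and uniform work-group sizes:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of work-groups along one dimension.
 * Returns 0 for sizes that cannot be enqueued as uniform work-groups
 * (a zero size, or a global size that is not a multiple of the local size).
 */
size_t num_groups(size_t global, size_t local)
{
    if (global == 0 || local == 0 || global % local != 0)
        return 0;
    return global / local;
}

/*
 * Host-side equivalent of get_global_id(d) for a work-item, computed from
 * its work-group index and its index inside the group (global offset 0).
 */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

With global = {128, 16} and local = {128, 1}, num_groups gives 1 group along dimension 0 and 16 along dimension 1, and global_id(5, 1, 0) is 5: the work-items of the sixth work-group all see 5 in dimension 1.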


clEnqueueNDRangeKernel segfaults in OpenCL 3.0