clEnqueueNDRangeKernel segfaults in OpenCL 3.0
Hi,
I am using OpenCL 3.0 on Ubuntu 20.04. getDevices reports only one platform, my CPU, as explained in another post.
When I compile and run the same program, time2freq.c, with the clEnqueueNDRangeKernel call commented out,
time2freq runs fine to completion. With the call restored, it segfaults after ~200 of 16000 runs. gdb shows this backtrace:
Thread 3 "time2freq" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff14bf700 (LWP 51626)]
0x00007ffff436f4bd in ?? () from /opt/AMDAPPSDK-3.0/lib/x86_64/libamdocl64.so
(gdb) where
#0 0x00007ffff436f4bd in ?? () from /opt/AMDAPPSDK-3.0/lib/x86_64/libamdocl64.so
#1 0x00007ffff436f5a5 in ?? () from /opt/AMDAPPSDK-3.0/lib/x86_64/libamdocl64.so
#2 0x00007ffff42fd01f in ?? () from /opt/AMDAPPSDK-3.0/lib/x86_64/libamdocl64.so
#3 0x00007ffff436ab4c in ?? () from /opt/AMDAPPSDK-3.0/lib/x86_64/libamdocl64.so
#4 0x00007ffff7c38609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007ffff7b5d133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)
Sometimes it also reports "Corrupted size vs. previous size"
The offending code is this:
global[0] = FFT_SZ/8;
global[1] = len/FFT_SZ;   /* integer division */
local[0] = global[0];
local[1] = 1;
/* Run kernel */
if ((err = clEnqueueNDRangeKernel(cq, fft_kernel, 2, NULL, global, local, 0, NULL, &ndr)))
{
    error(log, "clEnqueueNDRangeKernel(%s) failed (%s)\n", 0, FL, LN, FN, clError(err));
    return(FAIL);
}
if ((err = waitForEventAndRelease(&ndr)))
{
    error(log, "waitForEventAndRelease(ndr) failed (%s)\n", 0, FL, LN, FN, clError(err));
    return(FAIL);
}
I can provide full demo sources if needed.
Unfortunately, I cannot use valgrind with the kernel enabled: execution terminates when the kernel linking steps fail.
However, valgrind does run if I comment out the kernel build steps and the kernel itself. In that case it reports
nothing suspicious and there is no segfault.
TIA
Nikos
Issue resolved.
A short read (len = 440) at the end of the signal resulted in global[1] = 0,
because len/FFT_SZ is an integer division and len was smaller than FFT_SZ.
That is an illegal value for global[1].
With OpenCL 1.2 this resulted in a segfault. I could not debug it because libOpenCL.so was
closed source.
Since then I have compiled rocm-5.2.0 from source, which also gave me
the sources for libOpenCL.so. Using gdb, I was able to track down the problem.
I also moved to OpenCL 2.0, which just reports the offending global[1]
as an error (CL_INVALID_GLOBAL_WORK_SIZE) and doesn't segfault :)
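One way to avoid depending on how the runtime reacts to a zero work size is to check the sizes on the host before enqueueing. The following is only a minimal sketch: fft_ndrange_sizes is a hypothetical helper, not part of time2freq.c or of OpenCL, and FFT_SZ is passed in as fft_sz because its actual value is not given here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original program): fills global[] and
 * local[] the same way as the snippet in the question, but rejects reads
 * that do not contain at least one complete FFT block.
 * Returns 0 on success, -1 if any work size would be zero. */
int fft_ndrange_sizes(size_t len, size_t fft_sz,
                      size_t global[2], size_t local[2])
{
    assert(global != NULL && local != NULL);

    if (fft_sz < 8 || len < fft_sz)
        return -1;              /* fft_sz/8 or len/fft_sz would be 0 */

    global[0] = fft_sz / 8;
    global[1] = len / fft_sz;   /* integer division: complete blocks only */
    local[0]  = global[0];
    local[1]  = 1;
    return 0;
}
```

When the helper returns -1, the caller could skip the enqueue for that short read (or pad the data to a full block, if that suits the application) instead of calling clEnqueueNDRangeKernel.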
I understand the meaning of global[0] and local[0], but I still do not see
what global[1] and local[1] are for. Any insights are welcome :)
TIA
Nikos
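On the open question about global[1] and local[1]: the third argument of clEnqueueNDRangeKernel (2 here) makes the NDRange two-dimensional, so global[1] is the number of work-items in dimension 1 and local[1] is the work-group size in dimension 1. With local[0] = global[0] and local[1] = 1, each work-group spans one full row, giving global[1] work-groups in total. Presumably that is one work-group per FFT block, but that is an assumption about the kernel, which is not shown. The sketch below emulates on the host what the kernel built-ins would return for a given work-item. The type and function names are illustrative only, and it assumes no global offset and a global size that is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative type, not part of the OpenCL API. */
typedef struct {
    size_t global_id[2];  /* what get_global_id(d) would return */
    size_t group_id[2];   /* what get_group_id(d) would return  */
    size_t local_id[2];   /* what get_local_id(d) would return  */
} work_item_ids;

/* Ids of the work-item at position (x, y) in a 2-D NDRange. */
work_item_ids ids_for(size_t x, size_t y,
                      const size_t global[2], const size_t local[2])
{
    work_item_ids w;
    const size_t pos[2] = { x, y };

    for (int d = 0; d < 2; d++) {
        assert(local[d] > 0 && global[d] % local[d] == 0);
        assert(pos[d] < global[d]);
        w.global_id[d] = pos[d];
        w.group_id[d]  = pos[d] / local[d];  /* which work-group      */
        w.local_id[d]  = pos[d] % local[d];  /* index inside the group */
    }
    return w;
}
```

For example, with global = {128, 4} and local = {128, 1}, the work-item at (5, 3) has get_global_id(1) == 3 and get_group_id(1) == 3 (the fourth row) and get_local_id(0) == 5 (the sixth work-item in that row), while get_local_id(1) is always 0.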
