I spent hours reducing a bug in a complex OpenCL using printf() statements to debug to this very simple, very short program based on the HelloWorld sample that ships with the AMD APP SDK (14 lines of OpenCL, ~170 lines of C): https://pastebin.com/raw/8N4Ms2JT
On Linux: compile with:
$ g++ -I/opt/AMDAPPSDK-3.0/include -o main.o -c main.cpp
$ g++ -o hello main.o -rdynamic -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -lOpenCL
The kernel launches 8192 threads and each thread prints 32 integers (with a dummy use of the input/output buffers to prevent the compiler from optimizing out the while loop):
__kernel void helloworld(__global uchar* in, __global uchar* out)
{
uint tid = get_global_id(0);
uint step = 32;
uint i = tid * step;
uint j = (tid + 1) * step;
printf("%d: thread started\n", tid);
while (i < j)
{
printf("integer %d\n", i);
i++;
}
out[tid] = in[tid] + i;
}
However the output of some threads is missing (threads 0-1279 are missing):
$ ./hello | grep thread | sort -n | head
1280: thread started
1281: thread started
1282: thread started
1283: thread started
1284: thread started
...
If I change the local work size from 64 to 128 (see local_work_size variable) the output is correct:
$ ./hello | grep thread | sort -n | head
0: thread started
1: thread started
2: thread started
3: thread started
...
While counting the lines output by the kernel, I noticed it seems the root of the problem could be that the OpenCL printf() implementation is limited to printing at most 65536 lines (per kernel invocation?). The output would be different depending on the local work size because in one case the first work groups would be scheduled differently (eg. scheduled AFTER 65536 lines have been printed out) and this would be why their output is missing. In my original real-world program it seems that if a thread attempts to printf() a line after 65536 lines have already been printed out, then it silently dies/stops.
System details:
Ubuntu 16.04, 64-bit
AMDGPU-PRO driver 16.60.3
AMD APP SDK 3.0
Radeon RX 480
Solved! Go to Solution.
Hi Marc,
There is a size limit for printf buffer. The actual buffer size is implementation dependent and may vary device to device. As per OpenCL 1.2 spec, the minimum size is 1MB (for FULL profile). You can find out the size by calling clGetDeviceInfo with parameter CL_DEVICE_PRINTF_BUFFER_SIZE. Another point is, depending on the implementation, this printf buffer might be shared with other kernels too (if running any), thus may reduce the effective size for any particular kernel.
Regarding the order of printf output, there is no guarantee and it can differ run to run even for same kernel settings. That's why sometimes it is hard to debug using printf if many threads are running and generating lots of outputs.
Regards,
Hi Marc,
There is a size limit for printf buffer. The actual buffer size is implementation dependent and may vary device to device. As per OpenCL 1.2 spec, the minimum size is 1MB (for FULL profile). You can find out the size by calling clGetDeviceInfo with parameter CL_DEVICE_PRINTF_BUFFER_SIZE. Another point is, depending on the implementation, this printf buffer might be shared with other kernels too (if running any), thus may reduce the effective size for any particular kernel.
Regarding the order of printf output, there is no guarantee and it can differ run to run even for same kernel settings. That's why sometimes it is hard to debug using printf if many threads are running and generating lots of outputs.
Regards,
Thank you, I somehow missed this 1MB limit when reading the docs. I didn't mention it but the test program linked in my first post actually segfaults. Is this normal behavior when exceeding 1MB printed?
It should not generate a sefault for exceeding the buffer limit. The printf outputs might be ignored i.e. not get printed.
Regards,
Well I hereby submit a bug report then The code I attached is pretty much the HelloWorld sample from the AMD APP SDK and it segfaults.
Thanks for reporting it and sharing the repro. May I assume below points regarding the attached code?
Please confirm.
1. Yes.
2. Yes (no segfault with 1536 threads, segfaults with 2048 threads.)
Thanks for the confirmation. I've already opened a ticket against it.
Regards,