7 Replies Latest reply on Apr 5, 2017 6:47 AM by dipak

    OpenCL printf() limited to 65536 lines?

    mrbpix

      I spent hours debugging a complex OpenCL program with printf() statements and reduced the bug to this very simple, very short program based on the HelloWorld sample that ships with the AMD APP SDK (14 lines of OpenCL, ~170 lines of C): https://pastebin.com/raw/8N4Ms2JT

       

      On Linux, compile with:

      $ g++ -I/opt/AMDAPPSDK-3.0/include  -o main.o -c main.cpp
      $ g++ -o hello main.o -rdynamic -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -lOpenCL

      The host launches the kernel with 8192 threads, and each thread prints 32 integers (with a dummy use of the input/output buffers to prevent the compiler from optimizing out the while loop):

      __kernel void helloworld(__global uchar* in, __global uchar* out)
      {
          uint        tid = get_global_id(0);
          uint        step = 32;
          uint        i = tid * step;
          uint        j = (tid + 1) * step;
          printf("%d: thread started\n", tid);
          while (i < j)
            {
              printf("integer %d\n", i);
              i++;
            }
          out[tid] = in[tid] + i;
      }

      However, the output of some threads is missing (threads 0-1279):

      $ ./hello | grep thread | sort -n | head
      1280: thread started
      1281: thread started
      1282: thread started
      1283: thread started
      1284: thread started
      ...

      If I change the local work size from 64 to 128 (see the local_work_size variable), the output is correct:

      $ ./hello | grep thread | sort -n | head
      0: thread started
      1: thread started
      2: thread started
      3: thread started
      ...

      While counting the lines output by the kernel, I noticed that the root of the problem could be that the OpenCL printf() implementation is limited to printing at most 65536 lines (per kernel invocation?). The output would differ with the local work size because the work groups would be scheduled differently: in one case the first work groups would be scheduled only AFTER 65536 lines have already been printed, which would be why their output is missing. In my original real-world program, it seems that a thread that attempts to printf() a line after 65536 lines have already been printed silently dies/stops.


      System details:
      Ubuntu 16.04, 64-bit
      AMDGPU-PRO driver 16.60.3
      AMD APP SDK 3.0
      Radeon RX 480