9 Replies Latest reply on Apr 19, 2013 8:36 AM by himanshu.gautam

    Bug in AMD OpenCL implementation?

    ddemidov

      Hello,

       

      I am not sure if this is the right place for OpenCL bug reports, so please forgive me if I am wrong. Here is the link to the simple program that should add two vectors multiple times: https://gist.github.com/ddemidov/5398174. The source is also attached here for convenience.

       

      This simple program, when compiled with

       

          g++ -std=c++0x -o vector_sum vector_sum.cpp -lOpenCL

       

      outputs 4096 == 4096 on NVIDIA and Intel OpenCL implementations. When, however, it is executed on AMD GPUs (the ones I tested are HD 7970 'Tahiti' and HD 7770 'Capeverde'), it may output 4096 == 4081, 4096 == 4082, or something else.

       

      Adding call to cl::CommandQueue::finish() after each kernel launch (but not after the complete loop) solves the issue, but should be unnecessary according to standard.

       

      Replacing definition of global_size at line 99 with

       

          size_t global_size = alignup(N, workgroup_size);

       

      also helps, but is equally unnecessary.

       

      The current operating system is Gentoo linux, kernel version 3.7.1. ati-drivers package has version 13.1. But I have observed this behavior on several machines for several consecutive versions of ati-drivers (and several linux kernels).

       

      Is this a bug in AMD OpenCL, or am I doing something wrong?

        • Re: Bug in AMD OpenCL implementation?
          dmeiser

          You have a race condition in your code. Each thread in the opencl kernel loops over all entries and the array A and writes to that same array (both pointers a and c alias A). The problem is that these updates of A are not atomic. The following is possible:

          - Thread 1 reads from a[0]

          - Thread 2 reads from a[0]

          - Thread 1 writes a[0]

          - Thread 2 writes a[0]

           

          This would result in a[0] ==1 rather than a[0]==2.

           

          The bottom line is that the behavior of your code is undefined.

            • Re: Bug in AMD OpenCL implementation?
              ddemidov

              But each thread read single element from input arrays (A, and B), and writes same element to the A. The stride is equal to the global size so I don't see any race conditions here.

               

              So,

              thread 1 reads from a[0]

              thread 2 reads from a[1]

              thread 1 writes to a[0]

              thread 2 writes to a[1]

                • Re: Bug in AMD OpenCL implementation?
                  dmeiser

                  But that's not the code you wrote. thread1 does not only write to a[0], it writes to a[0], a[1], ...

                   

                  Did you mean to write the following in your kernel?

                   

                          cl::Program program = build_program(context, device,

                              "kernel void add(\n"

                              "       ulong n,\n"

                              "       global const int *a,\n"

                              "       global const int *b,\n"

                              "       global int *c\n"

                              "       )\n"

                              "{\n"

                              "    size_t i = get_global_id(0);"

                              "     c[i] = a[i] + b[i];\n"

                              "}\n"

                              );

                    • Re: Bug in AMD OpenCL implementation?
                      ddemidov

                      Well, here is the code I wrote:

                       

                      for(size_t i = get_global_id(0); i < n; i += get_global_size(0)) {

                                  c[i] = a[i] + b[i];

                      }

                       

                      So i=0 for thread 0, i = 1 for thread 1, etc. This is almost equivalent to what you wrote, only I check for array overrun and each workgroup does several chunks of the arrays (and the distance between the chunks is get_global_size(0)).

                       

                      edit: please mind the 'i += get_global_size(0)' part.

                        • Re: Bug in AMD OpenCL implementation?
                          himanshu.gautam

                          A few issues I found:

                          __global const int *a,\n"

                          __global const int *b,\n"

                          __global int *c\n"

                          Here you are using same memory for buffer a & c from host side. I would expect a compile time error, but probably compiler is not able to catch it, because of different names for the same buffer.

                          EDIT: Compiler cannot catch this issue, as it has no idea to what a,b,c are set using clSetKernelArg.

                           

                          Also I found the queue.finish() command commented in the code. Check this code, it should give right results.

                           

                          Message was edited by: Himanshu Gautam

                            • Re: Bug in AMD OpenCL implementation?
                              ddemidov

                              Hello,

                               

                              Thank you for the answer. The const qualifiers should not be a problem. But anyway, the code is the result of my attempt to construct a minimal example that shows the issue. I have originally met this behavior in VexCL library I develop (https://github.com/ddemidov/vexcl). OpenCL kernels are generated automatically there and global kernel parameters are passed without const qualifiers there anyway. For example, here is a real kernel that I also observe the problem with:

                               

                              #if defined(cl_khr_fp64)

                              #  pragma OPENCL EXTENSION cl_khr_fp64: enable

                              #elif defined(cl_amd_fp64)

                              #  pragma OPENCL EXTENSION cl_amd_fp64: enable

                              #endif

                              kernel void plus_term_plus_term_multiplies_term_term_(

                                        ulong n,

                                        global double *res,

                                        global double *prm_1_1,

                                        global double *prm_1_2,

                                        global double *prm_1_3,

                                        global double *prm_1_4

                              )

                              {

                                        for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {

                                                  res[idx] = ( prm_1_1[idx] + ( prm_1_2[idx] + ( prm_1_3[idx] * prm_1_4[idx] ) ) );

                                        }

                              }

                               

                              The kernel is constructed as a result of this expression: https://github.com/ddemidov/vexcl/blob/master/examples/benchmark.cpp#L50

                               

                              Another example of this behavior is discussed here: https://github.com/ddemidov/vexcl/commit/3d23602498172040e6f557d2d911c7ecb940e45a#commitcomment-2998741.

                               

                              Single queue.finish() at the end of the loop does not help (and anyway, the queue is created with default parameters, so reading buffer to host memory should only happen after all preceding kernels have completed).

                               

                              I have to say that right now I can not reproduce the issue with my original code. I rebooted my test machine yesterday, so this behavior probably is a result of a GPU entering some buggy state.

                                • Re: Bug in AMD OpenCL implementation?
                                  ddemidov

                                  An update:

                                   

                                  I login to the test machine by ssh. By default, DISPLAY environment variable is not exported, and I could not reproduce the problem. If, however, I do export DISPLAY=:0 (there is an X session running on the box locally), the problem reappears.

                                   

                                  Himanishu, I've tried both of your suggestions (removing const qualifiers and adding queue.finish() between the loop and buffer read), and, as I said, the problem is still present.

                                    • Re: Bug in AMD OpenCL implementation?
                                      himanshu.gautam

                                      I was able to reproduce your issue with 13.1 driver on a tahiti Ubuntu 64 bit machine. But after that I installed the internal driver. And I am not able to reproduce any failure using ssh on a linux machine. I tried with/without exporting the DISPLAY variable.

                                      So I think the issue no longer exists with internal drivers. Expect the bug to be fixed with the new driver release.

                                • Re: Bug in AMD OpenCL implementation?
                                  dmeiser

                                  I see. I missed the += get_global_size(0).