14 Replies Latest reply on Jun 5, 2015 1:58 AM by dipak

    possible OpenCl compiler bug

    martinb

      Hi,

       

      I have come across a possible compiler issue which is present in the attached kernels. The two kernels differ how work is mapped to the threads. Requiring a work group size of 256; the difference in the kernels is

       

      210,211c210,211

      <     int divIdx = get_local_id(0) / 8;

      <     int modIdx = get_local_id(0) % 8;

      ---

      >     int divIdx = get_local_id(0) % 32;

      >     int modIdx = get_local_id(0) / 32;

       

      For the first version i get correct reproducible results from an Radeon 6550 for the PowerBornBBKernel, as well as an r290x. For the latter version i still get correct reproducible results from the 6550. However i get incorrect results from r290x which differ between different runs of the kernel with the same input.

       

      Name:                                      Hawaii
        Vendor:                                    Advanced Micro Devices, Inc.
        Device OpenCL C version:                   OpenCL C 1.2
        Driver version:                            1573.4 (VM)
        Profile:                                   FULL_PROFILE
        Version:                                   OpenCL 1.2 AMD-APP (1573.4)

       

        Name:                                      BeaverCreek
        Vendor:                                    Advanced Micro Devices, Inc.
        Device OpenCL C version:                   OpenCL C 1.2
        Driver version:                            1214.3
        Profile:                                   FULL_PROFILE
        Version:                                   OpenCL 1.2 AMD-APP (1214.3)
        • Re: possible OpenCl compiler bug
          dipak

          Hi,

          Please provide a test project (with host code and if required, data set) such that we can run these two kernels at our end and verify the results [also suggest how to compare the results (i.e correct or not)]. A simpler test-case with same behavior would be very helpful for us. Please also let us know other setup details such as OS, SDK, Driver version etc..

           

          Regards,

            • Re: Re: possible OpenCl compiler bug
              martinb

              Hi,

               

              Due to licensing i am not able to share the original code and I havent been able to extract a test case with the same behaviour so far. I have a more detailed output from the correct and incorrect running versions, which you may want to have a look at meanwhile, because i dont really understand the behaviour i observe there. The output comes from two additional print statements in the kernel, see the attach modified kernels. From the output it seems that either a part of the work group with modIdx == 6 does not execute, or that the other part does not respect the barriers.

               

              Also note that the code fails to run Test 1, but passes Test 2 using the same kernel file.

               

              The system producing the incorrect results runs:

              lsb_release -a

              LSB Version:    :core-4.1-amd64:core-4.1-noarch

              Distributor ID: CentOS

              Description:    CentOS Linux release 7.0.1406 (Core)

              Release:        7.0.1406

              Codename:       Core

               

                Name:                                     Hawaii
                Vendor:                                   Advanced Micro Devices, Inc.
                Device OpenCL C version:                  OpenCL C 1.2
                Driver version:                           1573.4 (VM)
                Profile:                                  FULL_PROFILE
                Version:                                  OpenCL 1.2 AMD-APP (1573.4)

               

              fglrxinfo -display

              display:   screen: 0

              OpenGL vendor string: Advanced Micro Devices, Inc.

              OpenGL renderer string: AMD Radeon R9 200 Series

              OpenGL version string: 4.4.13084 Compatibility Profile Context 14.301.1001

                • Re: possible OpenCl compiler bug
                  mrrvlad

                  here is  "bare bones" version of the kernel. I hope It does not have an obvious sync bug... and repros on your end. the nested ifs and this kind of thread assignment are used to fight generous register usage in original kernel.

                  I expect the resulting array to have values close to 255, and in fact it contains mostly 17 and 19, with some rare cases of other combinations.

                   

                  run this with offset 0 global size 256, local size 256

                   

                  I've tried it on a 260x win7 64bit machine with recent codeXL and catalyst packages.

                  code xl 1.5.6571.0

                  catalyst: 14.9 - downloaded from amd.com today.

                   

                  __kernel void

                  BBKernel(__global float* result)

                  {

                      __local float4 level3[64];

                      int threadIdx = get_local_id(0);

                      if (threadIdx < 64)

                      {

                          level3[threadIdx].x = 0;

                          level3[threadIdx].y = 0;

                          level3[threadIdx].z = 0;

                          level3[threadIdx].w = 0;

                      }

                      barrier(CLK_LOCAL_MEM_FENCE| CLK_GLOBAL_MEM_FENCE);

                   

                      const unsigned int atomCnt = 288;

                   

                      int divIdx = (threadIdx & 0x1f);

                      int modIdx = (threadIdx  >> 5);

                      for(unsigned int atomid = 0; atomid < atomCnt; atomid += 32)

                      {

                          if (modIdx <4)

                          {

                              if (modIdx <2)

                              {

                                  if (modIdx == 0)

                                      level3[divIdx].x = 1;

                                  else

                                      level3[divIdx].y = 2;

                              }

                              else

                              {

                                  if (modIdx == 2)

                                      level3[divIdx].z = 4;

                                  else

                                      level3[divIdx].w = 8;

                              }

                          }

                          else

                          {

                              if (modIdx < 6)

                              {

                                  if (modIdx == 4)

                                      level3[32 + divIdx].x = 16;

                                  else

                                      level3[32 + divIdx].y = 32;

                              }

                              else

                              {

                                  if (modIdx == 6)

                                      level3[32 + divIdx].z = 64;

                                  else

                                      level3[32 + divIdx].w = 128;

                              }

                          }

                          barrier(CLK_LOCAL_MEM_FENCE| CLK_GLOBAL_MEM_FENCE);

                          if (threadIdx < 32)

                          {

                              float br = ( level3[threadIdx].x

                                  + level3[threadIdx].y

                                  + level3[threadIdx].z

                                  + level3[threadIdx].w

                                  + level3[32 + (threadIdx)].x

                                  + level3[32 + (threadIdx)].y

                                  + level3[32 + (threadIdx)].z

                                  + level3[32 + (threadIdx)].w);

                   

                              result[atomid + threadIdx] = br;

                          }

                          barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

                      } 

                  }

                • Re: Re: possible OpenCl compiler bug
                  martinb

                  Hi,

                   

                  We finally have a reproducible test case with host code which produces the expected results on the CPU, but produces incorrect results on the GPU. Interestingly, the thread distribution

                   

                  int divIdx = get_local_id(0) / 8;
                  int modIdx = get_local_id(0) % 8;

                   

                  which works for the larger complete kernel, also fails the simple test case.

                • Re: possible OpenCl compiler bug
                  dipak

                  Hi,

                  Here is an update about this issue. The engg. team have identified root cause of the issue. The problem is due to a race condition where multiple work items modify the same vector at the same time. The compiler does optimizations under the assumption that there is no race condition. If a program contains race condition, the result is undefined based on what optimization is done. It may work or may not. In order to avoid such race condition, one should use some kind of prevention such as volatile type, atomic_store, or other atomic methods etc. So, please modify the code accordingly.

                   

                  Regards,