11 Replies Latest reply on Mar 4, 2014 11:36 PM by prao

    Private memory corruption?

    yoyo

      Hi.

       

      I'm using version 13.9 of Catalyst drivers on Linux.

      I've got an OpenCL program that was tested on an Nvidia GPU and works quite well, but it behaves strangely when I'm trying to run it on Radeon HD 7970. That is, when I pass by value an argument to a kernel that uses a private array, at some point argument's private copy becomes corrupted. I can reproduce it with the following simple code:

       

      #include <iostream>
      #include <string>
      #include <CL/cl.hpp>
      
      using std::cout;
      using std::endl;
      using std::string;
      
      typedef struct
      {
        double x[2];
      } DDouble;
      
      int main()
      {
        VECTOR_CLASS<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        cl::Platform pl = platforms[0];
      
        VECTOR_CLASS<cl::Device> devices;
        pl.getDevices(CL_DEVICE_TYPE_ALL, &devices);
        cl::Device dev = devices[0];
      
        string tmp;
        pl.getInfo(CL_PLATFORM_NAME, &tmp);
        cout << "Platform: " << tmp << endl;
      
        dev.getInfo(CL_DEVICE_NAME, &tmp);
        cout << "Device: " << tmp << endl;
      
        cl::Context context = cl::Context(VECTOR_CLASS<cl::Device>(1, dev));
        cl::CommandQueue queue = cl::CommandQueue(context, dev);
      
        string src_string =
          "#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
          "typedef struct { double x[2]; } DDouble;      \n"
          "                                              \n"
          "#define N 12                                  \n"
          "__kernel void A(__global double* p,           \n"
          "                DDouble dd)                   \n"
          "{                                             \n"
          "  double tmp[N];                              \n"
          "                                              \n"
          "  tmp[0] = 3.14159;                           \n"
          "  for(int i = 1; i < N; i++)                  \n"
          "     tmp[i] = tmp[i-1]+1.0/tmp[i-1];          \n"
          "                                              \n"
          "  p[0] = tmp[N-1];                            \n"
          "  p[1] = dd.x[0];                             \n"
          "}                                             \n";
      
        cl::Program::Sources src(1, std::make_pair(src_string.c_str(),
                                                   src_string.size()));
        cl::Program program(context, src);
        program.build(VECTOR_CLASS<cl::Device>(1, dev));
      
        string buildLog;
        program.getBuildInfo(dev, CL_PROGRAM_BUILD_LOG,
                             &buildLog);
        cout << "Build log:" << endl
             << " ******************** " << endl
             << buildLog << endl
             << " ******************** " << endl;
      
        cl::Kernel kernel(program, "A");
      
        DDouble dd;
        dd.x[0] = 1.2345;
        dd.x[1] = 5.4321;
      
        cl::Buffer buff(context, CL_MEM_READ_WRITE, 2*sizeof(double));
      
        kernel.setArg(0, buff);
        kernel.setArg(1, dd);
        queue.enqueueTask(kernel);
      
        queue.finish();
      
        double* map = (double*)queue.enqueueMapBuffer(buff, CL_TRUE,
                                                      CL_MAP_READ, 0,
                                                      2*sizeof(double));
        cout << "p[0] = " << map[0] << endl
             << "p[1] = " << map[1] << endl;
        queue.enqueueUnmapMemObject(buff, map);
      
        return 0;
      }
      
      

       

      Output:

       

      Platform: AMD Accelerated Parallel Processing
      Device: Tahiti
      Build log:
      ********************
      "/tmp/OCLjM72Hc.cl", line 1: warning: OpenCL extension is now part of core
        #pragma OPENCL EXTENSION cl_khr_fp64 : enable
                                 ^
      
      
      ********************
      p[0] = 5.69946
      p[1] = 5.13595
      
      

       

      I was, of course, expecting to get p[1]=1.2345. It gives correct results for N=11, however.

       

      My questions are:

      1. Is it a compiler's bug or an intended behaviour?

      2. Is there a way to make compiler issue a warning when it's going to produce a code that overwrites data that should remain unchanged? I can rewrite some of my kernels to use more local memory instead of private, but I want to be sure that if kernel compiles without warnings it will run correctly.

        • Re: Private memory corruption?
          nou

          Even simpler kernel show this bug

          "#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"  
          "typedef struct { double x,y; } DDouble;      \n"  
          "                                              \n"  
          "__kernel void A(__global double* p,           \n"  
          "                __private DDouble dd)         \n"  
          "{                                             \n"   
          "                                              \n"  
          "  p[0] = dd.x;                                \n"
          "  p[1] = dd.y;                                \n"  
          "}                                             \n"; 
          

          If you change double in struct definition to float it is working as intended. I tested it only on CPU device.

            • Re: Re: Private memory corruption?
              yoyo

              Can't reproduce -- your kernel works fine on my GPU.

              Also the following version seems to be unaffected:

                string src_string =
                  "#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
                  "                                              \n"
                  "#define N 12                                  \n"
                  "__kernel void A(__global double* p,           \n"
                  "                double2 dd)                   \n"
                  "{                                             \n"
                  "  double tmp[N];                              \n"
                  "                                              \n"
                  "  tmp[0] = 3.14159;                           \n"
                  "  for(int i = 1; i < N; i++)                  \n"
                  "     tmp[i] = tmp[i-1]+1.0/tmp[i-1];          \n"
                  "                                              \n"
                  "  p[0] = tmp[N-1];                            \n"
                  "  p[1] = dd.s0;                               \n"
                  "}                                             \n";
              
            • Re: Private memory corruption?
              drallan
              ******************** 
              1. p[0] = 5.69946
              2. p[1] = 5.13595

               

              I was, of course, expecting to get p[1]=1.2345. It gives correct results for N=11, however.

               

              My questions are:

              1. Is it a compiler's bug or an intended behaviour?

              2. Is there a way to make compiler issue a warning when it's going to produce a code that overwrites data that should remain unchanged? I can rewrite some of my kernels to use more local memory instead of private, but I want to be sure that if kernel compiles without warnings it will run correctly.

               

              I looks like a a compiler bug. The value 5.13595 is a result from about loop N=9, stored in register v[0:1].

              When the loop is present, the compiler "forgets" to load dd.x[1] into v[0:1] and then writes v[0:1] to location p[1].

               

              Below is the assembly code showing the compiler error. The last time v[0:1] is loaded it contains value from loop 9, the last two statements at the end write to p[0] and p[1]. v[0:1] is written to p[1]. The second code box shows the program with out the loop, and dd.x[0] is loaded and saved correctly.

               

              //------------------------------------------------------------------------------------
              // Last part of program, double version
              //------------------------------------------------------------------------------------
              
              //   { .... CODE .... }
              
                v_ldexp_f64   v[5:6], 1.0, v5                             // 000007AC: D2D00005 00020AF2
                v_fma_f64     v[3:4], v[3:4], v[5:6], v[9:10]             // 000007B4: D2980003 04260B03
                v_ldexp_f64   v[2:3], v[3:4], v2                          // 000007BC: D2D00002 00020503
                v_div_fixup_f64  v[2:3], v[2:3], v[0:1], 1.0              // 000007C4: D2C00002 03CA0102
              //--------------------------------------------------------------------
               //   v[0:1] loaded with  division result loop N-2
              //--------------------------------------------------------------------
                v_add_f64     v[0,1], v[0:1], v[2:3]     <--------        // 000007CC: D2C80000 00020500
                v_frexp_exp_i32_f64  v[2:3], v[0:1]                       // 000007D4: 7E047900
                v_frexp_mant_f64  v[3:4], v[0:1]                          // 000007D8: 7E067B00
                v_and_b32     v4, 0x7fffffff, v4                          // 000007DC: 360808FF 7FFFFFFF
                v_rcp_f64     v[5:6], v[3:4]                              // 000007E4: 7E0A5F03
                v_fma_f64     v[7:8], -v[3:4], v[5:6], 1.0                // 000007E8: D2980007 23CA0B03
                v_fma_f64     v[5:6], v[5:6], v[7:8], v[5:6]              // 000007F0: D2980005 04160F05
                v_fma_f64     v[7:8], -v[3:4], v[5:6], 1.0                // 000007F8: D2980007 23CA0B03
                v_fma_f64     v[5:6], v[5:6], v[7:8], v[5:6]              // 00000800: D2980005 04160F05
                v_fma_f64     v[7:8], v[3:4], v[5:6], -2.0                // 00000808: D2980007 03D60B03
                v_mul_f64     v[5:6], v[5:6], -v[7:8]                     // 00000810: D2CA0005 40020F05
                v_fma_f64     v[7:8], v[3:4], v[5:6], -2.0                // 00000818: D2980007 03D60B03
                v_mul_f64     v[5:6], v[5:6], -v[7:8]                     // 00000820: D2CA0005 40020F05
                v_mul_f64     v[7:8], v[5:6], 0.5                         // 00000828: D2CA0007 0001E105
                v_fma_f64     v[9:10], v[3:4], -v[7:8], 0.5               // 00000830: D2980009 43C20F03
                v_fma_f64     v[7:8], v[5:6], v[9:10], v[7:8]             // 00000838: D2980007 041E1305
                v_fma_f64     v[9:10], v[3:4], -v[5:6], 1.0               // 00000840: D2980009 43CA0B03
                v_mul_f64     v[9:10], v[9:10], v[5:6]                    // 00000848: D2CA0009 00020B09
                v_fma_f64     v[3:4], v[3:4], -v[7:8], 0.5                // 00000850: D2980003 43C20F03
                v_mul_f64     v[9:10], v[9:10], v[3:4]                    // 00000858: D2CA0009 00020709
                v_fma_f64     v[3:4], v[5:6], v[3:4], v[9:10]             // 00000860: D2980003 04260705
                v_or_b32      v4, 0x00040000, v4                          // 00000868: 380808FF 00040000
                v_sub_i32     v2, vcc, 1, v2                              // 00000870: 4C040481
                v_min_i32     v5, 0x00000300, v2                          // 00000874: 220A04FF 00000300
                v_sub_i32     v6, vcc, 0, v5                              // 0000087C: 4C0C0A80
                v_sub_i32     v2, vcc, v2, v5                             // 00000880: 4C040B02
                v_ldexp_f64   v[9:10], v[7:8], v5                         // 00000884: D2D00009 00020B07
                v_ldexp_f64   v[11:12], v[9:10], v6                       // 0000088C: D2D0000B 00020D09
                v_add_f64     v[6:7], v[7:8], -v[11:12]                   // 00000894: D2C80006 40021707
                v_add_f64     v[3:4], v[6:7], v[3:4]                      // 0000089C: D2C80003 00020706
                v_ldexp_f64   v[5:6], 1.0, v5                             // 000008A4: D2D00005 00020AF2
                v_fma_f64     v[3:4], v[3:4], v[5:6], v[9:10]             // 000008AC: D2980003 04260B03
                v_ldexp_f64   v[2:3], v[3:4], v2                          // 000008B4: D2D00002 00020503
                v_div_fixup_f64  v[2:3], v[2:3], v[0:1], 1.0              // 000008BC: D2C00002 03CA0102
              
              //   { .... CODE .... }
              
              //--------------------------------------------------------------------
               //   v[0:1] last reference to v[0:1] 
              //--------------------------------------------------------------------
              
                v_add_f64     v[2:3], v[0:1], v[2:3]                      // 000008C4: D2C80002 00020500
                v_frexp_exp_i32_f64  v[4:5], v[2:3]                       // 000008CC: 7E087902
                v_frexp_mant_f64  v[5:6], v[2:3]                          // 000008D0: 7E0A7B02
                v_and_b32     v6, 0x7fffffff, v6                          // 000008D4: 360C0CFF 7FFFFFFF
                v_rcp_f64     v[7:8], v[5:6]                              // 000008DC: 7E0E5F05
                v_fma_f64     v[9:10], -v[5:6], v[7:8], 1.0               // 000008E0: D2980009 23CA0F05
                v_fma_f64     v[7:8], v[7:8], v[9:10], v[7:8]             // 000008E8: D2980007 041E1307
                v_fma_f64     v[9:10], -v[5:6], v[7:8], 1.0               // 000008F0: D2980009 23CA0F05
                v_fma_f64     v[7:8], v[7:8], v[9:10], v[7:8]             // 000008F8: D2980007 041E1307
                v_fma_f64     v[9:10], v[5:6], v[7:8], -2.0               // 00000900: D2980009 03D60F05
                v_mul_f64     v[7:8], v[7:8], -v[9:10]                    // 00000908: D2CA0007 40021307
                v_fma_f64     v[9:10], v[5:6], v[7:8], -2.0               // 00000910: D2980009 03D60F05
                v_mul_f64     v[7:8], v[7:8], -v[9:10]                    // 00000918: D2CA0007 40021307
                v_mul_f64     v[9:10], v[7:8], 0.5                        // 00000920: D2CA0009 0001E107
                v_fma_f64     v[11:12], v[5:6], -v[9:10], 0.5             // 00000928: D298000B 43C21305
                v_fma_f64     v[9:10], v[7:8], v[11:12], v[9:10]          // 00000930: D2980009 04261707
                v_fma_f64     v[11:12], v[5:6], -v[7:8], 1.0              // 00000938: D298000B 43CA0F05
                v_mul_f64     v[11:12], v[11:12], v[7:8]                  // 00000940: D2CA000B 00020F0B
                v_fma_f64     v[5:6], v[5:6], -v[9:10], 0.5               // 00000948: D2980005 43C21305
              
              //   { .... CODE .... }
              
                v_fma_f64     v[5:6], v[7:8], v[5:6], v[11:12]            // 00000A50: D2980005 042E0B07
                v_or_b32      v6, 0x00040000, v6                          // 00000A58: 380C0CFF 00040000
                v_sub_i32     v4, vcc, 1, v4                              // 00000A60: 4C080881
                v_min_i32     v7, 0x00000300, v4                          // 00000A64: 220E08FF 00000300
                v_sub_i32     v8, vcc, 0, v7                              // 00000A6C: 4C100E80
                v_sub_i32     v4, vcc, v4, v7                             // 00000A70: 4C080F04
                v_ldexp_f64   v[11:12], v[9:10], v7                       // 00000A74: D2D0000B 00020F09
                v_ldexp_f64   v[13:14], v[11:12], v8                      // 00000A7C: D2D0000D 0002110B
                v_add_f64     v[8:9], v[9:10], -v[13:14]                  // 00000A84: D2C80008 40021B09
                v_add_f64     v[5:6], v[8:9], v[5:6]                      // 00000A8C: D2C80005 00020B08
                v_ldexp_f64   v[7:8], 1.0, v7                             // 00000A94: D2D00007 00020EF2
                v_fma_f64     v[5:6], v[5:6], v[7:8], v[11:12]            // 00000A9C: D2980005 042E0F05
                v_ldexp_f64   v[4:5], v[5:6], v4                          // 00000AA4: D2D00004 00020905
                v_div_fixup_f64  v[4:5], v[4:5], v[2:3], 1.0              // 00000AAC: D2C00004 03CA0504
                v_add_f64     v[2:3], v[2:3], v[4:5]                      // 00000AB4: D2C80002 00020902
                s_waitcnt     lgkmcnt(0)                                  // 00000ABC: BF8C007F
                v_mov_b32     v4, s0                                      // 00000AC0: 7E080200
              //-----------------------------------------------------------------------------------
              // writes p[0] from v[2:3]
              //  fails to load v[0:1] with dd.x[0];
              //  writes v[0:1] with result of loop N-2 (from above) to p[1]
              //-----------------------------------------------------------------------------------
                tbuffer_store_format_xy  v[2:3], v4, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
                tbuffer_store_format_xy  v[0:1], v4, s[4:7], 0 offen offset:8 format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
                s_endpgm                                                  // 00000AD4: BF810000
              

               

              Assembly code with the loop commented out. Compiler correctly loads dd.x[0] and stores to p[1].

               

                s_buffer_load_dword  s0, s[8:11], 0x00                    // 00000000: C2000900
                s_buffer_load_dwordx4  s[8:11], s[8:11], 0x04             // 00000004: C2840904
                s_waitcnt     lgkmcnt(0)                                  // 00000008: BF8C007F
                v_mov_b32     v0, s0                                      // 0000000C: 7E000200
                v_mov_b32     v1, s8                                      // 00000010: 7E020208
                v_mov_b32     v2, s9                                      // 00000014: 7E040209
                tbuffer_store_format_xy  v[1:2], v0, s[4:7], 0 offen offset:16 format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000018: EBDD1010 80010100
                v_mov_b32     v0, s0                                      // 00000020: 7E000200
                s_waitcnt     expcnt(0)                                   // 00000024: BF8C1F0F
                v_lshl_b64    v[1:2], 0, 0                                // 00000028: D2C20001 00010080
                tbuffer_store_format_xy  v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000030: EBDD1000 80010100
                s_endpgm                                                  /
              
              
              • Re: Re: Private memory corruption?
                yoyo

                • Re: Private memory corruption?
                  prao

                  Hi,

                   

                  I doubt if the way the DDouble structure is passed to the kernel is correct. Its a structure type. You would have to pass it as a pointer either with __global or __constant (as its read only here). With this change I see it works correctly.

                   

                  string src_string = 
                          "//#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n" 
                          "typedef struct { double x[2]; } DDouble;      \n" 
                          "                                              \n" 
                          "#define N 13                                  \n" 
                          "__kernel void A(__global double* p,           \n" 
                          "                __global DDouble* dd)                   \n" 
                          "{                                             \n" 
                          "  double tmp[N];                              \n" 
                          "                                              \n" 
                          "  tmp[0] = 3.14159;                           \n" 
                          "  for(int i = 1; i < N; i++)                  \n" 
                          "     tmp[i] = tmp[i-1]+1.0/tmp[i-1];          \n" 
                          "                                              \n" 
                          "  p[0] = tmp[N-1];                            \n" 
                  "  p[1] = dd->x[0];                             \n" 
                          "}                                             \n"; 


                  DDouble dd; 
                  dd.x[0] = 1.2345; 
                  dd.x[1] = 5.4321; 

                  cl::Buffer cldd(context, CL_MEM_READ_ONLY, sizeof(DDouble));
                  queue.enqueueWriteBuffer(cldd, CL_TRUE, 0,sizeof(DDouble), &dd, NULL, NULL);

                   

                  cl::Buffer buff(context, CL_MEM_READ_WRITE, 2*sizeof(double)); 

                  kernel.setArg(0, buff); 
                  kernel.setArg(1, cldd); 
                  queue.enqueueTask(kernel); 

                   

                  Another option is to use cl_double2 in which case you don't need to use the address qualifier. Verified that this also works fine.

                   

                  Regards

                  Pradeep

                    • Re: Private memory corruption?
                      yoyo

                      I doubt if the way the DDouble structure is passed to the kernel is correct. Its a structure type. You would have to pass it as a pointer either with __global or __constant (as its read only here).

                      Sorry, can't see your argument. From what I know about C language, there is no much difference between passing a built-in type and passing a type, defined with typedef. Neither I was able to find a place in OpenCL specification saying that I can't pass a structure as a private argument. Please correct me if I'm wrong. Also it somehow works for N <= 11.

                      Unfortunately, your code seems irrelevant to the issue, as the bug concerns private data integrity, not the ability to read from global memory.

                        • Re: Private memory corruption?
                          prao

                          I tried another experiment today. By changing the datatype of the private variable "tmp" declared inside the kernel from double to float, the sample works fine for N>11 too. So possibly the issue has to do with exceeding the private memory limit per work-item.That might be the reason why it works till N <=11 when "tmp" type is double. Anyway, I will try to get a clarification on this.

                           

                          Regards

                          Pradeep