cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

yoyo
Journeyman III

Private memory corruption?

Hi.

I'm using version 13.9 of Catalyst drivers on Linux.

I've got an OpenCL program that was tested on an Nvidia GPU and works quite well, but it behaves strangely when I'm trying to run it on Radeon HD 7970. That is, when I pass by value an argument to a kernel that uses a private array, at some point argument's private copy becomes corrupted. I can reproduce it with the following simple code:


#include <iostream>


#include <string>


#include <CL/cl.hpp>



using std::cout;


using std::endl;


using std::string;



typedef struct


{


  double x[2];


} DDouble;



int main()


{


  VECTOR_CLASS<cl::Platform> platforms;


  cl::Platform::get(&platforms);


  cl::Platform pl = platforms[0];



  VECTOR_CLASS<cl::Device> devices;


  pl.getDevices(CL_DEVICE_TYPE_ALL, &devices);


  cl::Device dev = devices[0];



  string tmp;


  pl.getInfo(CL_PLATFORM_NAME, &tmp);


  cout << "Platform: " << tmp << endl;



  dev.getInfo(CL_DEVICE_NAME, &tmp);


  cout << "Device: " << tmp << endl;



  cl::Context context = cl::Context(VECTOR_CLASS<cl::Device>(1, dev));


  cl::CommandQueue queue = cl::CommandQueue(context, dev);



  string src_string =


    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"


    "typedef struct { double x[2]; } DDouble;      \n"


    "                                              \n"


    "#define N 12                                  \n"


    "__kernel void A(__global double* p,           \n"


    "                DDouble dd)                   \n"


    "{                                             \n"


    "  double tmp;                              \n"


    "                                              \n"


    "  tmp[0] = 3.14159;                           \n"


    "  for(int i = 1; i < N; i++)                  \n"


    "     tmp = tmp[i-1]+1.0/tmp[i-1];          \n"


    "                                              \n"


    "  p[0] = tmp[N-1];                            \n"


    "  p[1] = dd.x[0];                             \n"


    "}                                             \n";



  cl::Program::Sources src(1, std::make_pair(src_string.c_str(),


                                             src_string.size()));


  cl::Program program(context, src);


  program.build(VECTOR_CLASS<cl::Device>(1, dev));



  string buildLog;


  program.getBuildInfo(dev, CL_PROGRAM_BUILD_LOG,


                       &buildLog);


  cout << "Build log:" << endl


       << " ******************** " << endl


       << buildLog << endl


       << " ******************** " << endl;



  cl::Kernel kernel(program, "A");



  DDouble dd;


  dd.x[0] = 1.2345;


  dd.x[1] = 5.4321;



  cl::Buffer buff(context, CL_MEM_READ_WRITE, 2*sizeof(double));



  kernel.setArg(0, buff);


  kernel.setArg(1, dd);


  queue.enqueueTask(kernel);



  queue.finish();



  double* map = (double*)queue.enqueueMapBuffer(buff, CL_TRUE,


                                                CL_MAP_READ, 0,


                                                2*sizeof(double));


  cout << "p[0] = " << map[0] << endl


       << "p[1] = " << map[1] << endl;


  queue.enqueueUnmapMemObject(buff, map);



  return 0;


}



Output:


Platform: AMD Accelerated Parallel Processing


Device: Tahiti


Build log:


********************


"/tmp/OCLjM72Hc.cl", line 1: warning: OpenCL extension is now part of core


  #pragma OPENCL EXTENSION cl_khr_fp64 : enable


                           ^




********************


p[0] = 5.69946


p[1] = 5.13595



I was, of course, expecting to get p[1]=1.2345. It gives correct results for N=11, however.

My questions are:

1. Is it a compiler's bug or an intended behaviour?

2. Is there a way to make compiler issue a warning when it's going to produce a code that overwrites data that should remain unchanged? I can rewrite some of my kernels to use more local memory instead of private, but I want to be sure that if kernel compiles without warnings it will run correctly.

0 Likes
11 Replies
nou
Exemplar

Even simpler kernel show this bug


"#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n" 


"typedef struct { double x,y; } DDouble;      \n" 


"                                              \n" 


"__kernel void A(__global double* p,           \n" 


"                __private DDouble dd)         \n" 


"{                                             \n"  


"                                              \n" 


"  p[0] = dd.x;                                \n"


"  p[1] = dd.y;                                \n" 


"}                                             \n";


If you change double in struct definition to float it is working as intended. I tested it only on CPU device.

0 Likes
yoyo
Journeyman III

Can't reproduce -- your kernel works fine on my GPU.

Also the following version seems to be unaffected:


  string src_string =


    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"


    "                                              \n"


    "#define N 12                                  \n"


    "__kernel void A(__global double* p,           \n"


    "                double2 dd)                   \n"


    "{                                             \n"


    "  double tmp;                              \n"


    "                                              \n"


    "  tmp[0] = 3.14159;                           \n"


    "  for(int i = 1; i < N; i++)                  \n"


    "     tmp = tmp[i-1]+1.0/tmp[i-1];          \n"


    "                                              \n"


    "  p[0] = tmp[N-1];                            \n"


    "  p[1] = dd.s0;                               \n"


    "}                                             \n";


0 Likes

my bug can be reproduced only with CPU and original SDK version of OpenCL. with catalyst version of OpenCL runtime it is working correctly.

0 Likes
drallan
Challenger

******************** 

  1. p[0] = 5.69946

  2. p[1] = 5.13595



I was, of course, expecting to get p[1]=1.2345. It gives correct results for N=11, however.



My questions are:


1. Is it a compiler's bug or an intended behaviour?


2. Is there a way to make compiler issue a warning when it's going to produce a code that overwrites data that should remain unchanged? I can rewrite some of my kernels to use more local memory instead of private, but I want to be sure that if kernel compiles without warnings it will run correctly.



I looks like a a compiler bug. The value 5.13595 is a result from about loop N=9, stored in register v[0:1].

When the loop is present, the compiler "forgets" to load dd.x[1] into v[0:1] and then writes v[0:1] to location p[1].

Below is the assembly code showing the compiler error. The last time v[0:1] is loaded it contains value from loop 9, the last two statements at the end write to p[0] and p[1]. v[0:1] is written to p[1]. The second code box shows the program with out the loop, and dd.x[0] is loaded and saved correctly.


//------------------------------------------------------------------------------------
// Last part of program, double version
//------------------------------------------------------------------------------------



//   { .... CODE .... }



  v_ldexp_f64   v[5:6], 1.0, v5                             // 000007AC: D2D00005 00020AF2
  v_fma_f64     v[3:4], v[3:4], v[5:6], v[9:10]             // 000007B4: D2980003 04260B03
  v_ldexp_f64   v[2:3], v[3:4], v2                          // 000007BC: D2D00002 00020503
  v_div_fixup_f64  v[2:3], v[2:3], v[0:1], 1.0              // 000007C4: D2C00002 03CA0102
//--------------------------------------------------------------------
//   v[0:1] loaded with  division result loop N-2
//--------------------------------------------------------------------
  v_add_f64     v[0,1], v[0:1], v[2:3]     <--------        // 000007CC: D2C80000 00020500
  v_frexp_exp_i32_f64  v[2:3], v[0:1]                       // 000007D4: 7E047900
  v_frexp_mant_f64  v[3:4], v[0:1]                          // 000007D8: 7E067B00
  v_and_b32     v4, 0x7fffffff, v4                          // 000007DC: 360808FF 7FFFFFFF
  v_rcp_f64     v[5:6], v[3:4]                              // 000007E4: 7E0A5F03
  v_fma_f64     v[7:8], -v[3:4], v[5:6], 1.0                // 000007E8: D2980007 23CA0B03
  v_fma_f64     v[5:6], v[5:6], v[7:8], v[5:6]              // 000007F0: D2980005 04160F05
  v_fma_f64     v[7:8], -v[3:4], v[5:6], 1.0                // 000007F8: D2980007 23CA0B03
  v_fma_f64     v[5:6], v[5:6], v[7:8], v[5:6]              // 00000800: D2980005 04160F05
  v_fma_f64     v[7:8], v[3:4], v[5:6], -2.0                // 00000808: D2980007 03D60B03
  v_mul_f64     v[5:6], v[5:6], -v[7:8]                     // 00000810: D2CA0005 40020F05
  v_fma_f64     v[7:8], v[3:4], v[5:6], -2.0                // 00000818: D2980007 03D60B03
  v_mul_f64     v[5:6], v[5:6], -v[7:8]                     // 00000820: D2CA0005 40020F05
  v_mul_f64     v[7:8], v[5:6], 0.5                         // 00000828: D2CA0007 0001E105
  v_fma_f64     v[9:10], v[3:4], -v[7:8], 0.5               // 00000830: D2980009 43C20F03
  v_fma_f64     v[7:8], v[5:6], v[9:10], v[7:8]             // 00000838: D2980007 041E1305
  v_fma_f64     v[9:10], v[3:4], -v[5:6], 1.0               // 00000840: D2980009 43CA0B03
  v_mul_f64     v[9:10], v[9:10], v[5:6]                    // 00000848: D2CA0009 00020B09
  v_fma_f64     v[3:4], v[3:4], -v[7:8], 0.5                // 00000850: D2980003 43C20F03
  v_mul_f64     v[9:10], v[9:10], v[3:4]                    // 00000858: D2CA0009 00020709
  v_fma_f64     v[3:4], v[5:6], v[3:4], v[9:10]             // 00000860: D2980003 04260705
  v_or_b32      v4, 0x00040000, v4                          // 00000868: 380808FF 00040000
  v_sub_i32     v2, vcc, 1, v2                              // 00000870: 4C040481
  v_min_i32     v5, 0x00000300, v2                          // 00000874: 220A04FF 00000300
  v_sub_i32     v6, vcc, 0, v5                              // 0000087C: 4C0C0A80
  v_sub_i32     v2, vcc, v2, v5                             // 00000880: 4C040B02
  v_ldexp_f64   v[9:10], v[7:8], v5                         // 00000884: D2D00009 00020B07
  v_ldexp_f64   v[11:12], v[9:10], v6                       // 0000088C: D2D0000B 00020D09
  v_add_f64     v[6:7], v[7:8], -v[11:12]                   // 00000894: D2C80006 40021707
  v_add_f64     v[3:4], v[6:7], v[3:4]                      // 0000089C: D2C80003 00020706
  v_ldexp_f64   v[5:6], 1.0, v5                             // 000008A4: D2D00005 00020AF2
  v_fma_f64     v[3:4], v[3:4], v[5:6], v[9:10]             // 000008AC: D2980003 04260B03
  v_ldexp_f64   v[2:3], v[3:4], v2                          // 000008B4: D2D00002 00020503
  v_div_fixup_f64  v[2:3], v[2:3], v[0:1], 1.0              // 000008BC: D2C00002 03CA0102



//   { .... CODE .... }



//--------------------------------------------------------------------
//   v[0:1] last reference to v[0:1]
//--------------------------------------------------------------------



  v_add_f64     v[2:3], v[0:1], v[2:3]                      // 000008C4: D2C80002 00020500
  v_frexp_exp_i32_f64  v[4:5], v[2:3]                       // 000008CC: 7E087902
  v_frexp_mant_f64  v[5:6], v[2:3]                          // 000008D0: 7E0A7B02
  v_and_b32     v6, 0x7fffffff, v6                          // 000008D4: 360C0CFF 7FFFFFFF
  v_rcp_f64     v[7:8], v[5:6]                              // 000008DC: 7E0E5F05
  v_fma_f64     v[9:10], -v[5:6], v[7:8], 1.0               // 000008E0: D2980009 23CA0F05
  v_fma_f64     v[7:8], v[7:8], v[9:10], v[7:8]             // 000008E8: D2980007 041E1307
  v_fma_f64     v[9:10], -v[5:6], v[7:8], 1.0               // 000008F0: D2980009 23CA0F05
  v_fma_f64     v[7:8], v[7:8], v[9:10], v[7:8]             // 000008F8: D2980007 041E1307
  v_fma_f64     v[9:10], v[5:6], v[7:8], -2.0               // 00000900: D2980009 03D60F05
  v_mul_f64     v[7:8], v[7:8], -v[9:10]                    // 00000908: D2CA0007 40021307
  v_fma_f64     v[9:10], v[5:6], v[7:8], -2.0               // 00000910: D2980009 03D60F05
  v_mul_f64     v[7:8], v[7:8], -v[9:10]                    // 00000918: D2CA0007 40021307
  v_mul_f64     v[9:10], v[7:8], 0.5                        // 00000920: D2CA0009 0001E107
  v_fma_f64     v[11:12], v[5:6], -v[9:10], 0.5             // 00000928: D298000B 43C21305
  v_fma_f64     v[9:10], v[7:8], v[11:12], v[9:10]          // 00000930: D2980009 04261707
  v_fma_f64     v[11:12], v[5:6], -v[7:8], 1.0              // 00000938: D298000B 43CA0F05
  v_mul_f64     v[11:12], v[11:12], v[7:8]                  // 00000940: D2CA000B 00020F0B
  v_fma_f64     v[5:6], v[5:6], -v[9:10], 0.5               // 00000948: D2980005 43C21305



//   { .... CODE .... }



  v_fma_f64     v[5:6], v[7:8], v[5:6], v[11:12]            // 00000A50: D2980005 042E0B07
  v_or_b32      v6, 0x00040000, v6                          // 00000A58: 380C0CFF 00040000
  v_sub_i32     v4, vcc, 1, v4                              // 00000A60: 4C080881
  v_min_i32     v7, 0x00000300, v4                          // 00000A64: 220E08FF 00000300
  v_sub_i32     v8, vcc, 0, v7                              // 00000A6C: 4C100E80
  v_sub_i32     v4, vcc, v4, v7                             // 00000A70: 4C080F04
  v_ldexp_f64   v[11:12], v[9:10], v7                       // 00000A74: D2D0000B 00020F09
  v_ldexp_f64   v[13:14], v[11:12], v8                      // 00000A7C: D2D0000D 0002110B
  v_add_f64     v[8:9], v[9:10], -v[13:14]                  // 00000A84: D2C80008 40021B09
  v_add_f64     v[5:6], v[8:9], v[5:6]                      // 00000A8C: D2C80005 00020B08
  v_ldexp_f64   v[7:8], 1.0, v7                             // 00000A94: D2D00007 00020EF2
  v_fma_f64     v[5:6], v[5:6], v[7:8], v[11:12]            // 00000A9C: D2980005 042E0F05
  v_ldexp_f64   v[4:5], v[5:6], v4                          // 00000AA4: D2D00004 00020905
  v_div_fixup_f64  v[4:5], v[4:5], v[2:3], 1.0              // 00000AAC: D2C00004 03CA0504
  v_add_f64     v[2:3], v[2:3], v[4:5]                      // 00000AB4: D2C80002 00020902
  s_waitcnt     lgkmcnt(0)                                  // 00000ABC: BF8C007F
  v_mov_b32     v4, s0                                      // 00000AC0: 7E080200
//-----------------------------------------------------------------------------------
// writes p[0] from v[2:3]
//  fails to load v[0:1] with dd.x[0];
//  writes v[0:1] with result of loop N-2 (from above) to p[1]
//-----------------------------------------------------------------------------------
  tbuffer_store_format_xy  v[2:3], v4, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
  tbuffer_store_format_xy  v[0:1], v4, s[4:7], 0 offen offset:8 format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT]
  s_endpgm                                                  // 00000AD4: BF810000


Assembly code with the loop commented out. Compiler correctly loads dd.x[0] and stores to p[1].


  s_buffer_load_dword  s0, s[8:11], 0x00                    // 00000000: C2000900
  s_buffer_load_dwordx4  s[8:11], s[8:11], 0x04             // 00000004: C2840904
  s_waitcnt     lgkmcnt(0)                                  // 00000008: BF8C007F
  v_mov_b32     v0, s0                                      // 0000000C: 7E000200
  v_mov_b32     v1, s8                                      // 00000010: 7E020208
  v_mov_b32     v2, s9                                      // 00000014: 7E040209
  tbuffer_store_format_xy  v[1:2], v0, s[4:7], 0 offen offset:16 format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000018: EBDD1010 80010100
  v_mov_b32     v0, s0                                      // 00000020: 7E000200
  s_waitcnt     expcnt(0)                                   // 00000024: BF8C1F0F
  v_lshl_b64    v[1:2], 0, 0                                // 00000028: D2C20001 00010080
  tbuffer_store_format_xy  v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000030: EBDD1000 80010100
  s_endpgm                                                  /



0 Likes
yoyo
Journeyman III

0 Likes

This seems to be a bug.  I have reported the issue to the team.

0 Likes

The team has reproduced this issue, and a bug has been filed.  I will keep you updated on the status.

0 Likes

Thank you, I will be waiting for the news.

Meanwhile, the bug is reproducible with 14.1 beta.

0 Likes
prao
Staff

Hi,

I doubt if the way the DDouble structure is passed to the kernel is correct. Its a structure type. You would have to pass it as a pointer either with __global or __constant (as its read only here). With this change I see it works correctly.

string src_string = 
        "//#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n" 
        "typedef struct { double x[2]; } DDouble;      \n" 
        "                                              \n" 
        "#define N 13                                  \n" 
        "__kernel void A(__global double* p,           \n" 
        "                __global DDouble* dd)                   \n" 
        "{                                             \n" 
        "  double tmp;                              \n" 
        "                                              \n" 
        "  tmp[0] = 3.14159;                           \n" 
        "  for(int i = 1; i < N; i++)                  \n" 
        "     tmp = tmp[i-1]+1.0/tmp[i-1];          \n" 
        "                                              \n" 
        "  p[0] = tmp[N-1];                            \n" 
"  p[1] = dd->x[0];                             \n" 
        "}                                             \n"; 


DDouble dd; 
dd.x[0] = 1.2345; 
dd.x[1] = 5.4321; 

cl::Buffer cldd(context, CL_MEM_READ_ONLY, sizeof(DDouble));
queue.enqueueWriteBuffer(cldd, CL_TRUE, 0,sizeof(DDouble), &dd, NULL, NULL);

cl::Buffer buff(context, CL_MEM_READ_WRITE, 2*sizeof(double)); 

kernel.setArg(0, buff); 
kernel.setArg(1, cldd); 
queue.enqueueTask(kernel); 

Another option is to use cl_double2 in which case you don't need to use the address qualifier. Verified that this also works fine.

Regards

Pradeep

0 Likes
yoyo
Journeyman III


I doubt if the way the DDouble structure is passed to the kernel is correct. Its a structure type. You would have to pass it as a pointer either with __global or __constant (as its read only here).


Sorry, can't see your argument. From what I know about C language, there is no much difference between passing a built-in type and passing a type, defined with typedef. Neither I was able to find a place in OpenCL specification saying that I can't pass a structure as a private argument. Please correct me if I'm wrong. Also it somehow works for N <= 11.

Unfortunately, your code seems irrelevant to the issue, as the bug concerns private data integrity, not the ability to read from global memory.

0 Likes

I tried another experiment today. By changing the datatype of the private variable "tmp" declared inside the kernel from double to float, the sample works fine for N>11 too. So possibly the issue has to do with exceeding the private memory limit per work-item.That might be the reason why it works till N <=11 when "tmp" type is double. Anyway, I will try to get a clarification on this.

Regards

Pradeep

0 Likes