Archives Discussions

Ignus · ‎04-20-2011

Hi!

We had a very frustrating and strange problem, that I was finally be able to track down (so it seems).

Below is code for a small console program and kernel.

Running it on the GPU gives different results than the CPU. The CPU is correct, the GPU is not. It seems, that a private variable declared in the kernel overwrites the kernel argument.

Could someone from AMD investigate?

Thanks in advance.

System: Linux 64-bit (CLInfo output is at the end of the message)

-----------------

Result on CPU

-----------------

Device name: AMD Athlon(tm) II X4 635 Processor
(7.000000,9.000000,6.000000,8.000000)
(7.000000,9.000000,6.000000,8.000000)
(7.000000,9.000000,6.000000,8.000000)
(7.000000,9.000000,6.000000,8.000000)
(7.000000,9.000000,6.000000,8.000000)
(7.000000,9.000000,6.000000,8.000000)
(7.000000,9.000000,6.000000,8.000000)
(7.000000,9.000000,6.000000,8.000000)
(7.000000,9.000000,6.000000,8.000000)
(7.000000,9.000000,6.000000,8.000000)

-----------------

Result on GPU

-----------------

Device name: Juniper
(36.000000,64.000000,6.000000,8.000000)
(36.000000,64.000000,6.000000,8.000000)
(36.000000,64.000000,6.000000,8.000000)
(36.000000,64.000000,6.000000,8.000000)
(36.000000,64.000000,6.000000,8.000000)
(36.000000,64.000000,6.000000,8.000000)
(36.000000,64.000000,6.000000,8.000000)
(36.000000,64.000000,6.000000,8.000000)
(36.000000,64.000000,6.000000,8.000000)
(36.000000,64.000000,6.000000,8.000000)

--

Greets,

I.

----------- code.cl ----------- struct OutData { float f[2]; float a[2]; }; struct InData { float f[2]; }; kernel void testkernel(struct InData input, global struct OutData * output) { float memtmp2[2]; int gid = get_global_id(0); memtmp2[0] = 36; // overwrites "input.f[0]" memtmp2[1] = 64; // overwrites "input.f[1]" output[gid].f[0] = input.f[0]; output[gid].f[1] = input.f[1]; // If I remove the loop, it is correct for (int i = 0; i < 2; ++i) output[gid].a = sqrt(memtmp2); } ----------- main.cpp ----------- #include <iostream> #include <fstream> #include <vector> #include <stdio.h> #include <CL/cl.h> using namespace std; #define geterr \ if (err) \ { \ cout << "openCL error " << err << " at line " << __LINE__ << endl; \ exit(1); \ } struct OutData { OutData() { for (int i = 0; i < 2; ++i) f = a = 0; } cl_float f[2], a[2]; }; struct InData { cl_float f[2]; }; void exec_kernel_amdtest(cl_context& ctx, cl_command_queue& queue, cl_program& prg) { cl_int err; InData indata; indata.f[0] = 7; indata.f[1] = 9; vector<OutData> outdata(10); // allocate buffers cl_mem buf0 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, outdata.size() * sizeof(outdata[0]), NULL, &err); geterr // upload data clEnqueueWriteBuffer(queue, buf0, true, 0, outdata.size() * sizeof(outdata[0]), &outdata[0], 0, NULL, NULL); cl_kernel ker = clCreateKernel(prg, "testkernel", &err); geterr size_t local_work_size[] = {outdata.size()}; size_t global_work_size[] = {outdata.size()}; clSetKernelArg(ker, 0, sizeof (indata), &indata); clSetKernelArg(ker, 1, sizeof (cl_mem), &buf0); err = clEnqueueNDRangeKernel(queue, ker, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL); geterr clEnqueueReadBuffer(queue, buf0, true, 0, outdata.size() * sizeof(outdata[0]), &outdata[0], 0, NULL, NULL); clFinish(queue); for (size_t i = 0; i < outdata.size(); ++i) { printf("(%f,%f,%f,%f)\n" , outdata.f[0] , outdata.f[1] , outdata.a[0] , outdata.a[1] ); } clReleaseMemObject(buf0); } int main(int argc, char* argv[]) { cl_int err; cl_context ctx; // --- Platform cl_platform_id platforms[10]; err = clGetPlatformIDs(1, platforms, NULL); // --- Context cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties) platforms[0], 0 }; if (argc > 1) ctx = clCreateContextFromType(cps, CL_DEVICE_TYPE_CPU, NULL, NULL, &err); else ctx = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &err); geterr // --- Device cl_device_id devices[10]; err = clGetContextInfo(ctx, CL_CONTEXT_DEVICES, sizeof(devices), devices, NULL); geterr // --- Read source ifstream f("code.cl"); if (f.fail()) { cerr << "file open error" << endl; return 1; } string src = std::string(std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>()); // --- Compile const char* sources[1] = {src.c_str()}; cl_program prg = clCreateProgramWithSource(ctx, 1, sources, NULL, &err); geterr err = clBuildProgram(prg, 0, NULL, "", NULL, NULL); geterr char buildlog[100000] = {0}; clGetProgramBuildInfo(prg, devices[0], CL_PROGRAM_BUILD_LOG, sizeof(buildlog), buildlog, NULL); cout << buildlog << "\n"; // --- Device name we run on char deviceName[1024]; clGetDeviceInfo(devices[0], CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL); cout << "Device name: " << deviceName << endl; // --- Command queue cl_command_queue queue = clCreateCommandQueue(ctx, devices[0], 0, &err); geterr // --- Execute kernel exec_kernel_amdtest(ctx, queue, prg); clReleaseContext(ctx); } ----------- CLinfo output ----------- Number of platforms: 1 Platform Profile: FULL_PROFILE Platform Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Platform Name: AMD Accelerated Parallel Processing Platform Vendor: Advanced Micro Devices, Inc. Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices Platform Name: AMD Accelerated Parallel Processing Number of devices: 2 Device Type: CL_DEVICE_TYPE_GPU Device ID: 4098 Max compute units: 10 Max work items dimensions: 3 Max work items[0]: 256 Max work items[1]: 256 Max work items[2]: 256 Max work group size: 256 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 850Mhz Address bits: 32 Max memory allocation: 134217728 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 8192 Max image 2D height: 8192 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 1024 Alignment (bits) of base address: 32768 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 536870912 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 0x7f48a8b5d800 Name: Juniper Vendor: Advanced Micro Devices, Inc. Driver version: CAL 1.4.1332 Profile: FULL_PROFILE Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops cl_amd_popcnt Device Type: CL_DEVICE_TYPE_CPU Device ID: 4098 Max compute units: 4 Max work items dimensions: 3 Max work items[0]: 1024 Max work items[1]: 1024 Max work items[2]: 1024 Max work group size: 1024 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 800Mhz Address bits: 64 Max memory allocation: 2147483648 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 8192 Max image 2D height: 8192 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 4096 Alignment (bits) of base address: 1024 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: Yes Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: No Cache type: Read/Write Cache line size: 64 Cache size: 65536 Global memory size: 4157833216 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Global Local memory size: 32768 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: Yes Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 0x7f48a8b5d800 Name: AMD Athlon(tm) II X4 635 Processor Vendor: AuthenticAMD Driver version: 2.0 Profile: FULL_PROFILE Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (595.10) Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_media_ops cl_amd_popcnt cl_amd_printf Passed!

MicahVillmow · ‎04-20-2011

Ignus,
Thanks for bringing this to my attention. I'll get it fixed. Also from a performance perspective, this is inefficient. What would be ideal is that you define InData as a constant struct InData* input argument.

Ignus · ‎04-21-2011

Micah,

You're welcome. Looking forward to it in the next release then.

Thanks for your suggestion, I'll try that.

--

Greets,

I.

Ignus · ‎04-21-2011

Hi!

I tried your suggestion, but unfortunately, it's too slow in our case.

InData usually contains just a very small number of items, like above, and we do a lot of kernel invocations, but the values in InData could be different each time. And for that, I have to call WriteBuffer() before each kernel invocation, with just a few bytes of size. And the overhead of that is just too big.

--

Greets,

I.

MicahVillmow · ‎04-20-2011

This will be fixed in the next release.

Archives Discussions

OpenCL miscompile - Private variable overwrites argument