cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

schrotti007
Journeyman III

Strange Error with Local Memory on RV770

When I allocate more local Memory than actually needed, my kernel stops working

Hello

I Discovered (maybe) a quite strange bug. I have a simple Kernel that does nothing more than copying the input data into a local buffer and copying from that buffer into the output buffer.

On the host, I must allocate at least 4*localworksize, because i copy 4 bytes at once. But if I allocate more, lets say 16*localworksize, the kernel does not copy the data correctly. But I can find out only by comparing the input data with the output data, the returnvalue of clEnqueueNDRangeKernel is always CL_SUCCESS.

Is it just a bug in the local Memory of the RV770 (I use a 4850), or a does the local memory just not work this way? (I didnt even access the local memory from other threads, so of course the use of local memory is pretty useless here, its just a simple example to spot the problem)

Almost forgot:

Using Stream Sdk 2.2, Catalyst 10.10, Radeon 4850 with 512Mb and Windows 7 64 bit.

//HelloCL_Kernels.cl __kernel void blocktest(__global uchar4 * output , __global uchar4 * input , __local uchar4 * block0 ) { unsigned int blockIdx = get_group_id(0); unsigned int localIndex = get_local_id(0); unsigned int globalIndex = blockIdx * get_local_size(0) + localIndex; block0[localIndex] = input[globalIndex]; output[globalIndex] = block0[localIndex]; } //HelloCL.cpp #include <stdio.h> #include <CL/cl.h> #include <iostream> #define CHECKSTATUS if (status != CL_SUCCESS) {cout << "error\n";return 1;} using namespace std; int main() { cl_uint numPlatforms; cl_platform_id platform = NULL; cl_int status; size_t size = 1048576*64; cl_int localthreads = 64; status = clGetPlatformIDs(0, NULL, &numPlatforms); CHECKSTATUS cout << numPlatforms << " OpenCl Platforms found:\n"; cl_platform_id* platforms = new cl_platform_id[numPlatforms]; status = clGetPlatformIDs(numPlatforms, platforms, NULL); CHECKSTATUS if (numPlatforms!=1) { unsigned int pf; cout << "Select Platform: "; cin >> pf; if (pf<numPlatforms) platform = platforms[pf]; else {cout << "Falsche Antwort :(";return 1;} } else platform = platforms[0]; cl_device_id deviceId; status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &deviceId, 0); CHECKSTATUS cl_context context; cl_command_queue commandQueue; context = clCreateContext(NULL, 1, &deviceId, NULL, NULL, &status); CHECKSTATUS commandQueue = clCreateCommandQueue(context, deviceId, NULL, &status); CHECKSTATUS FILE* f = fopen("d:\\oo.exe", "rb"); if (!f) { cout << "File not found\n"; return 1; } fseek(f, 0, SEEK_END); size_t filesize = ftell(f); fseek(f, 0, SEEK_SET); if (filesize < size) { cout << "File 2 small\n"; return 1; } char* indata = (char*)malloc(size); fread(indata, 1, size, f); fclose(f); char* outdata = (char*)malloc(size); f = fopen("HelloCL_Kernels.cl", "rb"); if (!f) { cout << "File not found\n"; return 1; } fseek(f, 0, SEEK_END); filesize = ftell(f); fseek(f, 0, SEEK_SET); char* kernelString = (char*)malloc(filesize+1); fread(kernelString, 1, filesize, f); fclose(f); kernelString[filesize]=0; cl_program program; cl_kernel kernel; program = clCreateProgramWithSource(context, 1,(const char**) &kernelString, &filesize, &status); CHECKSTATUS status = clBuildProgram(program, 1, &deviceId, NULL, NULL, NULL); if (status != CL_SUCCESS) { cout << "Cant compile Program\n";//, status is "<<status<< StrCLError(status) << "\n"; char* programLog = (char*) malloc(1024000); // evtl ein bischen übertrieben, aber 10000 is definitiv zuwenig! size_t muhh = 1337; status = clGetProgramBuildInfo(program, deviceId, CL_PROGRAM_BUILD_LOG, 1024000, programLog, &muhh); if (status != CL_SUCCESS) cout << "Cant get extended infos\n"; cout << muhh << "\n"; programLog[muhh]=0; cout << programLog; return 1; } size_t globalThreads[1]= {size/4}; // every globalthread should copy 4 bytes size_t localThreads[1] = {4}; cl_event evt; kernel = clCreateKernel(program, "blocktest", &status); CHECKSTATUS cl_command_queue commandqueue = clCreateCommandQueue(context, deviceId, NULL, &status); CHECKSTATUS cl_mem indata_device = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &status); CHECKSTATUS status = clEnqueueWriteBuffer(commandqueue, indata_device, CL_TRUE, 0, size, indata, 0, NULL, NULL); CHECKSTATUS cl_mem outdata_device = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &status); CHECKSTATUS status = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&outdata_device); CHECKSTATUS status = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&indata_device); CHECKSTATUS status = clSetKernelArg(kernel, 2, localthreads*64, NULL); CHECKSTATUS status = clEnqueueNDRangeKernel(commandqueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &evt); CHECKSTATUS status = clWaitForEvents(1, &evt); CHECKSTATUS status = clEnqueueReadBuffer(commandqueue, outdata_device, CL_TRUE, 0, size, outdata, 0, NULL, NULL); CHECKSTATUS cout << "Compare inbuffer to outbuffer\n"; for (unsigned int i=0;i<size;i++) { if (indata != outdata) { cout << "Data mismatch at " << i << "\n"; } } cout << "Done!\n"; cin >> status; }

0 Likes
14 Replies
nou
Exemplar

you do not enable byte adresable extension in the kernel. and IIRC it is not supported on R7xx GPU.

0 Likes

Why do I need this, I access char4, not char? But anyway shouldnt I get an error when compiling the kernel?

Is it generally ok to access the x, y, z, w components of a char4 without byte adressable extension?

0 Likes

Byte addressable is only for writes to memory of less than 32 bits. Otherwise you can play with any data type you want.

p.s. I'll be looking into this later today to see if I can spot the problem.
0 Likes

Found an error in my code.

Line 89 should be

size_t localThreads[1] = {localthreads};

not

size_t localThreads[1] = {4};

as it was before. The error occours then when in line 105, the given size of the localmemory is localthreads*256, it works if it is localthreads*128. Tried different sizes for localthreads...

 

0 Likes

However, I do not see how it could affect the execution. Local thread number = {4} means you have only 4 thread in workgroup. That's all. And will have more workgroups to run.

In detail, to get a more clear question:

(localthreads = 64) x 256  = 16k LDS you want at most, what is ok for 4850.  In case of (localthreads=4) * 256 = 1K (LDS size) you have only 4 threads running per group.  Does it mean that ocl launches few work groups on single simd (in this case 16 worgroups) so addressing in kernel just go wrong and concurrent workgroups just screw up each other?

(I also puzzled that even if localthreads=4, how come (4x256)=1K LDS fails, while 4x128 = 512bytes is ok. ???).

0 Likes

I've played around with schrotti007's sample code on my RV730 (had to adapt it a bit).

I used 16K local buffers and 32 thread per group.

Everything works fine with 16M global buffers but the copying fails with 32M global buffers. If I reduce the local buffers to 8K then 32M global buffers work.

#include <stdio.h> #include <CL/cl.h> #include <iostream> using namespace std; void checkStatus(int status, const char* func) { if(status != CL_SUCCESS) { cout << "Unsuccessfull call to " << func << "(): " << status << endl; exit(status); } } int main(int argc, char* argv[]) { cl_uint numPlatforms; cl_platform_id platform = NULL; cl_int status; size_t size = 32 * 1024 * 1024; cl_int localthreads = 32; status = clGetPlatformIDs(0, NULL, &numPlatforms); checkStatus(status, "clGetPlatformIDs"); cout << numPlatforms << " OpenCl Platform(s) found" << endl; cl_platform_id* platforms = new cl_platform_id[numPlatforms]; status = clGetPlatformIDs(numPlatforms, platforms, NULL); checkStatus(status, "clGetPlatformIDs"); if (numPlatforms!=1) { unsigned int pf; cout << "Select Platform: "; cin >> pf; if (pf<numPlatforms) { platform = platforms[pf]; } else { cout << "Falsche Antwort :("; return 1; } } else { platform = platforms[0]; } cl_device_id deviceId; status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &deviceId, 0); checkStatus(status, "clGetDeviceIDs"); cl_context context; cl_command_queue commandQueue; context = clCreateContext(NULL, 1, &deviceId, NULL, NULL, &status); checkStatus(status, "clCreateContext"); commandQueue = clCreateCommandQueue(context, deviceId, NULL, &status); checkStatus(status, "clCreateCommandQueue"); FILE* f = fopen("HelloCL_Kernels.cl", "rb"); if(!f) { cout << "File not found or cannot be opened!" << endl; return 1; } fseek(f, 0, SEEK_END); size_t filesize = ftell(f); fseek(f, 0, SEEK_SET); char* kernelString = (char*)malloc(filesize+1); fread(kernelString, 1, filesize, f); fclose(f); kernelString[filesize] = 0; cl_program program; cl_kernel kernel; program = clCreateProgramWithSource(context, 1, (const char**)&kernelString, &filesize, &status); checkStatus(status, "clCreateProgramWithSource"); status = clBuildProgram(program, 1, &deviceId, NULL, NULL, NULL); if (status != CL_SUCCESS) { cout << "Cannot compile kernel: " << status << endl; // << StrCLError(status) << endl; /*char* programLog = (char*) malloc(1024000); // evtl ein bischen übertrieben, aber 10000 is definitiv zuwenig! size_t muhh = 1337; status = clGetProgramBuildInfo(program, deviceId, CL_PROGRAM_BUILD_LOG, 1024000, programLog, &muhh); if (status != CL_SUCCESS) cout << "Cant get extended infos\n"; cout << muhh << "\n"; programLog[muhh]=0; cout << programLog;*/ return status; } char* indata = new char[size]; char* outdata = new char[size]; for(int n = 0; n < size; n++) { indata = rand() % 255; outdata = 0; } size_t globalThreads[1] = {size/4}; // every globalthread should copy 4 bytes size_t localThreads[1] = {localthreads}; cl_event evt; kernel = clCreateKernel(program, "blocktest", &status); checkStatus(status, "clCreateKernel"); cl_command_queue commandqueue = clCreateCommandQueue(context, deviceId, NULL, &status); checkStatus(status, "clCreateCommandQueue"); cl_mem indata_device = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &status); checkStatus(status, "clCreateBuffer"); status = clEnqueueWriteBuffer(commandqueue, indata_device, CL_TRUE, 0, size, indata, 0, NULL, NULL); checkStatus(status, "clEnqueueWriteBuffer"); cl_mem outdata_device = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &status); checkStatus(status, "clCreateBuffer"); status = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&outdata_device); checkStatus(status, "clSetKernelArg"); status = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&indata_device); checkStatus(status, "clSetKernelArg"); status = clSetKernelArg(kernel, 2, 256*localthreads, NULL); checkStatus(status, "clSetKernelArg"); status = clEnqueueNDRangeKernel(commandqueue, kernel, 1, // 1-dimensional NULL, // reserved globalThreads, localThreads, // Work group size 0, NULL, &evt); checkStatus(status, "clEnqueueNDRangeKernel"); status = clWaitForEvents(1, &evt); checkStatus(status, "clWaitForEvents"); status = clEnqueueReadBuffer(commandqueue, outdata_device, CL_TRUE, 0, size, outdata, 0, NULL, NULL); checkStatus(status, "clEnqueueReadBuffer"); cout << "Compare inbuffer to outbuffer\n"; for (unsigned int i = 0; i < size; i++) { if (indata != outdata) { cout << "Data mismatch at " << i << endl; break; } } cout << "Done!\n"; }

0 Likes

doc brown,

I think we can only use a part of global memory to emulate local memory in global mamory.

What does the clInfo return?(spaecially Local Memory size)

0 Likes

Here is the output of clinfo on my system.

Device Type: CL_DEVICE_TYPE_GPU Device ID: 4098 Max compute units: 8 Max work items dimensions: 3 Max work items[0]: 128 Max work items[1]: 128 Max work items[2]: 128 Max work group size: 128 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 0Mhz Address bits: 32 Max memory allocation: 134217728 Image support: No Max size of kernel argument: 1024 Alignment (bits) of base address: 32768 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 268435456 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Global Local memory size: 16384 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 0x7f5872361b20 Name: ATI RV730 Vendor: Advanced Micro Devices, Inc. Driver version: CAL 1.4.815 Profile: FULL_PROFILE Version: OpenCL 1.0 ATI-Stream-v2.2 (302) Extensions: cl_khr_icd cl_khr_gl_sharing cl_amd_device_attribute_query

0 Likes

docbrown,

Local memory in RV7xx device is emulated in global memory and therefore same global memory is used for local buffers as well as global buffers.

I think you are working almost on the limit. so if one buffer is big other becomes doesn't have enough space.

In which API do you get error and  what is the error code returned?

0 Likes

himanshu.gautam,

the little test program is running just fine with every tested memory size (within specifications of the GPU), so no error code at all.

But, the data which is copied by the kernel is in some cases invalid. That means copying data from a buffer A to a buffer B leaves a buffer B which does not contain the same data as in A.

I know that the local memory is emulated via global memory, but in my opinion the GPU should not reuse global buffers that are already used for emulating local buffers, should it?

Regards

Edit: I used the ATI Stream SDK v2.2 on Debian Testing with fglrx 10.9.

0 Likes

docbrown,

Definitely GPU does not reuse global buffers already used for emulating local buffes.

Can you post the kernel code also. I can try to see if the problem exists with internal driver.

You can also send a testcase at streamdeveloper@amd.com

Also mention the Operating System you are using.

0 Likes

He is using the same Kernel I used in the first post of this Topic...

The exact same error happens on Windows 7 64 bit, Win XP 32 bit and Linux 64 bit (no other Operating Systems tested).

0 Likes

Just tried the new 10.12 Catalyst and the problem now occurs even where everything was fine with 10.11. Now any number of globalthreads bigger than 1048576*4 seems to fail when local memory is used...

Windows 7, 64 bit, Radeon 4850

Edit:

Now something really strange: For testing, i inserted a Barrier between block0[localIndex]  = input[globalIndex] and output[globalIndex] =  block0[localIndex], this actually caused the problems with the 10.12 driver, without the Barrier, it works as it should, with the Barrier, it was worse than in 10.11.

So of course, this simple testcase doesnt need a barrier, but still, why can a barrier cause corrupt data?

0 Likes

schrotti007,
Thanks for reporting this issue, it will be fixed in the next SDK update.
0 Likes