14 Replies Latest reply on Feb 16, 2011 6:11 PM by MicahVillmow

    Strange Error with Local Memory on RV770

    schrotti007
      When I allocate more local Memory than actually needed, my kernel stops working

      Hello

      I Discovered (maybe) a quite strange bug. I have a simple Kernel that does nothing more than copying the input data into a local buffer and copying from that buffer into the output buffer.

      On the host, I must allocate at least 4*localworksize, because i copy 4 bytes at once. But if I allocate more, lets say 16*localworksize, the kernel does not copy the data correctly. But I can find out only by comparing the input data with the output data, the returnvalue of clEnqueueNDRangeKernel is always CL_SUCCESS.

      Is it just a bug in the local Memory of the RV770 (I use a 4850), or a does the local memory just not work this way? (I didnt even access the local memory from other threads, so of course the use of local memory is pretty useless here, its just a simple example to spot the problem)

      Almost forgot:

      Using Stream Sdk 2.2, Catalyst 10.10, Radeon 4850 with 512Mb and Windows 7 64 bit.

      //HelloCL_Kernels.cl __kernel void blocktest(__global uchar4 * output , __global uchar4 * input , __local uchar4 * block0 ) { unsigned int blockIdx = get_group_id(0); unsigned int localIndex = get_local_id(0); unsigned int globalIndex = blockIdx * get_local_size(0) + localIndex; block0[localIndex] = input[globalIndex]; output[globalIndex] = block0[localIndex]; } //HelloCL.cpp #include <stdio.h> #include <CL/cl.h> #include <iostream> #define CHECKSTATUS if (status != CL_SUCCESS) {cout << "error\n";return 1;} using namespace std; int main() { cl_uint numPlatforms; cl_platform_id platform = NULL; cl_int status; size_t size = 1048576*64; cl_int localthreads = 64; status = clGetPlatformIDs(0, NULL, &numPlatforms); CHECKSTATUS cout << numPlatforms << " OpenCl Platforms found:\n"; cl_platform_id* platforms = new cl_platform_id[numPlatforms]; status = clGetPlatformIDs(numPlatforms, platforms, NULL); CHECKSTATUS if (numPlatforms!=1) { unsigned int pf; cout << "Select Platform: "; cin >> pf; if (pf<numPlatforms) platform = platforms[pf]; else {cout << "Falsche Antwort :(";return 1;} } else platform = platforms[0]; cl_device_id deviceId; status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &deviceId, 0); CHECKSTATUS cl_context context; cl_command_queue commandQueue; context = clCreateContext(NULL, 1, &deviceId, NULL, NULL, &status); CHECKSTATUS commandQueue = clCreateCommandQueue(context, deviceId, NULL, &status); CHECKSTATUS FILE* f = fopen("d:\\oo.exe", "rb"); if (!f) { cout << "File not found\n"; return 1; } fseek(f, 0, SEEK_END); size_t filesize = ftell(f); fseek(f, 0, SEEK_SET); if (filesize < size) { cout << "File 2 small\n"; return 1; } char* indata = (char*)malloc(size); fread(indata, 1, size, f); fclose(f); char* outdata = (char*)malloc(size); f = fopen("HelloCL_Kernels.cl", "rb"); if (!f) { cout << "File not found\n"; return 1; } fseek(f, 0, SEEK_END); filesize = ftell(f); fseek(f, 0, SEEK_SET); char* kernelString = (char*)malloc(filesize+1); fread(kernelString, 1, filesize, f); fclose(f); kernelString[filesize]=0; cl_program program; cl_kernel kernel; program = clCreateProgramWithSource(context, 1,(const char**) &kernelString, &filesize, &status); CHECKSTATUS status = clBuildProgram(program, 1, &deviceId, NULL, NULL, NULL); if (status != CL_SUCCESS) { cout << "Cant compile Program\n";//, status is "<<status<< StrCLError(status) << "\n"; char* programLog = (char*) malloc(1024000); // evtl ein bischen übertrieben, aber 10000 is definitiv zuwenig! size_t muhh = 1337; status = clGetProgramBuildInfo(program, deviceId, CL_PROGRAM_BUILD_LOG, 1024000, programLog, &muhh); if (status != CL_SUCCESS) cout << "Cant get extended infos\n"; cout << muhh << "\n"; programLog[muhh]=0; cout << programLog; return 1; } size_t globalThreads[1]= {size/4}; // every globalthread should copy 4 bytes size_t localThreads[1] = {4}; cl_event evt; kernel = clCreateKernel(program, "blocktest", &status); CHECKSTATUS cl_command_queue commandqueue = clCreateCommandQueue(context, deviceId, NULL, &status); CHECKSTATUS cl_mem indata_device = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &status); CHECKSTATUS status = clEnqueueWriteBuffer(commandqueue, indata_device, CL_TRUE, 0, size, indata, 0, NULL, NULL); CHECKSTATUS cl_mem outdata_device = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &status); CHECKSTATUS status = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&outdata_device); CHECKSTATUS status = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&indata_device); CHECKSTATUS status = clSetKernelArg(kernel, 2, localthreads*64, NULL); CHECKSTATUS status = clEnqueueNDRangeKernel(commandqueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &evt); CHECKSTATUS status = clWaitForEvents(1, &evt); CHECKSTATUS status = clEnqueueReadBuffer(commandqueue, outdata_device, CL_TRUE, 0, size, outdata, 0, NULL, NULL); CHECKSTATUS cout << "Compare inbuffer to outbuffer\n"; for (unsigned int i=0;i<size;i++) { if (indata[i] != outdata[i]) { cout << "Data mismatch at " << i << "\n"; } } cout << "Done!\n"; cin >> status; }

        • Strange Error with Local Memory on RV770
          nou

          you do not enable byte adresable extension in the kernel. and IIRC it is not supported on R7xx GPU.

          • Strange Error with Local Memory on RV770
            MicahVillmow
            Byte addressable is only for writes to memory of less than 32 bits. Otherwise you can play with any data type you want.

            p.s. I'll be looking into this later today to see if I can spot the problem.
              • Strange Error with Local Memory on RV770
                schrotti007

                Found an error in my code.

                Line 89 should be

                size_t localThreads[1] = {localthreads};

                not

                size_t localThreads[1] = {4};

                as it was before. The error occours then when in line 105, the given size of the localmemory is localthreads*256, it works if it is localthreads*128. Tried different sizes for localthreads...

                 

                  • Strange Error with Local Memory on RV770
                    DTop

                    However, I do not see how it could affect the execution. Local thread number = {4} means you have only 4 thread in workgroup. That's all. And will have more workgroups to run.

                    In detail, to get a more clear question:

                    (localthreads = 64) x 256  = 16k LDS you want at most, what is ok for 4850.  In case of (localthreads=4) * 256 = 1K (LDS size) you have only 4 threads running per group.  Does it mean that ocl launches few work groups on single simd (in this case 16 worgroups) so addressing in kernel just go wrong and concurrent workgroups just screw up each other?

                    (I also puzzled that even if localthreads=4, how come (4x256)=1K LDS fails, while 4x128 = 512bytes is ok. ???).

                      • Strange Error with Local Memory on RV770
                        docbrown

                        I've played around with schrotti007's sample code on my RV730 (had to adapt it a bit).

                        I used 16K local buffers and 32 thread per group.

                        Everything works fine with 16M global buffers but the copying fails with 32M global buffers. If I reduce the local buffers to 8K then 32M global buffers work.

                        #include <stdio.h> #include <CL/cl.h> #include <iostream> using namespace std; void checkStatus(int status, const char* func) { if(status != CL_SUCCESS) { cout << "Unsuccessfull call to " << func << "(): " << status << endl; exit(status); } } int main(int argc, char* argv[]) { cl_uint numPlatforms; cl_platform_id platform = NULL; cl_int status; size_t size = 32 * 1024 * 1024; cl_int localthreads = 32; status = clGetPlatformIDs(0, NULL, &numPlatforms); checkStatus(status, "clGetPlatformIDs"); cout << numPlatforms << " OpenCl Platform(s) found" << endl; cl_platform_id* platforms = new cl_platform_id[numPlatforms]; status = clGetPlatformIDs(numPlatforms, platforms, NULL); checkStatus(status, "clGetPlatformIDs"); if (numPlatforms!=1) { unsigned int pf; cout << "Select Platform: "; cin >> pf; if (pf<numPlatforms) { platform = platforms[pf]; } else { cout << "Falsche Antwort :("; return 1; } } else { platform = platforms[0]; } cl_device_id deviceId; status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &deviceId, 0); checkStatus(status, "clGetDeviceIDs"); cl_context context; cl_command_queue commandQueue; context = clCreateContext(NULL, 1, &deviceId, NULL, NULL, &status); checkStatus(status, "clCreateContext"); commandQueue = clCreateCommandQueue(context, deviceId, NULL, &status); checkStatus(status, "clCreateCommandQueue"); FILE* f = fopen("HelloCL_Kernels.cl", "rb"); if(!f) { cout << "File not found or cannot be opened!" << endl; return 1; } fseek(f, 0, SEEK_END); size_t filesize = ftell(f); fseek(f, 0, SEEK_SET); char* kernelString = (char*)malloc(filesize+1); fread(kernelString, 1, filesize, f); fclose(f); kernelString[filesize] = 0; cl_program program; cl_kernel kernel; program = clCreateProgramWithSource(context, 1, (const char**)&kernelString, &filesize, &status); checkStatus(status, "clCreateProgramWithSource"); status = clBuildProgram(program, 1, &deviceId, NULL, NULL, NULL); if (status != CL_SUCCESS) { cout << "Cannot compile kernel: " << status << endl; // << StrCLError(status) << endl; /*char* programLog = (char*) malloc(1024000); // evtl ein bischen übertrieben, aber 10000 is definitiv zuwenig! size_t muhh = 1337; status = clGetProgramBuildInfo(program, deviceId, CL_PROGRAM_BUILD_LOG, 1024000, programLog, &muhh); if (status != CL_SUCCESS) cout << "Cant get extended infos\n"; cout << muhh << "\n"; programLog[muhh]=0; cout << programLog;*/ return status; } char* indata = new char[size]; char* outdata = new char[size]; for(int n = 0; n < size; n++) { indata[n] = rand() % 255; outdata[n] = 0; } size_t globalThreads[1] = {size/4}; // every globalthread should copy 4 bytes size_t localThreads[1] = {localthreads}; cl_event evt; kernel = clCreateKernel(program, "blocktest", &status); checkStatus(status, "clCreateKernel"); cl_command_queue commandqueue = clCreateCommandQueue(context, deviceId, NULL, &status); checkStatus(status, "clCreateCommandQueue"); cl_mem indata_device = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &status); checkStatus(status, "clCreateBuffer"); status = clEnqueueWriteBuffer(commandqueue, indata_device, CL_TRUE, 0, size, indata, 0, NULL, NULL); checkStatus(status, "clEnqueueWriteBuffer"); cl_mem outdata_device = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &status); checkStatus(status, "clCreateBuffer"); status = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&outdata_device); checkStatus(status, "clSetKernelArg"); status = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&indata_device); checkStatus(status, "clSetKernelArg"); status = clSetKernelArg(kernel, 2, 256*localthreads, NULL); checkStatus(status, "clSetKernelArg"); status = clEnqueueNDRangeKernel(commandqueue, kernel, 1, // 1-dimensional NULL, // reserved globalThreads, localThreads, // Work group size 0, NULL, &evt); checkStatus(status, "clEnqueueNDRangeKernel"); status = clWaitForEvents(1, &evt); checkStatus(status, "clWaitForEvents"); status = clEnqueueReadBuffer(commandqueue, outdata_device, CL_TRUE, 0, size, outdata, 0, NULL, NULL); checkStatus(status, "clEnqueueReadBuffer"); cout << "Compare inbuffer to outbuffer\n"; for (unsigned int i = 0; i < size; i++) { if (indata[i] != outdata[i]) { cout << "Data mismatch at " << i << endl; break; } } cout << "Done!\n"; }

                          • Strange Error with Local Memory on RV770
                            himanshu.gautam

                            doc brown,

                            I think we can only use a part of global memory to emulate local memory in global mamory.

                            What does the clInfo return?(spaecially Local Memory size)

                              • Strange Error with Local Memory on RV770
                                docbrown

                                Here is the output of clinfo on my system.

                                Device Type: CL_DEVICE_TYPE_GPU Device ID: 4098 Max compute units: 8 Max work items dimensions: 3 Max work items[0]: 128 Max work items[1]: 128 Max work items[2]: 128 Max work group size: 128 Preferred vector width char: 16 Preferred vector width short: 8 Preferred vector width int: 4 Preferred vector width long: 2 Preferred vector width float: 4 Preferred vector width double: 0 Max clock frequency: 0Mhz Address bits: 32 Max memory allocation: 134217728 Image support: No Max size of kernel argument: 1024 Alignment (bits) of base address: 32768 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 268435456 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Global Local memory size: 16384 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue properties: Out-of-Order: No Profiling : Yes Platform ID: 0x7f5872361b20 Name: ATI RV730 Vendor: Advanced Micro Devices, Inc. Driver version: CAL 1.4.815 Profile: FULL_PROFILE Version: OpenCL 1.0 ATI-Stream-v2.2 (302) Extensions: cl_khr_icd cl_khr_gl_sharing cl_amd_device_attribute_query

                                  • Strange Error with Local Memory on RV770
                                    himanshu.gautam

                                    docbrown,

                                    Local memory in RV7xx device is emulated in global memory and therefore same global memory is used for local buffers as well as global buffers.

                                    I think you are working almost on the limit. so if one buffer is big other becomes doesn't have enough space.

                                    In which API do you get error and  what is the error code returned?

                                      • Strange Error with Local Memory on RV770
                                        docbrown

                                        himanshu.gautam,

                                        the little test program is running just fine with every tested memory size (within specifications of the GPU), so no error code at all.

                                        But, the data which is copied by the kernel is in some cases invalid. That means copying data from a buffer A to a buffer B leaves a buffer B which does not contain the same data as in A.

                                        I know that the local memory is emulated via global memory, but in my opinion the GPU should not reuse global buffers that are already used for emulating local buffers, should it?

                                        Regards

                                        Edit: I used the ATI Stream SDK v2.2 on Debian Testing with fglrx 10.9.

                                          • Strange Error with Local Memory on RV770
                                            himanshu.gautam

                                            docbrown,

                                            Definitely GPU does not reuse global buffers already used for emulating local buffes.

                                            Can you post the kernel code also. I can try to see if the problem exists with internal driver.

                                            You can also send a testcase at streamdeveloper@amd.com

                                            Also mention the Operating System you are using.

                                              • Strange Error with Local Memory on RV770
                                                schrotti007

                                                He is using the same Kernel I used in the first post of this Topic...

                                                The exact same error happens on Windows 7 64 bit, Win XP 32 bit and Linux 64 bit (no other Operating Systems tested).

                                                  • Strange Error with Local Memory on RV770
                                                    schrotti007

                                                    Just tried the new 10.12 Catalyst and the problem now occurs even where everything was fine with 10.11. Now any number of globalthreads bigger than 1048576*4 seems to fail when local memory is used...

                                                    Windows 7, 64 bit, Radeon 4850

                                                    Edit:

                                                    Now something really strange: For testing, i inserted a Barrier between block0[localIndex]  = input[globalIndex] and output[globalIndex] =  block0[localIndex], this actually caused the problems with the 10.12 driver, without the Barrier, it works as it should, with the Barrier, it was worse than in 10.11.

                                                    So of course, this simple testcase doesnt need a barrier, but still, why can a barrier cause corrupt data?

                                • Strange Error with Local Memory on RV770
                                  MicahVillmow
                                  schrotti007,
                                  Thanks for reporting this issue, it will be fixed in the next SDK update.