16 Replies Latest reply on May 13, 2011 5:37 PM by himanshu.gautam

    Transpose kernel fails

    galmok
      (works fine on CPU and nVidia)

      I have a transpose kernel taken directly from cuda and converted to OpenCL and while the converted kernel works perfectly on the CPU and on nVidia's OpenCL implementation, it fails on AMD OpenCL.

      As the kernel is a direct conversion, no optimisations or anything else have been done with it. The local work group size is set to 32,8,0, i.e. a workgroupsize of 256. Checking the workgroupsize with clGetKernelWorkGroupInfo lets the kernel pass and I assume this means it should run without error caused by the chosen work group size. Is that assumption wrong?

      I know AMD cards prefer a work group size of 64...

      The kernel arguments are these: dst = destination array, ldd = destination stride, src = source array, lds = source stride.

      My own way to setup the dimensions is like this:

          ccSetWorkDim(&kernel, 2);
          ccSetLocalWorkSize(&kernel, 32, 8, 1);
          ccSetGlobalWorkSize(&kernel, ((rows+31)/32)*32, ((cols+31)/32)*8, 1);

      And I call the kernel like this:

      ccSetKernelArgs(kernel, 4, "cl_mem", &subDST, "cl_uint", dst.lda, "cl_mem", &subSRC, "cl_uint", src.lda);

      Funny thing is, I have an inplace transpose using the same workgroup size and it works just fine.

      kernel void transpose( global double *dst, int ldd, global double *src, int lds ) { int2 blockIdx = { get_group_id(0), get_group_id(1) }; int2 threadIdx = { get_local_id(0), get_local_id(1) }; src += blockIdx.x*32 + threadIdx.x + ( blockIdx.y*32 + threadIdx.y ) * lds; dst += blockIdx.y*32 + threadIdx.x + ( blockIdx.x*32 + threadIdx.y ) * ldd; local double a[32][33]; // // load 32x32 block // for( int i = 0; i < 32; i += BLOCK_SIZE ) a[i+threadIdx.y][threadIdx.x] = src[i*lds]; barrier(CLK_LOCAL_MEM_FENCE); // // store transposed block // for( int i = 0; i < 32; i += BLOCK_SIZE ) dst[i*ldd] = a[threadIdx.x][i+threadIdx.y]; }

        • Transpose kernel fails
          MicahVillmow
          What device are you executing this on? Also you are not specifying that you are enabling 64bit. Are you checking all of your error messages?
          • Transpose kernel fails
            galmok

            I am sorry. I forgot to provide full information:

            The kernel runs on an ATI 5870 and the listed kernel is only 1 of many kernels in the file. Double is enabled:

            #ifdef cl_amd_fp64
            #pragma OPENCL EXTENSION cl_amd_fp64: enable
            #endif

            Many of the other kernels in the file work as intended (haven't tested all on AMD but all kernels work on nVidia). The non-working kernel is the first defined in the file.

            #define BLOCK_SIZE 8

            I have tried allocating the local memory from the host using clSetKernelArgs but it changed nothing. Adding 0.0 to the last assignment seems to change the output:

            dst[i*ldd] = a[threadIdx.x][i+threadIdx.y] + 0.0; // 0.0 changes output

              • Transpose kernel fails
                galmok

                Any news on this issue?

                  • Transpose kernel fails
                    himanshu.gautam

                    Can you also post the host code.

                    Do you mean the kernel fails correctness when adding 0.0 to the output, or does it fail the correctness in the kernel code you posted also.

                      • Transpose kernel fails
                        galmok

                        The kernels fails correctness regardless of adding 0.0, but adding 0.0 changes the output (which it should not).

                        Regarding the host code, I'll have to create something from scratch as the whole project is over 2400 lines of code, most of it not relevant to this problem (and I would prefer not to upload it here). I'll post the new host code soon.

                          • Transpose kernel fails
                            galmok

                            Here is the host code and the kernel code below. It is assembled from many different files and as such looks a mess, but it is working perfectly on nVidia and on AMD platform using CPU, but not AMD GPU.

                            #define _CRT_SECURE_NO_WARNINGS 1 #include <stdio.h> #include <stdlib.h> #include <math.h> #include <float.h> #include <CL/cl.h> #include <Windows.h> #define KERNEL_TRANSPOSE 0 #define NUM_KERNELS 1 #define TEST_OK 0 #define TEST_FAIL 1 typedef struct ccKernel_struct { cl_kernel kernel; unsigned int work_dim; unsigned int global_work_size[3]; unsigned int local_work_size[3]; } ccKernel; #define inline __inline static cl_uint num_devices; static cl_device_id *devices; static cl_context context; static cl_program program; ccKernel cckernel[NUM_KERNELS]; cl_command_queue queue; // must match defines in cl file #define BLOCK_SIZE 8 void checkErr(cl_int err, const char * name, ...) { va_list argptr; va_start(argptr,name); if (err != CL_SUCCESS) { fprintf(stderr, "ERROR %i: ", err); vfprintf(stderr, name, argptr); exit(EXIT_FAILURE); } va_end(argptr); } struct p2_t { double *A; int lda; cl_mem mem; int row_offset; int col_offset; int size; // only used for clCreateSubBuffer p2_t() {} p2_t( double *_A, int _lda ) : A(_A),lda(_lda),mem(NULL),row_offset(0),col_offset(0),size(0) {} p2_t( double *_A, int _lda, cl_mem _mem, int _i, int _j, int _size ) : A(_A),lda(_lda), mem(_mem),row_offset(_i),col_offset(_j),size(_size) {} p2_t operator() (int i, int j) { return p2_t(A+i+j*lda,lda,mem,row_offset+i,col_offset+j,size); }; double &at (int i, int j) { return A[i+j*lda]; }; // CPU ONLY! }; p2_t gpu_malloc2D(int height, int width, cl_mem_flags flags) { cl_int err; if( height <= 0 || width <= 0 ) return p2_t( NULL, 0 ); p2_t p2(NULL, height); // allocate device memory p2.mem = clCreateBuffer(context, flags, sizeof(cl_double) * height * width, NULL, &err); checkErr(err, "gpu_malloc2D\n"); p2.row_offset = 0; p2.col_offset = 0; p2.size = height*width*sizeof(double); return p2; } p2_t gpu_malloc2D(int height, int width) { return gpu_malloc2D(height, width, CL_MEM_READ_WRITE); } void upload(int m, int n, p2_t dst, p2_t src ) { cl_int err; if( m > 0 && n > 0 ) { const size_t buffer_origin[3] = {dst.row_offset*sizeof(double), dst.col_offset, 0}; const size_t host_origin[3] = {0, 0, 0}; const size_t region[3] = {m*sizeof(double), n, 1}; err = clEnqueueWriteBufferRect(queue, dst.mem, CL_TRUE, buffer_origin, host_origin, region, dst.lda*sizeof(double), 0, src.lda*sizeof(double), 0, src.A, 0, NULL, NULL); checkErr(err, "clEnqueueWriteBufferRect\n"); } } void download( int m, int n, p2_t dst, p2_t src ) { if( m > 0 && n > 0) { cl_int err; const size_t buffer_origin[3] = {src.row_offset*sizeof(double), src.col_offset, 0}; const size_t host_origin[3] = {0, 0, 0}; const size_t region[3] = {m*sizeof(double), n, 1}; err = clEnqueueReadBufferRect(queue, src.mem, CL_TRUE, buffer_origin, host_origin, region, src.lda*sizeof(double), 0, dst.lda*sizeof(double), 0, dst.A, 0, NULL, NULL); checkErr(err, "clEnqueueReadBufferRect\n"); } } void ccSetWorkDim(ccKernel *kernel, int dim) { kernel->work_dim = dim; } void ccSetLocalWorkSize(ccKernel *kernel, int dim1, int dim2, int dim3) { kernel->local_work_size[0] = dim1; kernel->local_work_size[1] = dim2; kernel->local_work_size[2] = dim3; } void ccSetGlobalWorkSize(ccKernel *kernel, int dim1, int dim2, int dim3) { kernel->global_work_size[0] = dim1; kernel->global_work_size[1] = dim2; kernel->global_work_size[2] = dim3; } int ccCheckWorkGroupSize(ccKernel kernel) { size_t maxWorkGroupSize; cl_int err; err = clGetKernelWorkGroupInfo(kernel.kernel, devices[0], CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &maxWorkGroupSize, NULL); checkErr(err, "clGetKernelWorkGroupInfo"); if (kernel.local_work_size[0]*kernel.local_work_size[1]*kernel.local_work_size[2]>maxWorkGroupSize) { fprintf(stderr,"prod(local_work_size[*])>maxWorkGroupSize (%i>%i)\n", kernel.local_work_size[0]*kernel.local_work_size[1]*kernel.local_work_size[2], maxWorkGroupSize); return CL_INVALID_WORK_DIMENSION; } return CL_SUCCESS; } cl_mem createSubBuffer(p2_t p2, cl_mem_flags flags) { cl_mem sub; cl_int err; cl_buffer_region region; region.origin = (p2.row_offset+p2.col_offset*p2.lda)*sizeof(double); region.size = p2.size - region.origin; sub = clCreateSubBuffer(p2.mem, flags, CL_BUFFER_CREATE_TYPE_REGION, &region, &err); checkErr(err, "clCreateSubBuffer"); return sub; } cl_int ccEnqueueNDRangeKernel(cl_command_queue command_queue, ccKernel kernel, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event) { return clEnqueueNDRangeKernel(command_queue, kernel.kernel, kernel.work_dim, NULL, kernel.global_work_size, kernel.local_work_size, num_events_in_wait_list, event_wait_list, event); } void gpu_transpose(int rows, int cols, p2_t dst, p2_t src) { cl_int err; cl_event event; cl_mem subDST, subSRC; if( rows <= 0 || cols <= 0 ) return; ccKernel kernel = cckernel[KERNEL_TRANSPOSE]; // set dimensions ccSetWorkDim(&kernel, 2); ccSetLocalWorkSize(&kernel, 32, 8, 1); ccSetGlobalWorkSize(&kernel, ((rows+31)/32)*32, ((cols+31)/32)*8, 1); err = ccCheckWorkGroupSize(kernel); checkErr(err, "work group too big"); // create pointer to subbuffer subSRC = createSubBuffer(src, CL_MEM_READ_ONLY); subDST = createSubBuffer(dst, CL_MEM_READ_WRITE); // set kernel arguments clSetKernelArg(kernel.kernel, 0, sizeof(cl_mem), &subDST); clSetKernelArg(kernel.kernel, 1, sizeof(cl_uint), &dst.lda); clSetKernelArg(kernel.kernel, 2, sizeof(cl_mem), &subSRC); clSetKernelArg(kernel.kernel, 3, sizeof(cl_uint), &src.lda); // queue the kernel err = ccEnqueueNDRangeKernel(queue, kernel, 0, NULL, &event); checkErr(err, "clEnqueueNDRangeKernel"); clWaitForEvents(1, &event); checkErr(err, "clWaitForEvents"); // cleanup err = clReleaseMemObject(subSRC); checkErr(err, "clReleaseMemObject\n"); err = clReleaseMemObject(subDST); checkErr(err, "clReleaseMemObject\n"); } void gpu_free(p2_t matrix) { cl_int err; err = clReleaseMemObject(matrix.mem); checkErr(err, "gpu_free\n"); } int validate(int rows, int cols, p2_t A, p2_t B, double maxabsdiff) { double mindiff,maxdiff; if (A.lda != B.lda) return TEST_FAIL; mindiff = maxdiff = A.A[0]-B.A[0]; for(int i=0; i<rows; i++) for(int j=0; j<cols; j++) { if (_isnan(A.A[i+j*A.lda]) || _isnan(B.A[i+j*B.lda])) { fprintf(stderr, "NaN!!!\n"); exit(0); } double diff = A.A[i+j*A.lda] - B.A[i+j*B.lda]; if (diff < mindiff) mindiff = diff; if (diff > maxdiff) maxdiff = diff; } fprintf(stderr,"\tmin error: %e\n\tmax error: %e\n", mindiff, maxdiff); if (maxdiff > maxabsdiff || abs(mindiff) > maxabsdiff) return TEST_FAIL; return TEST_OK; } int test_gpu_transpose(int rows, int cols) { int err; // allocate cpu buffers double *Acpubuffer = (double*)malloc(rows*cols*sizeof(double)); double *Bcpubuffer = (double*)malloc(rows*cols*sizeof(double)); p2_t Acpu(Acpubuffer, rows); p2_t Bcpu(Bcpubuffer, rows); // allocate gpu buffer p2_t Agpu = gpu_malloc2D(rows,cols); p2_t Bgpu = gpu_malloc2D(rows,cols); // initialise cpu buffer for(int i=0; i<rows; i++) for(int j=0; j<cols; j++) { Acpu.at(i,j) = (double)(i); // initial Bcpu.at(j,i) = (double)(i); // must match this } // upload/transpose/download of whole matrix upload(rows,cols,Agpu,Acpu); gpu_transpose(rows, cols, Bgpu, Agpu); download(rows,cols,Acpu,Bgpu); // success? if ((err=validate(rows, cols, Acpu, Bcpu, 0.0))!=TEST_OK) { fprintf(stderr,"transpose failed!\n"); } else { fprintf(stderr,"transpose success\n"); } // free resources gpu_free(Agpu); gpu_free(Bgpu); free(Acpubuffer); free(Bcpubuffer); return err; } void ccInitialiseOpenCL(cl_device_type device_type) { cl_int err; cl_uint num_platforms; cl_platform_id *platforms; cl_context_properties cprops[3]; size_t contextInfoSize; cl_device_id *contextInfo; err = clGetPlatformIDs(0, NULL, &num_platforms); checkErr(err, "clGetPlatformIDs:getSize"); platforms = (cl_platform_id*)malloc(num_platforms*sizeof(cl_platform_id)); err = clGetPlatformIDs(num_platforms, platforms, NULL); checkErr(err, "clGetPlatformIDs:getList"); // use only the first platform (platforms[0]) err = clGetDeviceIDs(platforms[0], device_type /*CL_DEVICE_TYPE_GPU*/, 0, NULL, &num_devices); // CL_DEVICE_TYPE_CPU CL_DEVICE_TYPE_GPU checkErr(err, "clGetDeviceIDs:getSize"); devices = (cl_device_id*)malloc(num_devices*sizeof(cl_device_id)); err = clGetDeviceIDs(platforms[0], device_type /*CL_DEVICE_TYPE_GPU*/, num_devices, devices, NULL); checkErr(err, "clGetDeviceIDs:getList"); // Run only on the first device (devices[0] with num_devices = 1) num_devices = 1; cprops[0] = CL_CONTEXT_PLATFORM; cprops[1] = (cl_context_properties)platforms[0]; cprops[2] = 0; context = clCreateContext(cprops, num_devices, devices, NULL, NULL, &err); checkErr(err, "clCreateContext"); err = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &contextInfoSize); checkErr(err, "clGetContextInfo:getSize"); contextInfo = (cl_device_id*)malloc(contextInfoSize); err = clGetContextInfo(context, CL_CONTEXT_DEVICES, contextInfoSize, contextInfo, NULL); checkErr(err, "clGetContextInfo:getSize"); free(platforms); free(contextInfo); } int ae_load_file_to_memory(const char *filename, char **result) { int size = 0; FILE *f = fopen(filename, "rb"); if (f == NULL) { *result = NULL; return -1; // -1 means file opening fail } fseek(f, 0, SEEK_END); size = ftell(f); fseek(f, 0, SEEK_SET); *result = (char *)malloc(size+1); if (size != fread(*result, sizeof(char), size, f)) { free(*result); return -2; // -2 means file reading fail } fclose(f); (*result)[size] = 0; return size; } void ccCreateProgram(const char *prog, const size_t prog_size, cl_int *err) { program = clCreateProgramWithSource(context, 1, &prog, &prog_size, err); checkErr(*err, "clCreateProgramWithSource"); *err = clBuildProgram(program, num_devices, devices, "", NULL, NULL); checkErr(*err, "clBuildProgram"); } typedef struct { int nr; char name[25]; } KernelArray; int main(int argc, char *argv[]) { cl_int err; char *prog; int prog_size; KernelArray kernelArray[] = { { KERNEL_TRANSPOSE, "transpose" } }; // setup some reasonable defaults for OpenCL ccInitialiseOpenCL(CL_DEVICE_TYPE_GPU); // load our kernel prog_size = ae_load_file_to_memory("invMatrixLU.cl", &prog); if (prog_size < 0) { fprintf(stderr, "cannot open invMatrixLU.cl (%i)", prog_size); exit(-1); } ccCreateProgram(prog, prog_size, &err); // create kernels for (int i=0; i<NUM_KERNELS; i++) { cckernel[kernelArray[i].nr].kernel = clCreateKernel(program, kernelArray[i].name, &err); checkErr(err, "ccCreateKernel %s failed", kernelArray[i].name); } // create a queue into which we can push kernels queue = clCreateCommandQueue(context, devices[0], NULL /* CL_QUEUE_PROFILING_ENABLE /**/, &err); checkErr(err, "ccCreateCommandQueue"); // test fprintf(stderr,"starting test\n"); test_gpu_transpose(256, 256); } KERNEL.CL: #ifdef cl_khr_fp64 #pragma OPENCL EXTENSION cl_khr_fp64: enable #endif #ifdef cl_amd_fp64 #pragma OPENCL EXTENSION cl_amd_fp64: enable #endif #define BLOCK_SIZE 8 kernel void transpose( global double *dst, int ldd, global double *src, int lds ) { int2 blockIdx = { get_group_id(0), get_group_id(1) }; int2 threadIdx = { get_local_id(0), get_local_id(1) }; src += blockIdx.x*32 + threadIdx.x + ( blockIdx.y*32 + threadIdx.y ) * lds; dst += blockIdx.y*32 + threadIdx.x + ( blockIdx.x*32 + threadIdx.y ) * ldd; local double a[32][33]; // // load 32x32 block // for( int i = 0; i < 32; i += BLOCK_SIZE ) a[i+threadIdx.y][threadIdx.x] = src[i*lds]; barrier(CLK_LOCAL_MEM_FENCE); // // store transposed block // for( int i = 0; i < 32; i += BLOCK_SIZE ) dst[i*ldd] = a[threadIdx.x][i+threadIdx.y]; }

                              • Transpose kernel fails
                                himanshu.gautam

                                galmok,

                                I tried your kernel on my juniper. I had to convert it to a float application(No DP Support on Juniper). The kernel gives minerrror and maxerror both as zero for both GPU & CPU. So I think this has been fixed in the internal SDK . 

                                Still I will check that on a DP supported device when I get a chance.

                                Thanks for reporting this.