Transpose kernel fails

Discussion created by galmok on Mar 25, 2011
Latest reply on May 13, 2011 by himanshu.gautam
(works fine on CPU and nVidia)

I have a transpose kernel taken directly from cuda and converted to OpenCL and while the converted kernel works perfectly on the CPU and on nVidia's OpenCL implementation, it fails on AMD OpenCL.

As the kernel is a direct conversion, no optimisations or anything else have been done with it. The local work group size is set to 32,8,0, i.e. a workgroupsize of 256. Checking the workgroupsize with clGetKernelWorkGroupInfo lets the kernel pass and I assume this means it should run without error caused by the chosen work group size. Is that assumption wrong?

I know AMD cards prefer a work group size of 64...

The kernel arguments are these: dst = destination array, ldd = destination stride, src = source array, lds = source stride.

My own way to setup the dimensions is like this:

    ccSetWorkDim(&kernel, 2);
    ccSetLocalWorkSize(&kernel, 32, 8, 1);
    ccSetGlobalWorkSize(&kernel, ((rows+31)/32)*32, ((cols+31)/32)*8, 1);

And I call the kernel like this:

ccSetKernelArgs(kernel, 4, "cl_mem", &subDST, "cl_uint", dst.lda, "cl_mem", &subSRC, "cl_uint", src.lda);

Funny thing is, I have an inplace transpose using the same workgroup size and it works just fine.

kernel void transpose( global double *dst, int ldd, global double *src, int lds ) { int2 blockIdx = { get_group_id(0), get_group_id(1) }; int2 threadIdx = { get_local_id(0), get_local_id(1) }; src += blockIdx.x*32 + threadIdx.x + ( blockIdx.y*32 + threadIdx.y ) * lds; dst += blockIdx.y*32 + threadIdx.x + ( blockIdx.x*32 + threadIdx.y ) * ldd; local double a[32][33]; // // load 32x32 block // for( int i = 0; i < 32; i += BLOCK_SIZE ) a[i+threadIdx.y][threadIdx.x] = src[i*lds]; barrier(CLK_LOCAL_MEM_FENCE); // // store transposed block // for( int i = 0; i < 32; i += BLOCK_SIZE ) dst[i*ldd] = a[threadIdx.x][i+threadIdx.y]; }