The kernel code is very large, so I will try to demonstrate the problem with a simple example, given below:
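// copies len bytes from global memory into private memory
void d_memcpy1(uchar *dst, __global uchar *src, uint len)
{
    uint i;
    for(i = 0; i < len; i++)
        dst[i] = src[i];
}
// copies len bytes from private memory back to global memory
void d_memcpy2(__global uchar *dst, uchar *src, uint len)
{
    uint i;
    for(i = 0; i < len; i++)
        dst[i] = src[i];
}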
__kernel void demoKernel(__global uchar *d_io_2d,
                         __constant uint *d_common_input,
                         int d_maxSize)
{
    long index = get_global_id(0);
    if(index >= d_maxSize)
        return;
    uchar msg0[16], msg1[16];
    //reading 32-byte data from global memory
    d_memcpy1(msg0, &d_io_2d[(index*16)*2], 16); //reading first 16-byte consecutive data
    d_memcpy1(msg1, &d_io_2d[((index*16)*2) + 16], 16); //reading next 16-byte consecutive data
    //doing some stuff on these 32-byte data
    //writing back 32-byte data to same global memory
    d_memcpy2(&d_io_2d[(index*16)*2], msg0, 16); //writing first 16-byte consecutive data
    d_memcpy2(&d_io_2d[((index*16)*2) + 16], msg1, 16); //writing next 16-byte consecutive data
}
I am using a machine with the following details:
GPU : Tahiti
Catalyst Driver : 9.0.2
APP SDK version : OpenCL 1.2
OS : Ubuntu
bitness: 64-bit
I hope this program is helpful.
When I changed the compiler options from "-g -O0" to NULL, the result for the first 16 bytes was wrong, while the result for the next 16 bytes was correct. When I set the compiler options back to "-g -O0", I got correct results for both the first and the next 16 bytes.
Note: On an Nvidia Tesla K20m card, the same program works correctly.
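
In case it helps with reproducing this, a host-side check along the following lines could report which 16-byte half of a work-item's data differs from the expected output. This is only a sketch: compare_item, HALF and ITEM are names made up for illustration and are not part of my actual code, and it assumes the expected bytes have already been computed on the host by some other means.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HALF 16            /* bytes per half, as in the kernel above */
#define ITEM (2 * HALF)    /* 32 bytes handled by each work-item */

/* Hypothetical helper (not from my real code).
   Returns a bitmask for work-item `index`:
   bit 0 is set if the first 16 bytes differ,
   bit 1 is set if the next 16 bytes differ. */
static int compare_item(const unsigned char *expected,
                        const unsigned char *actual,
                        size_t index)
{
    assert(expected != NULL && actual != NULL);

    /* same byte offset as (index*16)*2 in the kernel */
    size_t base = index * ITEM;
    int mask = 0;

    if (memcmp(expected + base, actual + base, HALF) != 0)
        mask |= 1;
    if (memcmp(expected + base + HALF, actual + base + HALF, HALF) != 0)
        mask |= 2;
    return mask;
}
```

Calling compare_item for every index below d_maxSize on the buffer read back from the device would give a per-work-item picture of the failure. A mask of 1 corresponds to the case I described above (first 16 bytes wrong, next 16 bytes correct).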