Hi Dipak,
Sorry for the late reply. I am trying to reproduce the problem in a small sample program that I can post here, but it is not easy to reproduce. In the meantime, here is the snippet from the actual program that produces this result:
testout[thread_id].x = 100;
testout[thread_id].y = testin[thread_id].timestamp;
global_memcpy(testout[thread_id].buf3, b3, 16);
global_memcpy(testout[thread_id].buf2, b2, 16);
global_memcpy(testout[thread_id].buf1, b1, 8);
global_memcpy(testout[thread_id].buf4, b4, 32);
where global_memcpy is defined as follows:
void global_memcpy(__global u8 *dest, u8 *src, size_t n)
{
    // Copy n bytes from the private buffer src[] to the global buffer dest[]
    for (size_t i = 0; i < n; i++)
        dest[i] = src[i];
}
Both testin and testout are allocated with CL_MEM_SVM_FINE_GRAIN_BUFFER.
b1, b2, b3 and b4 are private buffers, created and filled per thread.
Even if my computation in the code above is wrong, x should always be 100 when the CPU reads the output. But that is not the case: it sometimes prints 0. I am using a persistent kernel, which I terminate after the CPU has read the output for all the input data. Only 512 inputs are processed at a time.
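One thing I am not sure about is whether the CPU is allowed to see these writes while the kernel is still running. As far as I understand, fine-grained buffer SVM without SVM atomics only guarantees visibility at synchronization points such as kernel completion, and a persistent kernel does not reach one until it is terminated. The sketch below is plain C11, not my actual OpenCL code, and only illustrates the handshake I suspect is needed: write the payload, then publish it with a release store that the reader pairs with an acquire load. The names result_slot, ready, publish_result and try_consume are made up for this illustration. In OpenCL 2.0 the corresponding store would presumably also need memory_scope_all_svm_devices and a buffer allocated with CL_MEM_SVM_ATOMICS, which I have not verified on this platform.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one testout element.
 * The `ready` flag is an added field, not part of the original struct. */
typedef struct {
    uint32_t    x;
    uint64_t    y;
    atomic_uint ready;   /* 0 = not published yet, 1 = payload complete */
} result_slot;

/* Producer side (the role the kernel plays):
 * write the payload first, then publish it with a release store. */
static void publish_result(result_slot *slot, uint32_t x, uint64_t y)
{
    slot->x = x;
    slot->y = y;
    atomic_store_explicit(&slot->ready, 1u, memory_order_release);
}

/* Consumer side (the role the CPU plays):
 * read the payload only after an acquire load has observed ready == 1. */
static bool try_consume(result_slot *slot, uint32_t *x, uint64_t *y)
{
    if (atomic_load_explicit(&slot->ready, memory_order_acquire) == 0u)
        return false;    /* nothing published yet, payload must not be read */

    *x = slot->x;
    *y = slot->y;

    /* Mark the slot as free so it can be reused for the next input. */
    atomic_store_explicit(&slot->ready, 0u, memory_order_release);
    return true;
}
```

With this pattern the reader never looks at x before the flag says the slot is complete, so a 0 read from x would indicate a real bug instead of a not-yet-visible write.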
Platform: AMD A12-9800 RADEON R7, 12 COMPUTE CORES 4C+8G
OS: Ubuntu 14.04.1
Linux kernel: 3.13.0-133-generic (64 bit)
Driver: fglrx
OpenCL version: 2.0 (AMDAPPSDK-3.0)