I wrote code in which the GPU writes data to an SVM buffer and a CPU thread reads it. However, every time I read it on the CPU, I get a few 0s. If I add a delay of 1 ns, all the results are correct. So I assume it is related to memory consistency, i.e. OpenCL does not guarantee when data becomes visible to the CPU until the kernel terminates. The interesting part is that if I use CL_MEM_SVM_ATOMICS, all the results are correct even without the delay, although I am not using any atomic operations anywhere. Can someone please help me understand why just setting the CL_MEM_SVM_ATOMICS flag changes the results? What exactly happens differently in memory when I use CL_MEM_SVM_ATOMICS without any atomic operations? I could not find an answer anywhere.
Thank you for sharing this interesting observation. I don't have an explanation at the moment. As far as I know, the above behavior is not guaranteed by the OpenCL standard. I've already shared your query with our engineering team and will come back once I have their reply.
Meanwhile, please share a repro and the setup information for the system where you observed this behavior.
Sorry for responding so late. I am trying to recreate it in a sample program that I can post here, but it does not seem easy to reproduce. In the meantime, here is the snippet from the program that actually produces this result:
testout[thread_id].x = 100;
testout[thread_id].y = testin[thread_id].timestamp;
global_memcpy(testout[thread_id].buf3, b3 , 16);
global_memcpy(testout[thread_id].buf2, b2 , 16);
global_memcpy(testout[thread_id].buf1, b1, 8);
global_memcpy(testout[thread_id].buf4, b4, 32);
where global_memcpy is defined as follows:
void global_memcpy(__global u8 *dest, u8 *src, size_t n)
{
    // Copy n bytes from the private buffer src to the global buffer dest
    for (size_t i = 0; i < n; i++)
        dest[i] = src[i];
}
Both testin and testout are allocated with CL_MEM_SVM_FINE_GRAIN_BUFFER.
b1, b2, b3 and b4 are private buffers created and computed per thread.
Even if I assume the computation in the code above is wrong, the value of x should always be 100 in the CPU output. But that is not the case; it sometimes prints 0. I am using a persistent kernel, which I terminate after the CPU has read the output for all the input data. Only 512 data inputs are processed at a time.
Platform: AMD A12-9800 RADEON R7, 12 COMPUTE CORES 4C+8G
OS: Ubuntu 14.04.1
Linux kernel: 3.13.0-133-generic (64-bit)
OpenCL version: 2.0 (AMDAPPSDK-3.0)
Thanks for sharing the above information.
A complete repro would be more helpful in this case. No problem if it takes some time to put together. Once you have it, please share it with us.
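For anyone reading along: this is not an explanation of the CL_MEM_SVM_ATOMICS effect reported above. It is only a rough sketch of the release/acquire handoff pattern that the question's memory-consistency guess points toward. It is a CPU-only analogy in plain C11, where a pthread stands in for the persistent kernel, so it makes no claim about what the AMD runtime does. The names record_t, ready and run_handoff are made up for illustration and do not come from the original program. In an OpenCL 2.0 kernel, the corresponding store would be an atomic_store_explicit call with memory_order_release and memory_scope_all_svm_devices on a flag that lives in a buffer allocated with CL_MEM_SVM_ATOMICS.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_RECORDS 512  /* matches the batch size mentioned in the thread */

/* Hypothetical record layout: payload plus a per-record "ready" flag. */
typedef struct {
    int x;
    int y;
    atomic_int ready;
} record_t;

static record_t records[MAX_RECORDS];
static int record_count;

/* Producer: stands in for the work-items of the persistent kernel. */
static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < record_count; i++) {
        records[i].x = 100;  /* plain (non-atomic) payload writes */
        records[i].y = i;
        /* Release store: the payload writes above become visible to any
           thread that later observes ready == 1 with an acquire load. */
        atomic_store_explicit(&records[i].ready, 1, memory_order_release);
    }
    return NULL;
}

/* Consumer: stands in for the CPU reader thread.
   Returns the number of records whose payload was wrong, or -1 on error. */
int run_handoff(int n)
{
    assert(n > 0 && n <= MAX_RECORDS);
    record_count = n;
    for (int i = 0; i < n; i++) {
        records[i].x = 0;
        records[i].y = 0;
        atomic_store_explicit(&records[i].ready, 0, memory_order_relaxed);
    }

    pthread_t t;
    if (pthread_create(&t, NULL, producer, NULL) != 0)
        return -1;

    int bad = 0;
    for (int i = 0; i < n; i++) {
        /* Acquire load: spin until the producer has published record i. */
        while (atomic_load_explicit(&records[i].ready,
                                    memory_order_acquire) == 0) {
        }
        if (records[i].x != 100 || records[i].y != i)
            bad++;
    }
    pthread_join(t, NULL);
    return bad;
}
```

Calling run_handoff(512) from main (built with -pthread) should report no bad records, because the reader only touches a record after it has observed that record's flag. Without the flag, or with a plain non-atomic flag, the reader has no defined point at which the payload is guaranteed to be visible, which is consistent with the occasional 0 described above.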