

Adept I

I am trying to use SVM for data sharing between the CPU and GPU, and I have a question about the CL_MEM_SVM_ATOMICS flag.

I wrote code in which the GPU writes some data to an SVM buffer and a CPU thread reads it. However, every time I read it on the CPU, I get a few 0s. If I add a delay of 1 ns, all the results are correct. So I assume it is related to memory consistency, i.e. OpenCL does not guarantee when data becomes visible to the CPU until the kernel terminates. The interesting part is that if I use CL_MEM_SVM_ATOMICS, all the results are correct even without a delay, although I am not using any atomic operation anywhere. Can someone please help me understand why just setting the CL_MEM_SVM_ATOMICS flag changes the results? What exactly happens differently in memory when I use CL_MEM_SVM_ATOMICS without any atomic operation? I could not find the answer anywhere.

3 Replies
Big Boss

Hi Avinash,

Thank you for sharing this interesting observation. I don't have an explanation at the moment. As far as I know, the behavior described above is not guaranteed by the OpenCL standard. I've already shared your query with our engineering team, and I'll come back once I have their reply.

Meanwhile, please share a repro and the setup information where you've observed the behavior.



Hi Dipak,

Sorry for responding so late. I am trying to recreate the problem in a sample program so that I can post it here, but it does not seem easy to reproduce. In the meantime, here is the snippet from the program that actually produces this result.

        testout[thread_id].x = 100;
        testout[thread_id].y = testin[thread_id].timestamp;
        global_memcpy(testout[thread_id].buf3, b3, 16);
        global_memcpy(testout[thread_id].buf2, b2, 16);
        global_memcpy(testout[thread_id].buf1, b1, 8);
        global_memcpy(testout[thread_id].buf4, b4, 32);

where global_memcpy is defined as follows:

void global_memcpy(__global u8 *dest, u8 *src, size_t n)
{
   // Copy n bytes from the private buffer src[] to the global buffer dest[]
   for (size_t i = 0; i < n; i++)
       dest[i] = src[i];
}

Both testin and testout are created using CL_MEM_SVM_FINE_GRAIN_BUFFER.

b1, b2, b3 and b4 are private buffers created and computed per thread.

Even if I assume the computation in the code above is wrong, the value of x should always be 100 in the CPU output. But that's not the case; it sometimes prints 0. I am using a persistent kernel that I terminate after the CPU has read the output for all the input data. Only 512 data inputs are processed in one go.

Platform: AMD A12-9800 RADEON R7, 12 COMPUTE CORES 4C+8G

OS: ubuntu 14.04.1

linux kernel: 3.13.0-133-generic (64 bit)

driver: fglrx

opencl version: 2.0 (AMDAPPSDK-3.0)
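
For readers following the thread: the handoff described above (a producer writes a payload such as x = 100 while a consumer polls for it) is normally made safe by publishing a flag with release/acquire ordering. Below is a minimal host-only sketch of that pattern using C11 atomics and POSIX threads as a stand-in for the GPU/CPU pair. The names `record`, `ready`, `producer` and `run_handoff` are hypothetical and do not come from the original program. In OpenCL 2.0 the device-side counterpart would be an `atomic_store_explicit` with `memory_order_release` and `memory_scope_all_svm_devices` on a buffer allocated with both CL_MEM_SVM_FINE_GRAIN_BUFFER and CL_MEM_SVM_ATOMICS, assuming the device reports support for SVM atomics. The sketch does not explain why the flag alone changed the results on this system.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical record, loosely modelled on testout[thread_id] in the post. */
struct record {
    int x;             /* payload, written with a plain store */
    atomic_int ready;  /* publication flag: 0 = not ready, 1 = ready */
};

/* Producer: stands in for the GPU work-item that fills the record. */
static void *producer(void *arg)
{
    struct record *r = arg;
    r->x = 100;  /* plain payload store */
    /* Release store: the payload store above is ordered before the flag. */
    atomic_store_explicit(&r->ready, 1, memory_order_release);
    return NULL;
}

/* Consumer: stands in for the CPU reader thread. Returns the x it saw,
 * or -1 if the producer thread could not be started. */
int run_handoff(void)
{
    struct record r = { .x = 0 };
    pthread_t t;

    atomic_init(&r.ready, 0);
    if (pthread_create(&t, NULL, producer, &r) != 0)
        return -1;

    /* Acquire load: once the flag reads 1, the payload is visible too. */
    while (atomic_load_explicit(&r.ready, memory_order_acquire) == 0) {
        /* spin until the producer publishes */
    }

    int seen = r.x;
    pthread_join(t, NULL);
    return seen;
}
```

Calling `run_handoff()` from a test program returns the value the consumer observed after the flag was set. Reading `x` without waiting on the flag would be a data race, which is the host-side equivalent of polling a fine-grained SVM buffer while the kernel is still running.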


Thanks for sharing the above information.

A complete repro would be more helpful in this case. It's no problem if it takes some time to put together. Once you have it, please share it with us.