3 Replies Latest reply on Jun 25, 2018 5:24 AM by dipak

    I am trying to use SVM for data sharing between CPU and GPU. However I have a question about CL_MEM_SVM_ATOMICS flag

    avinashkrc

      I wrote a program in which the GPU writes some data into an SVM buffer and a CPU thread reads it. However, every time I read the buffer on the CPU, I receive a few 0s. If I add a delay of 1 ns, all the results are correct, so I assume it is related to memory consistency, i.e. OpenCL does not guarantee when the data becomes visible to the CPU until the kernel terminates. The interesting part is that if I use CL_MEM_SVM_ATOMICS, all the results are correct even without a delay, though I am not using any atomic operation anywhere. Can someone please help me understand why merely setting the CL_MEM_SVM_ATOMICS flag changes the results? What exactly happens differently in memory when I use CL_MEM_SVM_ATOMICS without using any atomic operations? I could not find the answer anywhere.

        • Re: I am trying to use SVM for data sharing between CPU and GPU. However I have a question about CL_MEM_SVM_ATOMICS flag
          dipak

          Hi Avinash,

          Thank you for sharing this interesting observation. I don't have an explanation at the moment. As far as I know, this behavior is not guaranteed by the OpenCL standard. I've shared your query with our engineering team and will get back to you once I have their reply.

          In the meantime, please share a reproducible test case and details of the setup where you observed this behavior.

           

          Regards,

            • Re: I am trying to use SVM for data sharing between CPU and GPU. However I have a question about CL_MEM_SVM_ATOMICS flag
              avinashkrc

              Hi Dipak,

              Sorry for responding so late. I am trying to recreate the issue in a sample program so that I can post it here, but it does not seem easy to reproduce. In the meantime, here is the snippet of the actual program that produces this result:

               

                      testout[thread_id].x = 100;
                      testout[thread_id].y = testin[thread_id].timestamp;
                      global_memcpy(testout[thread_id].buf3, b3, 16);
                      global_memcpy(testout[thread_id].buf2, b2, 16);
                      global_memcpy(testout[thread_id].buf1, b1, 8);
                      global_memcpy(testout[thread_id].buf4, b4, 32);


              where global_memcpy is defined as follows:

              void global_memcpy(__global u8 *dest, u8 *src, size_t n)
              {
                 // Copy n bytes from the private source buffer to the
                 // destination in global (SVM) memory, one byte at a time
                 for (size_t i = 0; i < n; i++)
                     dest[i] = src[i];
              }
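              For reference, this is roughly how I understand the host-side allocation would look when the atomics flag is added. This is a hedged sketch rather than my actual host code; ctx and the wrapper name alloc_svm_out are placeholders, but clSVMAlloc and the flags are from the OpenCL 2.0 API:

```c
/* Sketch: allocating a fine-grained SVM buffer with atomics support.
 * Requires an OpenCL 2.0 context (ctx); returns NULL on failure. */
#include <CL/cl.h>
#include <stddef.h>

void *alloc_svm_out(cl_context ctx, size_t bytes)
{
    /* CL_MEM_SVM_ATOMICS on top of CL_MEM_SVM_FINE_GRAIN_BUFFER allows
     * atomics with memory_scope_all_svm_devices to order CPU and GPU
     * accesses to this buffer. */
    return clSVMAlloc(ctx,
                      CL_MEM_READ_WRITE |
                      CL_MEM_SVM_FINE_GRAIN_BUFFER |
                      CL_MEM_SVM_ATOMICS,
                      bytes,
                      0 /* default alignment */);
}
```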

               

              Both testin and testout are created using CL_MEM_SVM_FINE_GRAIN_BUFFER.

              b1, b2, b3, b4 are private buffers created and computed per thread.

              Even if my computation above were wrong, the value of x should always be 100 when the CPU reads it. But that is not the case: it sometimes prints 0. I am using a persistent kernel that I terminate only after the CPU has read the output for all the input data. Only 512 inputs are processed in one go.

              Platform: AMD A12-9800 RADEON R7, 12 COMPUTE CORES 4C+8G
              OS: Ubuntu 14.04.1
              Linux kernel: 3.13.0-133-generic (64-bit)
              Driver: fglrx
              OpenCL version: 2.0 (AMDAPPSDK-3.0)