6 Replies Latest reply on Oct 26, 2011 5:58 PM by firemars

    Coherence between CPU and GPU in Fusion APU

    firemars

      If I allocate a memory buffer "D" in system memory and launch two kernels at the same time, one on the CPU and one on the GPU.

      If the GPU uses atomic operations on "D", and the CPU also uses atomic operations on "D",
      can I get a coherent result?
      Thanks


        • Coherence between CPU and GPU in Fusion APU
          MicahVillmow
          We have a test case that does just this internally, so it does work.
            • Coherence between CPU and GPU in Fusion APU
              firemars

              Thanks for your reply.

              Actually, I also did some experiments on that.

              It seems that sometimes I get a coherent result and sometimes I don't, whether I use USWC or not.

              When I run just one kernel, on either the CPU or the GPU, the result of the atomic operations between threads is correct. But it does not work reliably when I run two kernels at the same time.

              I am interested in how the atomic operations work across two different kernels. Also, if I use an atomic operation on the CPU (in C, not in a kernel), can I get a coherent result with the atomic operation inside the kernel?
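              For reference, the kind of CPU-side atomic I have in mind is roughly the following (a minimal C11 sketch with hypothetical names, using atomic_fetch_add from <stdatomic.h> between two host threads instead of an OpenCL CPU kernel):

              ```c
              #include <pthread.h>
              #include <stdatomic.h>
              #include <stdio.h>

              #define ITERS 100000

              static atomic_int counter; /* shared counter, zero-initialized */

              /* Each host thread increments the shared counter atomically. */
              static void *worker(void *arg)
              {
                  (void)arg;
                  for (int i = 0; i < ITERS; i++)
                      atomic_fetch_add(&counter, 1); /* CPU-side atomic add */
                  return NULL;
              }

              /* Run two CPU threads against the one counter; return the total. */
              int run_counter_test(void)
              {
                  pthread_t t1, t2;
                  pthread_create(&t1, NULL, worker, NULL);
                  pthread_create(&t2, NULL, worker, NULL);
                  pthread_join(t1, NULL);
                  pthread_join(t2, NULL);
                  return atomic_load(&counter);
              }

              int main(void)
              {
                  printf("%d\n", run_counter_test()); /* 2 * ITERS = 200000 */
                  return 0;
              }
              ```

              Between host threads this is coherent by definition; the question is whether the same guarantee holds when one side of the contention is a GPU kernel's atomic_add.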

              Thanks

            • Coherence between CPU and GPU in Fusion APU
              MicahVillmow
              firemars,
              I believe this will only work with an APU, not with a discrete GPU. There are no atomic operations across the PCI bus, so atomicity is broken at that point.
                • Coherence between CPU and GPU in Fusion APU
                  firemars

                  Hi Micah,

                  Thanks for your reply.

                  I ran the experiment on an APU, an A8-3850. The OS is Windows 7.

                  I launched two kernels, one on the CPU and the other on the GPU.

                  Both kernels do an atomic_add on one memory object, which is created with CL_MEM_ALLOC_HOST_PTR.

                  From what I have seen, atomicity between these two kernels cannot be guaranteed.

                  I think the reason might be that the CPU uses a write-combining buffer, so the GPU cannot read the latest data.

                  If that is true, is there any way to make sure the CPU flushes its writes to memory?

                  Thanks
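
                  On the CPU side, the ordering I am after would look like this between two host threads (a minimal C11 sketch with hypothetical names; a release store makes earlier writes visible before a flag is published — it demonstrates host-thread ordering only, not CPU-GPU coherence, which a fence alone cannot guarantee):

                  ```c
                  #include <pthread.h>
                  #include <stdatomic.h>
                  #include <stdio.h>

                  static int payload;      /* plain (non-atomic) shared data */
                  static atomic_int ready; /* publication flag */

                  static void *producer(void *arg)
                  {
                      (void)arg;
                      payload = 42;                                /* 1. write the data */
                      atomic_store_explicit(&ready, 1,
                                            memory_order_release); /* 2. publish: earlier
                                                                      writes are visible
                                                                      before the flag */
                      return NULL;
                  }

                  /* Spin until the flag is set, then read the published data. */
                  int run_flag_test(void)
                  {
                      pthread_t t;
                      payload = 0;
                      atomic_store(&ready, 0);
                      pthread_create(&t, NULL, producer, NULL);
                      while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
                          ; /* wait for the release store */
                      int seen = payload; /* acquire/release pairing guarantees 42 */
                      pthread_join(t, NULL);
                      return seen;
                  }

                  int main(void)
                  {
                      printf("%d\n", run_flag_test()); /* prints 42 */
                      return 0;
                  }
                  ```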

                    • Coherence between CPU and GPU in Fusion APU
                      genaganna

                       


                      firemars,

                      A buffer created with CL_MEM_ALLOC_HOST_PTR is not write-combined.

                      It looks like something is going wrong. Could you please paste your experimental code here?

                        • Coherence between CPU and GPU in Fusion APU
                          firemars

                          Because the code of my experiment is too long, I have copied the relevant parts here. The program works well when using only the CPU or only the GPU.

                          However, with CPU+GPU there is a coherency problem like the following.

                          The work done, in order:

                          CPU: data: 1;

                          CPU: data: 2;

                          CPU: data: 3;

                          GPU: data: 2;

                          CPU: data: 3;

                          CPU: data: 4;

                          data: 3 is done twice by the CPU.

                          Another thing: if I increase the workload for each data item, this situation occurs less often.

                           

                          ///////////////////////////////main.cpp///////////////////////////////////////////
                          struct Shared_data
                          {
                              int data;
                          } shared_data;

                          cl_mem d_shared_data;

                          int main()
                          {
                              cl_int err;
                              shared_data.data = 0;
                              d_shared_data = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(struct Shared_data), NULL, &err);

                              struct Shared_data *p_shared_data = (struct Shared_data *)clEnqueueMapBuffer(command_queue, d_shared_data, CL_TRUE, CL_MAP_WRITE, 0, sizeof(struct Shared_data), 0, NULL, NULL, &err);
                              memcpy(p_shared_data, &shared_data, sizeof(struct Shared_data));
                              clEnqueueUnmapMemObject(command_queue, d_shared_data, p_shared_data, 0, NULL, NULL);

                              create_thread(work); // for gpu
                              create_thread(work); // for cpu
                          }

                          void work()
                          {
                              clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_shared_data);

                              clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, global, local, 0, NULL, NULL);
                          }

                          ////////////////////////////Kernel.cl////////////////////////////////////////////
                          #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

                          struct Shared_data
                          {
                              int data;
                          };

                          __kernel void fun(volatile __global struct Shared_data *shared_data)
                          {
                              int local_id = get_local_id(0);
                              __local int s_data[1];

                              while (1)
                              {
                                  if (local_id == 0)
                                  {
                                      s_data[0] = atomic_add(&(shared_data->data), 1);
                                  }
                                  barrier(CLK_LOCAL_MEM_FENCE);

                                  if (s_data[0] > threshold)
                                      break;

                                  // do work on s_data[0]
                                  ...........
                              }
                          }