4 Replies Latest reply on Aug 14, 2014 7:52 AM by maxdz8

    About the cache coherence between CPU and descrete GPU

    lennox

      Hi all,

      I am studying how to create an interactive connection between a CPU and a descrete CPU on OpenCL.

      I create a buffer(unsigned long) with CL_MEM_ALLOC_HOST_PTR flag, and then I pass the buffer to be a kernel argument.

      I am sure the buffer is pinned in host memory by the report of the CodeXL.

      The kernel code for GPU  is an infinite loop to probe the buffer, if the buffer is not zero, the loop will break.

      After issue the kernel code, the CPU-side program waits for a while and then changes the buffer to 1.

      (I call clEnqueueMapBuffer to get the pointer and call clEnqueueUnmapMemObject after the modification)

      I expect that the GPU could notice the buffer is not zero and then exits the infinite loop.

      However, the result is the loop cannot be terminated.

      I  doubt it is because the GPU has cached the content of the buffer and the vaule is still zero.

      Will the hardware or the dirver keep cache coherence?

      If the answer is no, can I flush the cache of GPU?


        • Re: About the cache coherence between CPU and descrete GPU
          lennox

          I forgot to say that I map and unmap the buffer via another queue.

          And there are some update,

          I tried malloc() and CL_MEM_USE_HOST_PTR, the result is the same.

          I also tried to execute the kernel on a multicore CPU, ans there is no change.

           

          Partial code

          ========

          main thread

          ====

          // Create a command queue
            cl_command_queue command_queue = clCreateCommandQueue(context, device_id,
            CL_QUEUE_PROFILING_ENABLE ,
            &ret);
          
            cl_command_queue memop_queue = clCreateCommandQueue(context, device_id,
            CL_QUEUE_PROFILING_ENABLE ,
            &ret);
            .....
            char *buf_flag = malloc(sizeof(unsigned long));
          
            cl_mem flags=clCreateBuffer(context,CL_MEM_READ_WRITE |CL_MEM_USE_HOST_PTR, sizeof(buf_flag) ,buf_flag, &ret);
            ...
            ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&flags);
            ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
            &global_item_size, &local_item_size, 0, NULL,
            &ndrEvt);
            ...
          while(1){
            *buf_flag=1;
            while(buf_flag);
            printf("kernel ack\n");
            }
          
          
          
          
          
          

           

           

           

          ==============================

          kernel

          =====

          __kernel void test_host(__global unsigned long *flags) {
          
            while(!flags[0]);
          
            do something
          
            flags[0]=0;
          }
          
          
          
          
          
          
            • Re: About the cache coherence between CPU and descrete GPU
              maxdz8

              Hello lennox, it is my understanding what you're trying to do is implementation defined and I can tell you it's not going to work in 99% of the cases before CL2 is released AND you get a SVM-enabled device with fine grained fences.

               

              There are various things even in the 1.2 spec pointing at this and of course I cannot pull out the specific one but anyway:

               

              "clEnqueueMapBuffer, and clEnqueueMapImage act as synchronization points for a region of the buffer object being mapped."

               

              Devices will be very lazy in updating those buffers, especially if they are so small. There's no sync point after buffer creation. You never map it, so there's no request by your side to have the updated contents.

               

              Doing a clFlush might help (doing clFinish would be just better in this specific example) but the correct solution is to request a map or a read.

               

              That's effectively basic multiprocessing: you cannot assume a co-processor gets scheduled ever and you must sync them. It would happen even in x86 multicore, let alone with a GPU which lives far away on its own board.

              1 of 1 people found this helpful
                • Re: Re: About the cache coherence between CPU and descrete GPU
                  lennox

                  I use GDB to look why the kernel cannot exit the loop, and I disassembly the kernel around the PC.

                  I am not familiar to x86 instrunctions, but the code seems pooling a register only, is it right?

                  The kernel argument is a pointer, the compiler seem assuming the content never be changed.

                    0x00007ffff01cb5b1 <+321>: sete %al
                    0x00007ffff01cb5b4 <+324>: data32 data32 nopw %cs:0x0(%rax,%rax,1)
                  => 0x00007ffff01cb5c0 <+336>: test $0x1,%al
                    0x00007ffff01cb5c2 <+338>: mov $0x1,%al
                    0x00007ffff01cb5c4 <+340>: jne 0x7ffff01cb5c0 <__OpenCL_test_host_stub+336>
                  
                  

                   

                  So, I change the type of the kernel argument

                  __kernel void test_host(__global volatile unsigned long *flags)
                  

                  Now, it works on a multicore processor, however, it still doesn't work on GPU.

              • Re: About the cache coherence between CPU and descrete GPU
                maxdz8

                I haven't looked at x86 code in a while and I suggest you to not do that either: you have a consistent logic error at the logical model level.

                 

                This might appear to work on multicore CPUs as they have special hardware constructs (and compiler support) to ensure this - even though they'll go race condition without a guard. You don't use multiple threads without sync points!

                 

                Anyway, there are no such constructs to implicitly sync with GPUs. Global memory isn't even what you probably think it is: it's not really global in the sense of "system", it is global in the sense of "outside of GPU core", not a register nor local. It can be VRAM or PCIex aperture memory or even mapped host... in no case coherency is guaranteed.

                 

                To have guaranteed coherency you need device with Shared Virtual Memory and OpenCL2, which supports the new feature.