29 Replies Latest reply on May 13, 2013 1:30 AM by himanshu.gautam

    Concurrent kernel execution between CPU and GPU

    hschu

      Hi. I am trying to verify simple heterogenous computing using a CPU and a GPU using OpenCL. The Kernel function is a simple BLAS level 1 saxpy (single-precision, scalar multplication and vector addition) algorithm, and I assigned "n" numbers of elements to the CPU and "nn-n" to the GPU, where "nn" is the vector length. Moving n variable, I wanted to figure out a splitting point "n" that minimizes whole computational time.

       

      In order to get the ideal splitting point, OpenCL should guarantee a concurrency under heterogeneous system. So I tried to verify that concurrency by testing a simple program as follows.

       

         CPerfCounter t1;

         ...

         t1.Reset();
         t1.Start();

         // Enqueue to write the target vectors x and y to GPU Global memory.
         clEnqueueWriteBuffer(cqCommandQueue_gpu, cl_x, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), x, 0, NULL, NULL);
         clEnqueueWriteBuffer(cqCommandQueue_gpu, cl_y, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), y, 0, NULL, NULL);

        
         // Enqueue NDRange to CPU
         err = clEnqueueNDRangeKernel(cqCommandQueue_cpu, ckKernel[1], 1, NULL, &GWS2, &LWS2, 0, NULL, NULL);

         // Enqueue NDRange to GPU
         clEnqueueNDRangeKernel(cqCommandQueue_gpu, ckKernel[0], 1, NULL, &GWS, &LWS, 0, NULL, NULL);
        
         // Enqueue to read the result vector to Host memory
         clEnqueueReadBuffer(cqCommandQueue_gpu, cl_y, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), z, 0, NULL, NULL);
         //clFlush(cqCommandQueue_gpu);

         clFlush(cqCommandQueue_cpu);
         clFinish(cqCommandQueue_gpu);
         clFinish(cqCommandQueue_cpu);
         t1.Stop();
           
      I intentionally remove "clFlush(cqCommandQueue_gpu)" since there were no big differences about results. Here is profile information using AMD Profiler. I found out some strang results.

       

      Case1. Not executed in parallel

      1.png

       

      Case2. Working properly

      2.png

       

      Case3. Strangely PCI express holds data while CPU computes

      3.png

       

      How can I analyze these results?

      Thanks in advance.

       

      ------------- My information

      Windows 7 64-bit, VS 2010

      CPU : FX 8120

      GPU : Radeon 7970

        • Re: Concurrent kernel execution between CPU and GPU
          himanshu.gautam

          2 kernels updating same cl_mem object at the same time will not work properly.

          Refer to Appendix A of the OpenCL spec.

           

          Instead use sub-buffers (clCreateSubBuffer..). Assign 1 SB to CPU and other to GPU and they can work in parallel...and then read back the mother buffer....All will be well.

            • Re: Concurrent kernel execution between CPU and GPU
              hschu

              Thanks.

               

              I have already assigned indepent buffers for each kernel. Above profile was done by independent buffers.

               

                 // Create buffer objects
                 cl_x = clCreateBuffer(cxContext, CL_MEM_READ_ONLY,   sizeof(FLOAT)*(nn-n), NULL, &err);
                 cl_y = clCreateBuffer(cxContext, CL_MEM_READ_WRITE,  sizeof(FLOAT)*(nn-n), NULL, &err);
                 CheckErr(err);

               

                 cl_xc = clCreateBuffer(cxContext, CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR, sizeof(FLOAT)*n, (void *)xc, &err);
                 cl_yc = clCreateBuffer(cxContext, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(FLOAT)*n, (void *)yc, &err);
                 CheckErr(err);

                • Re: Concurrent kernel execution between CPU and GPU
                  himanshu.gautam

                  clEnqueueWriteBuffer(cqCommandQueue_gpu, cl_x, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), x, 0, NULL, NULL);

                  clEnqueueWriteBuffer(cqCommandQueue_gpu, cl_y, CL_FALSE, 0, sizeof(FLOAT)*(nn-n), y, 0, NULL, NULL);

                   

                  In the code above , Is there a reason why you use the GPU command queue for both the writes?

                  Also, I hope you have set the arguments correctly as well. You may want to re-check.

                    • Re: Concurrent kernel execution between CPU and GPU
                      hschu

                      Thank you for replying my discussion. This will be great help to me.

                       

                      Saxpy algorithm needs to have two vectors.

                      y = a*x + y, where "a" is a scalar variable and x and y are vector with given length. I assigned it as "n" for the CPU and "nn-n" for the GPU, so totally I want to calculate saxpy operation with "nn" vector length. Each device has two vectors x and y, therefore GPU writes two buffers x and y with length "nn-n" which are independent buffers from CPU vectors.

                       

                      I have tested CPU only and GPU only program, and worked well. This is my first time to try heterogeneous computing with OpenCL. But this example is very easy, I expected almost ideal results...

                        • Re: Concurrent kernel execution between CPU and GPU
                          himanshu.gautam

                          Oh okay.. Now I understand...

                          from your original post, it is not clear:

                          1) Are you receiving wrong results?

                          2) Are you receiving correct results but not expected overlapped execution?

                            • Re: Concurrent kernel execution between CPU and GPU
                              hschu

                              I attatched my source codes at the original posting.

                               

                              1) Yes, I got correct results.

                               

                              2) Yes, exactly. I expected overlapped executions.

                               

                              I fixed nn = 50331648(=48K) where n varies from 1K to 47K. Here are timing results with CPU only and GPU only.

                               

                                            n                   nn-n          CPU time(sec)   GPU time(sec)

                                      1048576         49283072        0.0009259        0.3113512

                                      2097152         48234496        0.0019265        0.3030655

                                      3145728         47185920        0.0027454        0.2960925

                                      4194304         46137344        0.0043816        0.2872379

                                      5242880         45088768        0.0046060        0.2832044

                                      6291456         44040192        0.0055247        0.2767915

                                      7340032         42991616        0.0063134        0.2695790

                                      8388608         41943040        0.0076503        0.2652252

                                      9437184         40894464        0.0082551        0.2594063

                                     10485760         39845888        0.0092112        0.2530073

                                     11534336         38797312        0.0101195        0.2437295

                                     12582912         37748736        0.0109935        0.2392208

                                     13631488         36700160        0.0119362        0.2346292

                                     14680064         35651584        0.0129057        0.2290327

                                     15728640         34603008        0.0137355        0.2227570

                                     16777216         33554432        0.0158361        0.2124377

                                     17825792         32505856        0.0157124        0.2050176

                                     18874368         31457280        0.0165814        0.1993729

                                     19922944         30408704        0.0173361        0.1932635

                                     20971520         29360128        0.0188006        0.1867926

                                     22020096         28311552        0.0191749        0.1798617

                                     23068672         27262976        0.0199704        0.1736680

                                     24117248         26214400        0.0211467        0.1651932

                                     25165824         25165824        0.0222448        0.1622620

                                     26214400         24117248        0.0225625        0.1567150

                                     27262976         23068672        0.0232405        0.1501289

                                     28311552         22020096        0.0247861        0.1425812

                                     29360128         20971520        0.0258827        0.1376894

                                     30408704         19922944        0.0258399        0.1315505

                                     31457280         18874368        0.0267882        0.1256371

                                     32505856         17825792        0.0287400        0.1184132

                                     33554432         16777216        0.0283369        0.1076324

                                     34603008         15728640        0.0290841        0.0996051

                                     35651584         14680064        0.0298546        0.0935058

                                     36700160         13631488        0.0308073        0.0863872

                                     37748736         12582912        0.0324147        0.0819703

                                     38797312         11534336        0.0329809        0.0750496

                                     39845888         10485760        0.0332961        0.0686574

                                     40894464          9437184        0.0340249        0.0611951

                                     41943040          8388608        0.0349598        0.0551111

                                     42991616          7340032        0.0356537        0.0483755

                                     44040192          6291456        0.0364748        0.0426752

                                     45088768          5242880        0.0371984        0.0357001

                                     46137344          4194304        0.0388203        0.0290992

                                     47185920          3145728        0.0388292        0.0225611

                                     48234496          2097152        0.0400297        0.0168223

                                     49283072          1048576        0.0415698        0.0123415

                               

                              I want get a result of hetergeneous computing as max(CPUtime(n),GPUtime(nn-n)). But above posted program results in nothing but (CPUtime(n)+GPUtime(nn-n)). This is why I raised a question and I showed the profiling figures.

                    • Re: Concurrent kernel execution between CPU and GPU
                      hschu

                      GPU : 2 WriteBuffer, 1 NDRangeKernel, 1 ReadBuffer

                      CPU : 1 NDRageKernel

                       

                      Above profiling figures are parts of single profiling information.

                       

                      I categorized three cases based on behaviors. What I want to see is whether NDRangeKernel of the CPU and the first WriteBuffer of the GPU start at almost same time line or not.

                       

                      Case1.

                      The CPU and GPU executions are serialized.  This case shows that after the CPU had finished its kernel computation, the first WriteBuffer of the GPU have started to transfer.

                       

                      Case2.

                      Both NDRangeKernel of the CPU and the first WriteBuffer of the GPU have started in almost similar time line. This is what I exactly expected.

                       

                      Case3.

                      NDRangeKernel of the CPU and the first WriteBuffer of the GPU are executed at similar time line, but the first WriteBuffer of the GPU took a long time compared with usual case. The PCIe bandwidth of my machine is usually up to 3.0GB/s writing, but in this first buffer case it is 300~500MB/s. In my opinion, the CPU is busy for computing saxpy, so PCIe handling is somewhat delayed since I assigned work-group size as 1024, which is the maximum number of work-group size.

                       

                      On the other hands, some interesting patterns have been found.

                      Case1 and Case3 occured when the CPU computes more than the GPU. If GPU is in heavy computation, Case2 was observed.

                        • Re: Concurrent kernel execution between CPU and GPU
                          himanshu.gautam

                          You have listed 3 cases above. So, I believe you randomly hit one of these cases as you run your application many number of times. Is that right?

                           

                          Also, Please let me know if clWaitForEvents() helped you... And yes, Please clFlush() on all CQs before waiting...

                           

                          Just a word of caution on Sub-buffer idea that I mentioned above:

                          I think, sub-buffer is an OpenCL 1.2 concept -- May not work on OpenCL vendors who have not matured to 1.2 yet, like for example: nvidia.

                            • Re: Concurrent kernel execution between CPU and GPU
                              hschu

                              That is right.

                               

                              I have two test machines with almost same devices, call those two machines A and B. For some reasons, AMD profiler is not working properly in machine A. But I was able to get timing results without profiling data from machine A using clWaitForEvents(). The result shows almost same trend like (CPU_times(n)+GPU_times(nn-n)) rather than max(CPU_times(n),GPU_times(nn-n)).

                               

                              I will profile soon using machine B, which is able to profile the program.

                              • Re: Concurrent kernel execution between CPU and GPU
                                hschu

                                Now I tested with profiler.

                                 

                                Case 1, 2, and 3 also observed again randomly.

                                I applied clWaitForEvent() and AMD ADK Timer which uses QueryPerformanceCounter(). I put clFlush() in all equeueing functions and removed CL_PROFILEING_ENABLE flags for each commandqueue.

                                 

                                I attatched profile information at the original post.

                                 

                                Thank you.

                                  • Re: Concurrent kernel execution between CPU and GPU
                                    himanshu.gautam

                                    "Case1 and Case3 occured when the CPU computes more than the GPU. If GPU is in heavy computation, Case2 was observed."

                                     

                                    One quick suggestion is to use device fission extension. This should block one CPU core for handling host-device communications, and rest of them can used as CPU device. Anyways i will try to read through your code, and will let you know.

                                      • Re: Concurrent kernel execution between CPU and GPU
                                        hschu

                                        Thank you. I don't have any APU right now, so someone may try to test thist simple code.

                                         

                                        My final goal is to find spliting ratio between a CPU and a GPU for saxypy (BLAS level 1), matrix vector multiplication (BLAS level 2), and matrix matrix multiplication (BLAS level 3). At first, I want to get real data reasonably, and then analyze it with hardware parameters like memory bandwidth, flops, kernel overheads, and latencies. Finally, I may be able to express this real data in terms of time function with given "nn" variable.

                                         

                                        And I have two simple questions.

                                         

                                        1. What is the ideal work-group size for CPU?

                                        I know it is definitely depending on the kenel. But in the case of saxpy operation, this is very light and shows very clear results when "nn" increases. I have AMD FX 8120, which has eight cores, and this shows peak performance when I assigned the maximum work-group size as 1024. In my thought, since FX 8120 has 8 cores, it will be natural to assign work-group size as 8. But in this case, the performance is much lower than the 1024 case.

                                         

                                        2. Max memory allocation

                                        HD 7950 has 2GB global memory, and "Global memory size" in clinfo also shows 2GB. But "Max memory allocation" shows 25% of 2GB, which becomes 512MB... I have read some posts that the "Max memory allocation" will be about 60% of actual global memory size. Moreover, CPU clinfo also shows 25% of its global memory size. How can I use more "Max memory allocation"?

                                          • Re: Concurrent kernel execution between CPU and GPU
                                            himanshu.gautam

                                            Max Mem Allocation, I believe, is for 1 OpenCL clCreateBuffer() call. You can still use the rest of memory with multiple allocations.

                                            Please cross-check with the spec on the exact paraeter which you are querying...

                                            +

                                            Device Fission has nothing to do with APU. You can run the DeviceFission sample in the APP SDK and check out the code

                                              • Re: Concurrent kernel execution between CPU and GPU
                                                hschu

                                                Okay. I understand. Thank you for kind reply.

                                                 

                                                I will look at the DeviceFission SDK and try that.

                                                • Re: Concurrent kernel execution between CPU and GPU
                                                  hschu

                                                  I got some results. I will start with a claim.

                                                   

                                                  [CLAIM]

                                                  WriteBuffer() to the GPU device memory is delayed when the CPU runs heavy kernel.

                                                   

                                                  By using clCreateSubDevice(), I assigned 4 cores of 8 cores CPU (FX 8120) as SubDevice_CPU so that remaining 4 cores can do some I/O operations.

                                                   

                                                  Here are results. Still it does not show max(CPU_time, GPU_time).

                                                  a.png

                                                  Magnified figure around cross section.

                                                  a_mag.png

                                                   

                                                  I want to see further, but the CL_MEM_OBJECT_ALLOCATION_FAILURE kept showing whenever totally allocated GPU buffers (not one buffer object) exceed the MAX MEMORY ALLOCATION.

                                                   

                                                  Now I am trying to profile this result. How do you think about my claim?

                                                    • Re: Concurrent kernel execution between CPU and GPU
                                                      nou

                                                      CL_MEM_OBJECT_ALLOCATION_FAILURE I propoesd long ago that OpenCL runtime should swap buffers from device memory if it can't fit them together and return error only when sum off all buffers needed for current kernel execution exceed device memory size.

                                                       

                                                      What is on X axis?

                                                        • Re: Concurrent kernel execution between CPU and GPU
                                                          hschu

                                                          I did a mistake at labeling X axis. Through this post, the X axis should be "n". I have fixed "nn" which is the length of each vector in saxpy operation. I gave "n" numbers of element to the CPU and "nn-n" numbers of element to the GPU by vaying "n" from 1K to 47K or 63K with step size 1K.

                                                           

                                                          nou, Could explain more about CL_MEM_OBJECT_ALLOCATION_FAILURE? What was "device memory size"? Is that "Max memory allocation" or "Global memory size" in clinfo?

                                                           

                                                          Thanks.

                                                        • Re: Concurrent kernel execution between CPU and GPU
                                                          himanshu.gautam

                                                          Thanks for sharing the results. It looks interesting. Can you also attach your latest code (with the latest reply). Will try to run it here.

                                                          • Re: Concurrent kernel execution between CPU and GPU
                                                            hschu

                                                            Thanks for himanshu.gautam and nou.

                                                             

                                                            I tried BLAS level 2 matrix vector multiplication and I got almost similar results with BLAS level 1 saxpy algorithm. The possible reason why thoes figures showed somewhat identical behavior is that BLAS algorithms tend to be data transfer dominant algorithms. Here is the result.

                                                             

                                                             

                                                            7.png

                                                            The green line was added to express just sum of CPU and GPU (the red line value + the blue line value). I have n times n matrix "A", and moving variable k means k times n matrix to the CPU and (n-k) times n matrix to the GPU.

                                                             

                                                            There exists some overheads around cross section point and around starting position. I have a lot of interests in the overhead around cross section point and want to figure out why.

                                                             

                                                            I have looked at all useful links you recommended and some posts on the web about maximum memory allocation. I was able to get proper size of "Max alloc memory" using GPU_MAX_ALLOC_PERCENT=100. I double checked it using memtestCL by passing maximum available memory, and it worked fine. Strangely, the program I attatched in my original post kept showing same error CL_MEM_OBJECT_ALLOCATION_FAILURE even though I set the  GPU_MAX_ALLOC_PERCENT variable correctly. And even the clGetDeviceInfo() for CL_DEVICE_MAX_MEM_ALLOC_SIZE tells me right information. The memory object error comes up whenever I allocate over 512MB totally which was the original max memory allocation value before setting GPU_MAX_ALLOC_PERCENT as 100. The one thing left to do is reinstalling OS. Does anyone else have the same problem?

                                                              • Re: Concurrent kernel execution between CPU and GPU
                                                                himanshu.gautam

                                                                Are you using any flags while allocating the buffer? (As per your attachment - there is none except Read/write).

                                                                Also, I see that you are allocating many buffers (some with USE_HOST_PTR) as well.

                                                                Are you sure, if you are not running out of Host memory? If you are allocating host buffers of size 512MB, your mallocs() may also be failing.

                                                                If you post your exact code here, (if different from the attachment in first post), I will try it here.

                                                                Thanks,