4 Replies Latest reply on Sep 11, 2013 12:34 AM by himanshu.gautam

    clFlush() and Event Profiling Problem.

    pramodv_a

      Hi,

       

      I am trying to parallelize DMA and computation with two in order queues, facing strange Event Profiling results when clFlush() is being used.

      On AMD GPU almost all events recorded with same/invalid time stamps. (Please refer to to the results stated below, marked in red)


      This method is shared by "Nvidia_OpenCL_SDK_4_2_Linux.zip" (oclCopyComputeOverlap.cpp), and works fine on Nvidia GPU's.

      Attached the source code for your reference. (in-order-2queues.cpp)

       

      I am not able to figure out what could possibly going wrong with this approach on AMD.

      Please advise.

       

      Thank You

      Regards,

      -Pramod

       

      Queues used: Two In order queues

       

      Code snippet:

      _________________________________________________________________________________________

       

      for(uint8_t *ptr(data); ptr < data+dat_sz; ptr+=(2*buf_sz), i++) {

                  clEnqueueWriteBuffer(cqs[0],bufs[0],CL_FALSE,0,buf_sz,ptr,0,NULL,&evt[6*i+0]);

                      clFlush(cqs[0]);

                      clFlush(cqs[1]);

                      clSetKernelArg(kernel,0,sizeof(cl_mem),(void *)&bufs[0]);

                  clEnqueueNDRangeKernel(cqs[0],kernel,1,NULL,&gws,&lws,1,NULL,&evt[6*i+1]);

                  clEnqueueWriteBuffer(cqs[1],bufs[1],CL_FALSE,0,buf_sz,ptr+buf_sz,0,NULL,&evt[6*i+2]);

                      clFlush(cqs[0]);

                      clFlush(cqs[1]);

                      clSetKernelArg(kernel,0,sizeof(cl_mem),(void *)&bufs[1]);

                  clEnqueueNDRangeKernel(cqs[1],kernel,1,NULL,&gws,&lws,0,NULL,&evt[6*i+3]);

                  clEnqueueReadBuffer(cqs[0],bufs[0],CL_FALSE,0,buf_sz,ptr,0,NULL,&evt[6*i+4]);

                      clFlush(cqs[0]);

                      clFlush(cqs[1]);

                  clEnqueueReadBuffer(cqs[1],bufs[1],CL_FALSE,0,buf_sz,ptr+buf_sz,0,NULL,&evt[6*i+5]);

              }

      _________________________________________________________________________________________

       

      Other Details:

       

      Driver:

      amd-catalyst-13.8-beta2-linux-x86.x86_64.run

       

      SDK:

      AMD-APP-SDK-v2.8.1.0-RC-lnx64.tgz

       

      Device details from clinfo command

      --------------------------------------------------------------

        Platform Name:     AMD Accelerated Parallel Processing
      Number of devices:     2
        Device Type:     CL_DEVICE_TYPE_GPU
        Device ID:     4098
        Board name:     AMD Radeon HD 7900 Series
        Device Topology:     PCI[ B#15, D#0, F#0 ]
        Name:     Tahiti
        Vendor:     Advanced Micro Devices, Inc.
        Device OpenCL C version:     OpenCL C 1.2
        Driver version:     1307.1 (VM)
        Profile:     FULL_PROFILE
        Version:     OpenCL 1.2 AMD-APP (1307.1)

       

       

      Results:

      without clFlush()

      buf_sz = 16 MB    data_sz=256 MB

      Write/Kernel/Read  2 queues,  ALLOC_HOST,

      CMD, SUBMIT:  START-END TIME

         0: W,   0.0:   0.7-  9.0  8.3 ms,| X,   0.0:   0.7-  9.0  8.3 ms,| W,   0.8:   2.2- 10.7  8.4 ms,| X,   2.2:  10.7- 17.7  7.1 ms,| R,   0.7:   9.0- 14.3  5.3 ms,| R,   9.9:  17.7- 69.2 51.4 ms,|

         1: W,   2.2:  14.3- 18.1  3.8 ms,| X,   2.2:  14.3- 18.1  3.8 ms,| W,  68.7:  69.2- 71.6  2.4 ms,| X,  69.2:  71.6- 74.4  2.8 ms,| R,   2.6:  18.1- 38.3 20.1 ms,| R,  69.3:  74.4- 77.0  2.6 ms,|

         2: W,   9.6:  38.3- 40.7  2.4 ms,| X,   9.6:  38.3- 40.7  2.4 ms,| W,  69.4:  77.0- 79.4  2.4 ms,| X,  69.6:  79.4- 82.3  2.8 ms,| R,  10.0:  40.7- 43.3  2.6 ms,| R,  69.7:  82.3- 84.9  2.6 ms,|

         3: W,  10.2:  43.3- 45.8  2.5 ms,| X,  10.2:  43.3- 45.8  2.5 ms,| W,  69.8:  84.9- 87.3  2.4 ms,| X,  69.9:  87.3- 90.1  2.8 ms,| R,  10.3:  45.8- 48.4  2.6 ms,| R,  70.0:  90.1- 92.7  2.6 ms,|

         4: W,  10.5:  48.4- 50.8  2.4 ms,| X,  10.5:  48.4- 50.8  2.4 ms,| W,  70.2:  92.7- 95.1  2.4 ms,| X,  70.3:  95.1- 98.0  2.8 ms,| R,  10.7:  50.8- 53.4  2.6 ms,| R,  70.4:  98.0-100.6  2.6 ms,|

         5: W,  10.8:  53.4- 55.9  2.4 ms,| X,  10.8:  53.4- 55.9  2.4 ms,| W,  70.5: 100.6-103.0  2.4 ms,| X,  70.6: 103.0-105.8  2.8 ms,| R,  11.0:  55.9- 58.5  2.6 ms,| R,  70.7: 105.8-108.5  2.7 ms,|

         6: W,  11.1:  58.5- 60.9  2.4 ms,| X,  11.1:  58.5- 60.9  2.4 ms,| W,  70.9: 108.5-110.9  2.4 ms,| X,  71.0: 110.9-113.7  2.8 ms,| R,  11.3:  60.9- 63.5  2.6 ms,| R,  71.1: 113.7-116.4  2.6 ms,|

         7: W,  11.4:  63.5- 66.0  2.4 ms,| X,  11.4:  63.5- 66.0  2.4 ms,| W,  71.2: 116.4-118.8  2.4 ms,| X,  71.3: 118.8-121.6  2.8 ms,| R,  11.6:  66.0-124.7 58.7 ms,| R,  71.5: 121.6-124.4  2.8 ms,|

       

      with clFlush()

      buf_sz = 16 MB    data_sz=256 MB

      Write/Kernel/Read  2 queues,  ALLOC_HOST,

      CMD, SUBMIT:  START-END TIME

         0: W,   0.0: 122.0-126.7  4.7 ms,| X,   0.0: 122.0-126.7  4.7 ms,| W,   0.2:   3.2-  7.9  4.7 ms,| X,   0.6:   5.1-  7.9  2.8 ms,| R,   0.1: 123.8-126.7  2.8 ms,| R,   7.9:  10.9- 20.7  9.8 ms,|

         1: W,   0.2: 123.6-126.7  3.1 ms,| X,   0.2: 123.6-126.7  3.1 ms,| W,   8.0:  20.7- 23.5  2.8 ms,| X,   8.1:  20.7- 23.5  2.8 ms,| R,   0.6: 117.2-126.7  9.5 ms,| R,   8.2: 106.5-123.0 16.5 ms,|

         2: W,   0.9: 124.1-126.7  2.6 ms,| X,   0.9: 124.1-126.7  2.6 ms,| W,   8.4: 118.2-123.0  4.8 ms,| X,   8.5: 120.2-123.0  2.8 ms,| R,   1.2: 109.7-126.7 17.0 ms,| R,   8.6: 109.3-123.0 13.7 ms,|

         3: W,   1.6: 121.8-126.7  4.9 ms,| X,   1.6: 121.8-126.7  4.9 ms,| W,   8.7: 120.5-123.0  2.5 ms,| X,   8.8: 120.2-123.0  2.8 ms,| R,   1.7: 124.1-126.7  2.6 ms,| R,   8.9: 109.1-123.0 13.9 ms,|

         4: W,   1.8: 124.3-126.7  2.5 ms,| X,   1.8: 124.3-126.7  2.5 ms,| W,   9.1: 118.2-123.0  4.8 ms,| X,   9.2: 120.2-123.0  2.8 ms,| R,   2.2: 112.8-126.7 14.0 ms,| R,   9.3: 112.2-123.0 10.9 ms,|

         5: W,   2.3: 124.3-126.8  2.4 ms,| X,   2.3: 124.3-126.8  2.4 ms,| W,   9.4: 120.5-123.0  2.6 ms,| X,   9.5: 120.2-123.0  2.8 ms,| R,   2.4: 112.4-126.8 14.3 ms,| R,   9.6: 120.4-123.1  2.6 ms,|

         6: W,   2.7: 121.9-126.8  4.8 ms,| X,   2.7: 121.9-126.8  4.8 ms,| W,   9.8: 120.6-123.1  2.4 ms,| X,  10.0: 120.2-123.1  2.8 ms,| R,   2.8: 124.1-126.8  2.6 ms,| R,  10.2: 120.4-123.1  2.6 ms,|

         7: W,   2.9: 124.2-126.8  2.6 ms,| X,   2.9: 124.2-126.8  2.6 ms,| W,  10.3: 120.6-123.1  2.4 ms,| X,  10.5: 120.3-123.1  2.8 ms,| R,   3.0:  90.6-126.9 36.3 ms,| R,  10.7: 120.2-126.6  6.4 ms,|

      Data Matched

       

      ________________________________________________________________________________

      without clFlush()

      buf_sz = 16 MB    data_sz=256 MB

      Write/Kernel/Read  2 queues,  USE_HOST,

      CMD, SUBMIT:  START-END TIME

         0: W,   0.1:   0.4-  5.3  4.9 ms,| X,   0.1:   0.4-  5.3  4.9 ms,| W,   0.0:   0.5-  5.4  4.9 ms,| X,   0.6:   5.4-  8.2  2.8 ms,| R,   0.5:   5.3-  8.1  2.8 ms,| R,   1.7:   8.2- 13.3  5.1 ms,|

         1: W,   0.7:   8.1- 10.6  2.6 ms,| X,   0.7:   8.1- 10.6  2.6 ms,| W,  13.1:  13.3- 16.3  3.0 ms,| X,  13.3:  16.3- 19.1  2.8 ms,| R,   0.8:  10.6- 15.5  4.9 ms,| R,  13.4:  19.1- 31.4 12.3 ms,|

         2: W,   1.5:  15.5- 18.3  2.8 ms,| X,   1.5:  15.5- 18.3  2.8 ms,| W,  13.5:  31.4- 35.4  4.0 ms,| X,  13.6:  35.4- 38.2  2.8 ms,| R,   1.7:  18.3- 30.3 12.0 ms,| R,  13.7:  38.2- 46.5  8.3 ms,|

         3: W,   1.8:  30.3- 34.4  4.1 ms,| X,   1.8:  30.3- 34.4  4.1 ms,| W,  13.8:  46.5- 49.2  2.7 ms,| X,  13.8:  49.2- 52.0  2.8 ms,| R,   1.9:  34.4- 37.4  3.0 ms,| R,  13.9:  52.0- 77.6 25.6 ms,|

         4: W,   2.0:  37.4- 39.9  2.5 ms,| X,   2.0:  37.4- 39.9  2.5 ms,| W,  14.0:  77.6- 82.4  4.8 ms,| X,  14.1:  82.4- 85.3  2.8 ms,| R,   2.2:  39.9- 48.8  8.9 ms,| R,  14.2:  85.3- 93.0  7.8 ms,|

         5: W,   2.2:  48.8- 51.4  2.6 ms,| X,   2.2:  48.8- 51.4  2.6 ms,| W,  14.3:  93.0- 95.6  2.6 ms,| X,  14.4:  95.6- 98.4  2.8 ms,| R,   2.3:  51.4- 77.6 26.2 ms,| R,  14.5:  98.4-101.1  2.6 ms,|

         6: W,   2.4:  77.6- 82.4  4.8 ms,| X,   2.4:  77.6- 82.4  4.8 ms,| W,  14.6: 101.1-103.5  2.4 ms,| X,  14.8: 103.5-106.3  2.8 ms,| R,   2.5:  82.4- 85.1  2.6 ms,| R,  14.9: 106.3-108.9  2.6 ms,|

         7: W,   2.6:  85.1- 87.6  2.5 ms,| X,   2.6:  85.1- 87.6  2.5 ms,| W,  15.0: 108.9-111.4  2.4 ms,| X,  15.1: 111.4-114.2  2.8 ms,| R,   2.7:  87.6-118.1 30.5 ms,| R,  15.2: 114.2-118.0  3.8 ms,|

       

      with clFlush()

      buf_sz = 16 MB    data_sz=256 MB

      Write/Kernel/Read  2 queues,  USE_HOST,

      CMD, SUBMIT:  START-END TIME

         0: W,   0.1: 112.8-117.7  4.9 ms,| X,   0.1: 112.8-117.7  4.9 ms,| W,   0.1:   3.2-  8.1  4.9 ms,| X,   0.4:   5.3-  8.1  2.8 ms,| R,   0.3: 114.9-117.7  2.8 ms,| R,   8.1:  10.8- 15.6  4.9 ms,|

         1: W,   0.5: 115.1-117.7  2.6 ms,| X,   0.5: 115.1-117.7  2.6 ms,| W,   8.2:  16.0- 18.5  2.6 ms,| X,   8.3:  15.7- 18.5  2.8 ms,| R,   0.7: 112.6-117.7  5.1 ms,| R,   8.4: 100.6-117.0 16.4 ms,|

         2: W,   1.0: 115.3-117.7  2.5 ms,| X,   1.0: 115.3-117.7  2.5 ms,| W,   8.6: 110.8-117.1  6.4 ms,| X,   8.8: 114.3-117.2  2.8 ms,| R,   1.2: 101.0-117.8 16.8 ms,| R,   8.9: 105.6-117.2 11.6 ms,|

         3: W,   1.3: 111.4-117.8  6.4 ms,| X,   1.3: 111.4-117.8  6.4 ms,| W,   9.1: 114.7-117.2  2.5 ms,| X,   9.2: 114.4-117.2  2.8 ms,| R,   1.4: 115.2-117.8  2.6 ms,| R,   9.4:  97.4-117.2 19.8 ms,|

         4: W,   1.5: 115.3-117.8  2.5 ms,| X,   1.5: 115.3-117.8  2.5 ms,| W,   9.5: 112.4-117.3  4.8 ms,| X,   9.7: 114.4-117.3  2.8 ms,| R,   1.6: 105.9-117.8 11.9 ms,| R,   9.8: 112.5-117.3  4.7 ms,|

         5: W,   1.7: 115.4-117.8  2.4 ms,| X,   1.7: 115.4-117.8  2.4 ms,| W,  10.0: 114.7-117.3  2.6 ms,| X,  10.1: 114.5-117.3  2.8 ms,| R,   1.8:  97.6-117.8 20.2 ms,| R,  10.3: 114.7-117.3  2.6 ms,|

         6: W,   1.9: 113.0-117.8  4.8 ms,| X,   1.9: 113.0-117.8  4.8 ms,| W,  10.4: 114.9-117.3  2.4 ms,| X,  10.6: 114.5-117.3  2.8 ms,| R,   2.0: 115.1-117.8  2.6 ms,| R,  10.8: 114.7-117.3  2.6 ms,|

         7: W,   2.1: 115.2-117.8  2.6 ms,| X,   2.1: 115.2-117.8  2.6 ms,| W,  11.0: 114.9-117.3  2.4 ms,| X,  11.2: 114.5-117.4  2.8 ms,| R,   2.2:  90.8-117.8 27.0 ms,| R,  11.4: 114.3-117.6  3.4 ms,|

        • Re: clFlush() and Event Profiling Problem.
          himanshu.gautam

          Hi I am able to reproduce the same result here. I will forward it to concern people and get back to you. Meanwhile i want you to check the OpenCL spec 1.2, it may help you in sorting out this problem.

          • Re: clFlush() and Event Profiling Problem.
            german

            1. The program itself contains 2 bugs:

                 a) line 64             clEnqueueNDRangeKernel(cqs[0],kernel,1,NULL,&gws,&lws,1,NULL,&evt[6*i+1]);

                 You send 1 for the number of wait events, but the list is NULL. Runtime fails that call
                  b) From line 79

                    uint8_t *p(data);

                    if(use_clflush) {

                        ::memset(tempBuf0,0,buf_sz);

                        memcpy(tempBuf0,p,buf_sz);

                    } else {

                        ::memset(tempBuf1,0,buf_sz);

                        memcpy(tempBuf1,p,buf_sz);

                    }

                  The code will copy data without actual wait for the clEnqueueReadBuffer calls. The program has to call clFinish/clWaitForEvents or clEnqueueReadBuffer should be a blocking command.

             

            2. The execution times are correct and match the case without clFlush(), but the timeline is broken. The incorrect timeline reported for one queue will be fixed in the new driver releases. OpenCL runtime has to readjust reported GPU timer values to the CPU time line. There are sync points when calibration occurs. With frequent clFlushes the sync point could get a wrong value for the CPU tick. Basically it was the end of entire execution. That's why you see almost the same value for the end in all commands on one queue.