cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

pramodv_a
Journeyman III

clFlush() and Event Profiling Problem.

Hi,

I am trying to parallelize DMA and computation with two in order queues, facing strange Event Profiling results when clFlush() is being used.

On AMD GPU almost all events recorded with same/invalid time stamps. (Please refer to to the results stated below, marked in red)


This method is shared by "Nvidia_OpenCL_SDK_4_2_Linux.zip" (oclCopyComputeOverlap.cpp), and works fine on Nvidia GPU's.

Attached the source code for your reference. (in-order-2queues.cpp)

I am not able to figure out what could possibly going wrong with this approach on AMD.

Please advise.

Thank You

Regards,

-Pramod

Queues used: Two In order queues

Code snippet:

_________________________________________________________________________________________

for(uint8_t *ptr(data); ptr < data+dat_sz; ptr+=(2*buf_sz), i++) {

            clEnqueueWriteBuffer(cqs[0],bufs[0],CL_FALSE,0,buf_sz,ptr,0,NULL,&evt[6*i+0]);

                clFlush(cqs[0]);

                clFlush(cqs[1]);

                clSetKernelArg(kernel,0,sizeof(cl_mem),(void *)&bufs[0]);

            clEnqueueNDRangeKernel(cqs[0],kernel,1,NULL,&gws,&lws,1,NULL,&evt[6*i+1]);

            clEnqueueWriteBuffer(cqs[1],bufs[1],CL_FALSE,0,buf_sz,ptr+buf_sz,0,NULL,&evt[6*i+2]);

                clFlush(cqs[0]);

                clFlush(cqs[1]);

                clSetKernelArg(kernel,0,sizeof(cl_mem),(void *)&bufs[1]);

            clEnqueueNDRangeKernel(cqs[1],kernel,1,NULL,&gws,&lws,0,NULL,&evt[6*i+3]);

            clEnqueueReadBuffer(cqs[0],bufs[0],CL_FALSE,0,buf_sz,ptr,0,NULL,&evt[6*i+4]);

                clFlush(cqs[0]);

                clFlush(cqs[1]);

            clEnqueueReadBuffer(cqs[1],bufs[1],CL_FALSE,0,buf_sz,ptr+buf_sz,0,NULL,&evt[6*i+5]);

        }

_________________________________________________________________________________________

Other Details:

Driver:

amd-catalyst-13.8-beta2-linux-x86.x86_64.run

SDK:

AMD-APP-SDK-v2.8.1.0-RC-lnx64.tgz

Device details from clinfo command

--------------------------------------------------------------

  Platform Name:     AMD Accelerated Parallel Processing
Number of devices:     2
  Device Type:     CL_DEVICE_TYPE_GPU
  Device ID:     4098
  Board name:     AMD Radeon HD 7900 Series
  Device Topology:     PCI[ B#15, D#0, F#0 ]
  Name:     Tahiti
  Vendor:     Advanced Micro Devices, Inc.
  Device OpenCL C version:     OpenCL C 1.2
  Driver version:     1307.1 (VM)
  Profile:     FULL_PROFILE
  Version:     OpenCL 1.2 AMD-APP (1307.1)

Results:

without clFlush()

buf_sz = 16 MB    data_sz=256 MB

Write/Kernel/Read  2 queues,  ALLOC_HOST,

CMD, SUBMIT:  START-END TIME

   0: W,   0.0:   0.7-  9.0  8.3 ms,| X,   0.0:   0.7-  9.0  8.3 ms,| W,   0.8:   2.2- 10.7  8.4 ms,| X,   2.2:  10.7- 17.7  7.1 ms,| R,   0.7:   9.0- 14.3  5.3 ms,| R,   9.9:  17.7- 69.2 51.4 ms,|

   1: W,   2.2:  14.3- 18.1  3.8 ms,| X,   2.2:  14.3- 18.1  3.8 ms,| W,  68.7:  69.2- 71.6  2.4 ms,| X,  69.2:  71.6- 74.4  2.8 ms,| R,   2.6:  18.1- 38.3 20.1 ms,| R,  69.3:  74.4- 77.0  2.6 ms,|

   2: W,   9.6:  38.3- 40.7  2.4 ms,| X,   9.6:  38.3- 40.7  2.4 ms,| W,  69.4:  77.0- 79.4  2.4 ms,| X,  69.6:  79.4- 82.3  2.8 ms,| R,  10.0:  40.7- 43.3  2.6 ms,| R,  69.7:  82.3- 84.9  2.6 ms,|

   3: W,  10.2:  43.3- 45.8  2.5 ms,| X,  10.2:  43.3- 45.8  2.5 ms,| W,  69.8:  84.9- 87.3  2.4 ms,| X,  69.9:  87.3- 90.1  2.8 ms,| R,  10.3:  45.8- 48.4  2.6 ms,| R,  70.0:  90.1- 92.7  2.6 ms,|

   4: W,  10.5:  48.4- 50.8  2.4 ms,| X,  10.5:  48.4- 50.8  2.4 ms,| W,  70.2:  92.7- 95.1  2.4 ms,| X,  70.3:  95.1- 98.0  2.8 ms,| R,  10.7:  50.8- 53.4  2.6 ms,| R,  70.4:  98.0-100.6  2.6 ms,|

   5: W,  10.8:  53.4- 55.9  2.4 ms,| X,  10.8:  53.4- 55.9  2.4 ms,| W,  70.5: 100.6-103.0  2.4 ms,| X,  70.6: 103.0-105.8  2.8 ms,| R,  11.0:  55.9- 58.5  2.6 ms,| R,  70.7: 105.8-108.5  2.7 ms,|

   6: W,  11.1:  58.5- 60.9  2.4 ms,| X,  11.1:  58.5- 60.9  2.4 ms,| W,  70.9: 108.5-110.9  2.4 ms,| X,  71.0: 110.9-113.7  2.8 ms,| R,  11.3:  60.9- 63.5  2.6 ms,| R,  71.1: 113.7-116.4  2.6 ms,|

   7: W,  11.4:  63.5- 66.0  2.4 ms,| X,  11.4:  63.5- 66.0  2.4 ms,| W,  71.2: 116.4-118.8  2.4 ms,| X,  71.3: 118.8-121.6  2.8 ms,| R,  11.6:  66.0-124.7 58.7 ms,| R,  71.5: 121.6-124.4  2.8 ms,|

with clFlush()

buf_sz = 16 MB    data_sz=256 MB

Write/Kernel/Read  2 queues,  ALLOC_HOST,

CMD, SUBMIT:  START-END TIME

   0: W,   0.0: 122.0-126.7  4.7 ms,| X,   0.0: 122.0-126.7  4.7 ms,| W,   0.2:   3.2-  7.9  4.7 ms,| X,   0.6:   5.1-  7.9  2.8 ms,| R,   0.1: 123.8-126.7  2.8 ms,| R,   7.9:  10.9- 20.7  9.8 ms,|

   1: W,   0.2: 123.6-126.7  3.1 ms,| X,   0.2: 123.6-126.7  3.1 ms,| W,   8.0:  20.7- 23.5  2.8 ms,| X,   8.1:  20.7- 23.5  2.8 ms,| R,   0.6: 117.2-126.7  9.5 ms,| R,   8.2: 106.5-123.0 16.5 ms,|

   2: W,   0.9: 124.1-126.7  2.6 ms,| X,   0.9: 124.1-126.7  2.6 ms,| W,   8.4: 118.2-123.0  4.8 ms,| X,   8.5: 120.2-123.0  2.8 ms,| R,   1.2: 109.7-126.7 17.0 ms,| R,   8.6: 109.3-123.0 13.7 ms,|

   3: W,   1.6: 121.8-126.7  4.9 ms,| X,   1.6: 121.8-126.7  4.9 ms,| W,   8.7: 120.5-123.0  2.5 ms,| X,   8.8: 120.2-123.0  2.8 ms,| R,   1.7: 124.1-126.7  2.6 ms,| R,   8.9: 109.1-123.0 13.9 ms,|

   4: W,   1.8: 124.3-126.7  2.5 ms,| X,   1.8: 124.3-126.7  2.5 ms,| W,   9.1: 118.2-123.0  4.8 ms,| X,   9.2: 120.2-123.0  2.8 ms,| R,   2.2: 112.8-126.7 14.0 ms,| R,   9.3: 112.2-123.0 10.9 ms,|

   5: W,   2.3: 124.3-126.8  2.4 ms,| X,   2.3: 124.3-126.8  2.4 ms,| W,   9.4: 120.5-123.0  2.6 ms,| X,   9.5: 120.2-123.0  2.8 ms,| R,   2.4: 112.4-126.8 14.3 ms,| R,   9.6: 120.4-123.1  2.6 ms,|

   6: W,   2.7: 121.9-126.8  4.8 ms,| X,   2.7: 121.9-126.8  4.8 ms,| W,   9.8: 120.6-123.1  2.4 ms,| X,  10.0: 120.2-123.1  2.8 ms,| R,   2.8: 124.1-126.8  2.6 ms,| R,  10.2: 120.4-123.1  2.6 ms,|

   7: W,   2.9: 124.2-126.8  2.6 ms,| X,   2.9: 124.2-126.8  2.6 ms,| W,  10.3: 120.6-123.1  2.4 ms,| X,  10.5: 120.3-123.1  2.8 ms,| R,   3.0:  90.6-126.9 36.3 ms,| R,  10.7: 120.2-126.6  6.4 ms,|

Data Matched

________________________________________________________________________________

without clFlush()

buf_sz = 16 MB    data_sz=256 MB

Write/Kernel/Read  2 queues,  USE_HOST,

CMD, SUBMIT:  START-END TIME

   0: W,   0.1:   0.4-  5.3  4.9 ms,| X,   0.1:   0.4-  5.3  4.9 ms,| W,   0.0:   0.5-  5.4  4.9 ms,| X,   0.6:   5.4-  8.2  2.8 ms,| R,   0.5:   5.3-  8.1  2.8 ms,| R,   1.7:   8.2- 13.3  5.1 ms,|

   1: W,   0.7:   8.1- 10.6  2.6 ms,| X,   0.7:   8.1- 10.6  2.6 ms,| W,  13.1:  13.3- 16.3  3.0 ms,| X,  13.3:  16.3- 19.1  2.8 ms,| R,   0.8:  10.6- 15.5  4.9 ms,| R,  13.4:  19.1- 31.4 12.3 ms,|

   2: W,   1.5:  15.5- 18.3  2.8 ms,| X,   1.5:  15.5- 18.3  2.8 ms,| W,  13.5:  31.4- 35.4  4.0 ms,| X,  13.6:  35.4- 38.2  2.8 ms,| R,   1.7:  18.3- 30.3 12.0 ms,| R,  13.7:  38.2- 46.5  8.3 ms,|

   3: W,   1.8:  30.3- 34.4  4.1 ms,| X,   1.8:  30.3- 34.4  4.1 ms,| W,  13.8:  46.5- 49.2  2.7 ms,| X,  13.8:  49.2- 52.0  2.8 ms,| R,   1.9:  34.4- 37.4  3.0 ms,| R,  13.9:  52.0- 77.6 25.6 ms,|

   4: W,   2.0:  37.4- 39.9  2.5 ms,| X,   2.0:  37.4- 39.9  2.5 ms,| W,  14.0:  77.6- 82.4  4.8 ms,| X,  14.1:  82.4- 85.3  2.8 ms,| R,   2.2:  39.9- 48.8  8.9 ms,| R,  14.2:  85.3- 93.0  7.8 ms,|

   5: W,   2.2:  48.8- 51.4  2.6 ms,| X,   2.2:  48.8- 51.4  2.6 ms,| W,  14.3:  93.0- 95.6  2.6 ms,| X,  14.4:  95.6- 98.4  2.8 ms,| R,   2.3:  51.4- 77.6 26.2 ms,| R,  14.5:  98.4-101.1  2.6 ms,|

   6: W,   2.4:  77.6- 82.4  4.8 ms,| X,   2.4:  77.6- 82.4  4.8 ms,| W,  14.6: 101.1-103.5  2.4 ms,| X,  14.8: 103.5-106.3  2.8 ms,| R,   2.5:  82.4- 85.1  2.6 ms,| R,  14.9: 106.3-108.9  2.6 ms,|

   7: W,   2.6:  85.1- 87.6  2.5 ms,| X,   2.6:  85.1- 87.6  2.5 ms,| W,  15.0: 108.9-111.4  2.4 ms,| X,  15.1: 111.4-114.2  2.8 ms,| R,   2.7:  87.6-118.1 30.5 ms,| R,  15.2: 114.2-118.0  3.8 ms,|

with clFlush()

buf_sz = 16 MB    data_sz=256 MB

Write/Kernel/Read  2 queues,  USE_HOST,

CMD, SUBMIT:  START-END TIME

   0: W,   0.1: 112.8-117.7  4.9 ms,| X,   0.1: 112.8-117.7  4.9 ms,| W,   0.1:   3.2-  8.1  4.9 ms,| X,   0.4:   5.3-  8.1  2.8 ms,| R,   0.3: 114.9-117.7  2.8 ms,| R,   8.1:  10.8- 15.6  4.9 ms,|

   1: W,   0.5: 115.1-117.7  2.6 ms,| X,   0.5: 115.1-117.7  2.6 ms,| W,   8.2:  16.0- 18.5  2.6 ms,| X,   8.3:  15.7- 18.5  2.8 ms,| R,   0.7: 112.6-117.7  5.1 ms,| R,   8.4: 100.6-117.0 16.4 ms,|

   2: W,   1.0: 115.3-117.7  2.5 ms,| X,   1.0: 115.3-117.7  2.5 ms,| W,   8.6: 110.8-117.1  6.4 ms,| X,   8.8: 114.3-117.2  2.8 ms,| R,   1.2: 101.0-117.8 16.8 ms,| R,   8.9: 105.6-117.2 11.6 ms,|

   3: W,   1.3: 111.4-117.8  6.4 ms,| X,   1.3: 111.4-117.8  6.4 ms,| W,   9.1: 114.7-117.2  2.5 ms,| X,   9.2: 114.4-117.2  2.8 ms,| R,   1.4: 115.2-117.8  2.6 ms,| R,   9.4:  97.4-117.2 19.8 ms,|

   4: W,   1.5: 115.3-117.8  2.5 ms,| X,   1.5: 115.3-117.8  2.5 ms,| W,   9.5: 112.4-117.3  4.8 ms,| X,   9.7: 114.4-117.3  2.8 ms,| R,   1.6: 105.9-117.8 11.9 ms,| R,   9.8: 112.5-117.3  4.7 ms,|

   5: W,   1.7: 115.4-117.8  2.4 ms,| X,   1.7: 115.4-117.8  2.4 ms,| W,  10.0: 114.7-117.3  2.6 ms,| X,  10.1: 114.5-117.3  2.8 ms,| R,   1.8:  97.6-117.8 20.2 ms,| R,  10.3: 114.7-117.3  2.6 ms,|

   6: W,   1.9: 113.0-117.8  4.8 ms,| X,   1.9: 113.0-117.8  4.8 ms,| W,  10.4: 114.9-117.3  2.4 ms,| X,  10.6: 114.5-117.3  2.8 ms,| R,   2.0: 115.1-117.8  2.6 ms,| R,  10.8: 114.7-117.3  2.6 ms,|

   7: W,   2.1: 115.2-117.8  2.6 ms,| X,   2.1: 115.2-117.8  2.6 ms,| W,  11.0: 114.9-117.3  2.4 ms,| X,  11.2: 114.5-117.4  2.8 ms,| R,   2.2:  90.8-117.8 27.0 ms,| R,  11.4: 114.3-117.6  3.4 ms,|

0 Likes
4 Replies
himanshu_gautam
Grandmaster

Hi I am able to reproduce the same result here. I will forward it to concern people and get back to you. Meanwhile i want you to check the OpenCL spec 1.2, it may help you in sorting out this problem.

0 Likes
german
Staff

1. The program itself contains 2 bugs:

     a) line 64             clEnqueueNDRangeKernel(cqs[0],kernel,1,NULL,&gws,&lws,1,NULL,&evt[6*i+1]);

     You send 1 for the number of wait events, but the list is NULL. Runtime fails that call
      b) From line 79

        uint8_t *p(data);

        if(use_clflush) {

            ::memset(tempBuf0,0,buf_sz);

            memcpy(tempBuf0,p,buf_sz);

        } else {

            ::memset(tempBuf1,0,buf_sz);

            memcpy(tempBuf1,p,buf_sz);

        }

      The code will copy data without actual wait for the clEnqueueReadBuffer calls. The program has to call clFinish/clWaitForEvents or clEnqueueReadBuffer should be a blocking command.

2. The execution times are correct and match the case without clFlush(), but the timeline is broken. The incorrect timeline reported for one queue will be fixed in the new driver releases. OpenCL runtime has to readjust reported GPU timer values to the CPU time line. There are sync points when calibration occurs. With frequent clFlushes the sync point could get a wrong value for the CPU tick. Basically it was the end of entire execution. That's why you see almost the same value for the end in all commands on one queue.

Hi,

Thank You for pointing out the bugs.

Waiting for new driver release with timeline fix.

Thank You

Regards

-Pramod

0 Likes

Hi Thanks for your reply Andryeyev. Pramodv_a please fix your bugs and try with new driver once released. thanks

0 Likes