AnsweredAssumed Answered

clFlush() and Event Profiling Problem.

Question asked by pramodv_a on Aug 26, 2013
Latest reply on Sep 11, 2013 by himanshu.gautam

Hi,

 

I am trying to parallelize DMA and computation with two in order queues, facing strange Event Profiling results when clFlush() is being used.

On AMD GPU almost all events recorded with same/invalid time stamps. (Please refer to to the results stated below, marked in red)


This method is shared by "Nvidia_OpenCL_SDK_4_2_Linux.zip" (oclCopyComputeOverlap.cpp), and works fine on Nvidia GPU's.

Attached the source code for your reference. (in-order-2queues.cpp)

 

I am not able to figure out what could possibly going wrong with this approach on AMD.

Please advise.

 

Thank You

Regards,

-Pramod

 

Queues used: Two In order queues

 

Code snippet:

_________________________________________________________________________________________

 

for(uint8_t *ptr(data); ptr < data+dat_sz; ptr+=(2*buf_sz), i++) {

            clEnqueueWriteBuffer(cqs[0],bufs[0],CL_FALSE,0,buf_sz,ptr,0,NULL,&evt[6*i+0]);

                clFlush(cqs[0]);

                clFlush(cqs[1]);

                clSetKernelArg(kernel,0,sizeof(cl_mem),(void *)&bufs[0]);

            clEnqueueNDRangeKernel(cqs[0],kernel,1,NULL,&gws,&lws,1,NULL,&evt[6*i+1]);

            clEnqueueWriteBuffer(cqs[1],bufs[1],CL_FALSE,0,buf_sz,ptr+buf_sz,0,NULL,&evt[6*i+2]);

                clFlush(cqs[0]);

                clFlush(cqs[1]);

                clSetKernelArg(kernel,0,sizeof(cl_mem),(void *)&bufs[1]);

            clEnqueueNDRangeKernel(cqs[1],kernel,1,NULL,&gws,&lws,0,NULL,&evt[6*i+3]);

            clEnqueueReadBuffer(cqs[0],bufs[0],CL_FALSE,0,buf_sz,ptr,0,NULL,&evt[6*i+4]);

                clFlush(cqs[0]);

                clFlush(cqs[1]);

            clEnqueueReadBuffer(cqs[1],bufs[1],CL_FALSE,0,buf_sz,ptr+buf_sz,0,NULL,&evt[6*i+5]);

        }

_________________________________________________________________________________________

 

Other Details:

 

Driver:

amd-catalyst-13.8-beta2-linux-x86.x86_64.run

 

SDK:

AMD-APP-SDK-v2.8.1.0-RC-lnx64.tgz

 

Device details from clinfo command

--------------------------------------------------------------

  Platform Name:     AMD Accelerated Parallel Processing
Number of devices:     2
  Device Type:     CL_DEVICE_TYPE_GPU
  Device ID:     4098
  Board name:     AMD Radeon HD 7900 Series
  Device Topology:     PCI[ B#15, D#0, F#0 ]
  Name:     Tahiti
  Vendor:     Advanced Micro Devices, Inc.
  Device OpenCL C version:     OpenCL C 1.2
  Driver version:     1307.1 (VM)
  Profile:     FULL_PROFILE
  Version:     OpenCL 1.2 AMD-APP (1307.1)

 

 

Results:

without clFlush()

buf_sz = 16 MB    data_sz=256 MB

Write/Kernel/Read  2 queues,  ALLOC_HOST,

CMD, SUBMIT:  START-END TIME

   0: W,   0.0:   0.7-  9.0  8.3 ms,| X,   0.0:   0.7-  9.0  8.3 ms,| W,   0.8:   2.2- 10.7  8.4 ms,| X,   2.2:  10.7- 17.7  7.1 ms,| R,   0.7:   9.0- 14.3  5.3 ms,| R,   9.9:  17.7- 69.2 51.4 ms,|

   1: W,   2.2:  14.3- 18.1  3.8 ms,| X,   2.2:  14.3- 18.1  3.8 ms,| W,  68.7:  69.2- 71.6  2.4 ms,| X,  69.2:  71.6- 74.4  2.8 ms,| R,   2.6:  18.1- 38.3 20.1 ms,| R,  69.3:  74.4- 77.0  2.6 ms,|

   2: W,   9.6:  38.3- 40.7  2.4 ms,| X,   9.6:  38.3- 40.7  2.4 ms,| W,  69.4:  77.0- 79.4  2.4 ms,| X,  69.6:  79.4- 82.3  2.8 ms,| R,  10.0:  40.7- 43.3  2.6 ms,| R,  69.7:  82.3- 84.9  2.6 ms,|

   3: W,  10.2:  43.3- 45.8  2.5 ms,| X,  10.2:  43.3- 45.8  2.5 ms,| W,  69.8:  84.9- 87.3  2.4 ms,| X,  69.9:  87.3- 90.1  2.8 ms,| R,  10.3:  45.8- 48.4  2.6 ms,| R,  70.0:  90.1- 92.7  2.6 ms,|

   4: W,  10.5:  48.4- 50.8  2.4 ms,| X,  10.5:  48.4- 50.8  2.4 ms,| W,  70.2:  92.7- 95.1  2.4 ms,| X,  70.3:  95.1- 98.0  2.8 ms,| R,  10.7:  50.8- 53.4  2.6 ms,| R,  70.4:  98.0-100.6  2.6 ms,|

   5: W,  10.8:  53.4- 55.9  2.4 ms,| X,  10.8:  53.4- 55.9  2.4 ms,| W,  70.5: 100.6-103.0  2.4 ms,| X,  70.6: 103.0-105.8  2.8 ms,| R,  11.0:  55.9- 58.5  2.6 ms,| R,  70.7: 105.8-108.5  2.7 ms,|

   6: W,  11.1:  58.5- 60.9  2.4 ms,| X,  11.1:  58.5- 60.9  2.4 ms,| W,  70.9: 108.5-110.9  2.4 ms,| X,  71.0: 110.9-113.7  2.8 ms,| R,  11.3:  60.9- 63.5  2.6 ms,| R,  71.1: 113.7-116.4  2.6 ms,|

   7: W,  11.4:  63.5- 66.0  2.4 ms,| X,  11.4:  63.5- 66.0  2.4 ms,| W,  71.2: 116.4-118.8  2.4 ms,| X,  71.3: 118.8-121.6  2.8 ms,| R,  11.6:  66.0-124.7 58.7 ms,| R,  71.5: 121.6-124.4  2.8 ms,|

 

with clFlush()

buf_sz = 16 MB    data_sz=256 MB

Write/Kernel/Read  2 queues,  ALLOC_HOST,

CMD, SUBMIT:  START-END TIME

   0: W,   0.0: 122.0-126.7  4.7 ms,| X,   0.0: 122.0-126.7  4.7 ms,| W,   0.2:   3.2-  7.9  4.7 ms,| X,   0.6:   5.1-  7.9  2.8 ms,| R,   0.1: 123.8-126.7  2.8 ms,| R,   7.9:  10.9- 20.7  9.8 ms,|

   1: W,   0.2: 123.6-126.7  3.1 ms,| X,   0.2: 123.6-126.7  3.1 ms,| W,   8.0:  20.7- 23.5  2.8 ms,| X,   8.1:  20.7- 23.5  2.8 ms,| R,   0.6: 117.2-126.7  9.5 ms,| R,   8.2: 106.5-123.0 16.5 ms,|

   2: W,   0.9: 124.1-126.7  2.6 ms,| X,   0.9: 124.1-126.7  2.6 ms,| W,   8.4: 118.2-123.0  4.8 ms,| X,   8.5: 120.2-123.0  2.8 ms,| R,   1.2: 109.7-126.7 17.0 ms,| R,   8.6: 109.3-123.0 13.7 ms,|

   3: W,   1.6: 121.8-126.7  4.9 ms,| X,   1.6: 121.8-126.7  4.9 ms,| W,   8.7: 120.5-123.0  2.5 ms,| X,   8.8: 120.2-123.0  2.8 ms,| R,   1.7: 124.1-126.7  2.6 ms,| R,   8.9: 109.1-123.0 13.9 ms,|

   4: W,   1.8: 124.3-126.7  2.5 ms,| X,   1.8: 124.3-126.7  2.5 ms,| W,   9.1: 118.2-123.0  4.8 ms,| X,   9.2: 120.2-123.0  2.8 ms,| R,   2.2: 112.8-126.7 14.0 ms,| R,   9.3: 112.2-123.0 10.9 ms,|

   5: W,   2.3: 124.3-126.8  2.4 ms,| X,   2.3: 124.3-126.8  2.4 ms,| W,   9.4: 120.5-123.0  2.6 ms,| X,   9.5: 120.2-123.0  2.8 ms,| R,   2.4: 112.4-126.8 14.3 ms,| R,   9.6: 120.4-123.1  2.6 ms,|

   6: W,   2.7: 121.9-126.8  4.8 ms,| X,   2.7: 121.9-126.8  4.8 ms,| W,   9.8: 120.6-123.1  2.4 ms,| X,  10.0: 120.2-123.1  2.8 ms,| R,   2.8: 124.1-126.8  2.6 ms,| R,  10.2: 120.4-123.1  2.6 ms,|

   7: W,   2.9: 124.2-126.8  2.6 ms,| X,   2.9: 124.2-126.8  2.6 ms,| W,  10.3: 120.6-123.1  2.4 ms,| X,  10.5: 120.3-123.1  2.8 ms,| R,   3.0:  90.6-126.9 36.3 ms,| R,  10.7: 120.2-126.6  6.4 ms,|

Data Matched

 

________________________________________________________________________________

without clFlush()

buf_sz = 16 MB    data_sz=256 MB

Write/Kernel/Read  2 queues,  USE_HOST,

CMD, SUBMIT:  START-END TIME

   0: W,   0.1:   0.4-  5.3  4.9 ms,| X,   0.1:   0.4-  5.3  4.9 ms,| W,   0.0:   0.5-  5.4  4.9 ms,| X,   0.6:   5.4-  8.2  2.8 ms,| R,   0.5:   5.3-  8.1  2.8 ms,| R,   1.7:   8.2- 13.3  5.1 ms,|

   1: W,   0.7:   8.1- 10.6  2.6 ms,| X,   0.7:   8.1- 10.6  2.6 ms,| W,  13.1:  13.3- 16.3  3.0 ms,| X,  13.3:  16.3- 19.1  2.8 ms,| R,   0.8:  10.6- 15.5  4.9 ms,| R,  13.4:  19.1- 31.4 12.3 ms,|

   2: W,   1.5:  15.5- 18.3  2.8 ms,| X,   1.5:  15.5- 18.3  2.8 ms,| W,  13.5:  31.4- 35.4  4.0 ms,| X,  13.6:  35.4- 38.2  2.8 ms,| R,   1.7:  18.3- 30.3 12.0 ms,| R,  13.7:  38.2- 46.5  8.3 ms,|

   3: W,   1.8:  30.3- 34.4  4.1 ms,| X,   1.8:  30.3- 34.4  4.1 ms,| W,  13.8:  46.5- 49.2  2.7 ms,| X,  13.8:  49.2- 52.0  2.8 ms,| R,   1.9:  34.4- 37.4  3.0 ms,| R,  13.9:  52.0- 77.6 25.6 ms,|

   4: W,   2.0:  37.4- 39.9  2.5 ms,| X,   2.0:  37.4- 39.9  2.5 ms,| W,  14.0:  77.6- 82.4  4.8 ms,| X,  14.1:  82.4- 85.3  2.8 ms,| R,   2.2:  39.9- 48.8  8.9 ms,| R,  14.2:  85.3- 93.0  7.8 ms,|

   5: W,   2.2:  48.8- 51.4  2.6 ms,| X,   2.2:  48.8- 51.4  2.6 ms,| W,  14.3:  93.0- 95.6  2.6 ms,| X,  14.4:  95.6- 98.4  2.8 ms,| R,   2.3:  51.4- 77.6 26.2 ms,| R,  14.5:  98.4-101.1  2.6 ms,|

   6: W,   2.4:  77.6- 82.4  4.8 ms,| X,   2.4:  77.6- 82.4  4.8 ms,| W,  14.6: 101.1-103.5  2.4 ms,| X,  14.8: 103.5-106.3  2.8 ms,| R,   2.5:  82.4- 85.1  2.6 ms,| R,  14.9: 106.3-108.9  2.6 ms,|

   7: W,   2.6:  85.1- 87.6  2.5 ms,| X,   2.6:  85.1- 87.6  2.5 ms,| W,  15.0: 108.9-111.4  2.4 ms,| X,  15.1: 111.4-114.2  2.8 ms,| R,   2.7:  87.6-118.1 30.5 ms,| R,  15.2: 114.2-118.0  3.8 ms,|

 

with clFlush()

buf_sz = 16 MB    data_sz=256 MB

Write/Kernel/Read  2 queues,  USE_HOST,

CMD, SUBMIT:  START-END TIME

   0: W,   0.1: 112.8-117.7  4.9 ms,| X,   0.1: 112.8-117.7  4.9 ms,| W,   0.1:   3.2-  8.1  4.9 ms,| X,   0.4:   5.3-  8.1  2.8 ms,| R,   0.3: 114.9-117.7  2.8 ms,| R,   8.1:  10.8- 15.6  4.9 ms,|

   1: W,   0.5: 115.1-117.7  2.6 ms,| X,   0.5: 115.1-117.7  2.6 ms,| W,   8.2:  16.0- 18.5  2.6 ms,| X,   8.3:  15.7- 18.5  2.8 ms,| R,   0.7: 112.6-117.7  5.1 ms,| R,   8.4: 100.6-117.0 16.4 ms,|

   2: W,   1.0: 115.3-117.7  2.5 ms,| X,   1.0: 115.3-117.7  2.5 ms,| W,   8.6: 110.8-117.1  6.4 ms,| X,   8.8: 114.3-117.2  2.8 ms,| R,   1.2: 101.0-117.8 16.8 ms,| R,   8.9: 105.6-117.2 11.6 ms,|

   3: W,   1.3: 111.4-117.8  6.4 ms,| X,   1.3: 111.4-117.8  6.4 ms,| W,   9.1: 114.7-117.2  2.5 ms,| X,   9.2: 114.4-117.2  2.8 ms,| R,   1.4: 115.2-117.8  2.6 ms,| R,   9.4:  97.4-117.2 19.8 ms,|

   4: W,   1.5: 115.3-117.8  2.5 ms,| X,   1.5: 115.3-117.8  2.5 ms,| W,   9.5: 112.4-117.3  4.8 ms,| X,   9.7: 114.4-117.3  2.8 ms,| R,   1.6: 105.9-117.8 11.9 ms,| R,   9.8: 112.5-117.3  4.7 ms,|

   5: W,   1.7: 115.4-117.8  2.4 ms,| X,   1.7: 115.4-117.8  2.4 ms,| W,  10.0: 114.7-117.3  2.6 ms,| X,  10.1: 114.5-117.3  2.8 ms,| R,   1.8:  97.6-117.8 20.2 ms,| R,  10.3: 114.7-117.3  2.6 ms,|

   6: W,   1.9: 113.0-117.8  4.8 ms,| X,   1.9: 113.0-117.8  4.8 ms,| W,  10.4: 114.9-117.3  2.4 ms,| X,  10.6: 114.5-117.3  2.8 ms,| R,   2.0: 115.1-117.8  2.6 ms,| R,  10.8: 114.7-117.3  2.6 ms,|

   7: W,   2.1: 115.2-117.8  2.6 ms,| X,   2.1: 115.2-117.8  2.6 ms,| W,  11.0: 114.9-117.3  2.4 ms,| X,  11.2: 114.5-117.4  2.8 ms,| R,   2.2:  90.8-117.8 27.0 ms,| R,  11.4: 114.3-117.6  3.4 ms,|

Attachments

Outcomes