Hi,
I am trying to parallelize DMA and computation with two in order queues, facing strange Event Profiling results when clFlush() is being used.
On AMD GPU almost all events recorded with same/invalid time stamps. (Please refer to to the results stated below, marked in red)
This method is shared by "Nvidia_OpenCL_SDK_4_2_Linux.zip" (oclCopyComputeOverlap.cpp), and works fine on Nvidia GPU's.
Attached the source code for your reference. (in-order-2queues.cpp)
I am not able to figure out what could possibly going wrong with this approach on AMD.
Please advise.
Thank You
Regards,
-Pramod
Queues used: Two In order queues
Code snippet:
_________________________________________________________________________________________
for(uint8_t *ptr(data); ptr < data+dat_sz; ptr+=(2*buf_sz), i++) {
clEnqueueWriteBuffer(cqs[0],bufs[0],CL_FALSE,0,buf_sz,ptr,0,NULL,&evt[6*i+0]);
clFlush(cqs[0]);
clFlush(cqs[1]);
clSetKernelArg(kernel,0,sizeof(cl_mem),(void *)&bufs[0]);
clEnqueueNDRangeKernel(cqs[0],kernel,1,NULL,&gws,&lws,1,NULL,&evt[6*i+1]);
clEnqueueWriteBuffer(cqs[1],bufs[1],CL_FALSE,0,buf_sz,ptr+buf_sz,0,NULL,&evt[6*i+2]);
clFlush(cqs[0]);
clFlush(cqs[1]);
clSetKernelArg(kernel,0,sizeof(cl_mem),(void *)&bufs[1]);
clEnqueueNDRangeKernel(cqs[1],kernel,1,NULL,&gws,&lws,0,NULL,&evt[6*i+3]);
clEnqueueReadBuffer(cqs[0],bufs[0],CL_FALSE,0,buf_sz,ptr,0,NULL,&evt[6*i+4]);
clFlush(cqs[0]);
clFlush(cqs[1]);
clEnqueueReadBuffer(cqs[1],bufs[1],CL_FALSE,0,buf_sz,ptr+buf_sz,0,NULL,&evt[6*i+5]);
}
_________________________________________________________________________________________
Other Details:
Driver:
amd-catalyst-13.8-beta2-linux-x86.x86_64.run
SDK:
AMD-APP-SDK-v2.8.1.0-RC-lnx64.tgz
Device details from clinfo command
--------------------------------------------------------------
Platform Name: | AMD Accelerated Parallel Processing | |||
Number of devices: | 2 | |||
Device Type: | CL_DEVICE_TYPE_GPU | |||
Device ID: | 4098 | |||
Board name: | AMD Radeon HD 7900 Series | |||
Device Topology: | PCI[ B#15, D#0, F#0 ] |
Name: | Tahiti | ||||
Vendor: | Advanced Micro Devices, Inc. | ||||
Device OpenCL C version: | OpenCL C 1.2 | ||||
Driver version: | 1307.1 (VM) | ||||
Profile: | FULL_PROFILE | ||||
Version: | OpenCL 1.2 AMD-APP (1307.1) |
Results:
without clFlush()
buf_sz = 16 MB data_sz=256 MB
Write/Kernel/Read 2 queues, ALLOC_HOST,
CMD, SUBMIT: START-END TIME
0: W, 0.0: 0.7- 9.0 8.3 ms,| X, 0.0: 0.7- 9.0 8.3 ms,| W, 0.8: 2.2- 10.7 8.4 ms,| X, 2.2: 10.7- 17.7 7.1 ms,| R, 0.7: 9.0- 14.3 5.3 ms,| R, 9.9: 17.7- 69.2 51.4 ms,|
1: W, 2.2: 14.3- 18.1 3.8 ms,| X, 2.2: 14.3- 18.1 3.8 ms,| W, 68.7: 69.2- 71.6 2.4 ms,| X, 69.2: 71.6- 74.4 2.8 ms,| R, 2.6: 18.1- 38.3 20.1 ms,| R, 69.3: 74.4- 77.0 2.6 ms,|
2: W, 9.6: 38.3- 40.7 2.4 ms,| X, 9.6: 38.3- 40.7 2.4 ms,| W, 69.4: 77.0- 79.4 2.4 ms,| X, 69.6: 79.4- 82.3 2.8 ms,| R, 10.0: 40.7- 43.3 2.6 ms,| R, 69.7: 82.3- 84.9 2.6 ms,|
3: W, 10.2: 43.3- 45.8 2.5 ms,| X, 10.2: 43.3- 45.8 2.5 ms,| W, 69.8: 84.9- 87.3 2.4 ms,| X, 69.9: 87.3- 90.1 2.8 ms,| R, 10.3: 45.8- 48.4 2.6 ms,| R, 70.0: 90.1- 92.7 2.6 ms,|
4: W, 10.5: 48.4- 50.8 2.4 ms,| X, 10.5: 48.4- 50.8 2.4 ms,| W, 70.2: 92.7- 95.1 2.4 ms,| X, 70.3: 95.1- 98.0 2.8 ms,| R, 10.7: 50.8- 53.4 2.6 ms,| R, 70.4: 98.0-100.6 2.6 ms,|
5: W, 10.8: 53.4- 55.9 2.4 ms,| X, 10.8: 53.4- 55.9 2.4 ms,| W, 70.5: 100.6-103.0 2.4 ms,| X, 70.6: 103.0-105.8 2.8 ms,| R, 11.0: 55.9- 58.5 2.6 ms,| R, 70.7: 105.8-108.5 2.7 ms,|
6: W, 11.1: 58.5- 60.9 2.4 ms,| X, 11.1: 58.5- 60.9 2.4 ms,| W, 70.9: 108.5-110.9 2.4 ms,| X, 71.0: 110.9-113.7 2.8 ms,| R, 11.3: 60.9- 63.5 2.6 ms,| R, 71.1: 113.7-116.4 2.6 ms,|
7: W, 11.4: 63.5- 66.0 2.4 ms,| X, 11.4: 63.5- 66.0 2.4 ms,| W, 71.2: 116.4-118.8 2.4 ms,| X, 71.3: 118.8-121.6 2.8 ms,| R, 11.6: 66.0-124.7 58.7 ms,| R, 71.5: 121.6-124.4 2.8 ms,|
with clFlush()
buf_sz = 16 MB data_sz=256 MB
Write/Kernel/Read 2 queues, ALLOC_HOST,
CMD, SUBMIT: START-END TIME
0: W, 0.0: 122.0-126.7 4.7 ms,| X, 0.0: 122.0-126.7 4.7 ms,| W, 0.2: 3.2- 7.9 4.7 ms,| X, 0.6: 5.1- 7.9 2.8 ms,| R, 0.1: 123.8-126.7 2.8 ms,| R, 7.9: 10.9- 20.7 9.8 ms,|
1: W, 0.2: 123.6-126.7 3.1 ms,| X, 0.2: 123.6-126.7 3.1 ms,| W, 8.0: 20.7- 23.5 2.8 ms,| X, 8.1: 20.7- 23.5 2.8 ms,| R, 0.6: 117.2-126.7 9.5 ms,| R, 8.2: 106.5-123.0 16.5 ms,|
2: W, 0.9: 124.1-126.7 2.6 ms,| X, 0.9: 124.1-126.7 2.6 ms,| W, 8.4: 118.2-123.0 4.8 ms,| X, 8.5: 120.2-123.0 2.8 ms,| R, 1.2: 109.7-126.7 17.0 ms,| R, 8.6: 109.3-123.0 13.7 ms,|
3: W, 1.6: 121.8-126.7 4.9 ms,| X, 1.6: 121.8-126.7 4.9 ms,| W, 8.7: 120.5-123.0 2.5 ms,| X, 8.8: 120.2-123.0 2.8 ms,| R, 1.7: 124.1-126.7 2.6 ms,| R, 8.9: 109.1-123.0 13.9 ms,|
4: W, 1.8: 124.3-126.7 2.5 ms,| X, 1.8: 124.3-126.7 2.5 ms,| W, 9.1: 118.2-123.0 4.8 ms,| X, 9.2: 120.2-123.0 2.8 ms,| R, 2.2: 112.8-126.7 14.0 ms,| R, 9.3: 112.2-123.0 10.9 ms,|
5: W, 2.3: 124.3-126.8 2.4 ms,| X, 2.3: 124.3-126.8 2.4 ms,| W, 9.4: 120.5-123.0 2.6 ms,| X, 9.5: 120.2-123.0 2.8 ms,| R, 2.4: 112.4-126.8 14.3 ms,| R, 9.6: 120.4-123.1 2.6 ms,|
6: W, 2.7: 121.9-126.8 4.8 ms,| X, 2.7: 121.9-126.8 4.8 ms,| W, 9.8: 120.6-123.1 2.4 ms,| X, 10.0: 120.2-123.1 2.8 ms,| R, 2.8: 124.1-126.8 2.6 ms,| R, 10.2: 120.4-123.1 2.6 ms,|
7: W, 2.9: 124.2-126.8 2.6 ms,| X, 2.9: 124.2-126.8 2.6 ms,| W, 10.3: 120.6-123.1 2.4 ms,| X, 10.5: 120.3-123.1 2.8 ms,| R, 3.0: 90.6-126.9 36.3 ms,| R, 10.7: 120.2-126.6 6.4 ms,|
Data Matched
________________________________________________________________________________
without clFlush()
buf_sz = 16 MB data_sz=256 MB
Write/Kernel/Read 2 queues, USE_HOST,
CMD, SUBMIT: START-END TIME
0: W, 0.1: 0.4- 5.3 4.9 ms,| X, 0.1: 0.4- 5.3 4.9 ms,| W, 0.0: 0.5- 5.4 4.9 ms,| X, 0.6: 5.4- 8.2 2.8 ms,| R, 0.5: 5.3- 8.1 2.8 ms,| R, 1.7: 8.2- 13.3 5.1 ms,|
1: W, 0.7: 8.1- 10.6 2.6 ms,| X, 0.7: 8.1- 10.6 2.6 ms,| W, 13.1: 13.3- 16.3 3.0 ms,| X, 13.3: 16.3- 19.1 2.8 ms,| R, 0.8: 10.6- 15.5 4.9 ms,| R, 13.4: 19.1- 31.4 12.3 ms,|
2: W, 1.5: 15.5- 18.3 2.8 ms,| X, 1.5: 15.5- 18.3 2.8 ms,| W, 13.5: 31.4- 35.4 4.0 ms,| X, 13.6: 35.4- 38.2 2.8 ms,| R, 1.7: 18.3- 30.3 12.0 ms,| R, 13.7: 38.2- 46.5 8.3 ms,|
3: W, 1.8: 30.3- 34.4 4.1 ms,| X, 1.8: 30.3- 34.4 4.1 ms,| W, 13.8: 46.5- 49.2 2.7 ms,| X, 13.8: 49.2- 52.0 2.8 ms,| R, 1.9: 34.4- 37.4 3.0 ms,| R, 13.9: 52.0- 77.6 25.6 ms,|
4: W, 2.0: 37.4- 39.9 2.5 ms,| X, 2.0: 37.4- 39.9 2.5 ms,| W, 14.0: 77.6- 82.4 4.8 ms,| X, 14.1: 82.4- 85.3 2.8 ms,| R, 2.2: 39.9- 48.8 8.9 ms,| R, 14.2: 85.3- 93.0 7.8 ms,|
5: W, 2.2: 48.8- 51.4 2.6 ms,| X, 2.2: 48.8- 51.4 2.6 ms,| W, 14.3: 93.0- 95.6 2.6 ms,| X, 14.4: 95.6- 98.4 2.8 ms,| R, 2.3: 51.4- 77.6 26.2 ms,| R, 14.5: 98.4-101.1 2.6 ms,|
6: W, 2.4: 77.6- 82.4 4.8 ms,| X, 2.4: 77.6- 82.4 4.8 ms,| W, 14.6: 101.1-103.5 2.4 ms,| X, 14.8: 103.5-106.3 2.8 ms,| R, 2.5: 82.4- 85.1 2.6 ms,| R, 14.9: 106.3-108.9 2.6 ms,|
7: W, 2.6: 85.1- 87.6 2.5 ms,| X, 2.6: 85.1- 87.6 2.5 ms,| W, 15.0: 108.9-111.4 2.4 ms,| X, 15.1: 111.4-114.2 2.8 ms,| R, 2.7: 87.6-118.1 30.5 ms,| R, 15.2: 114.2-118.0 3.8 ms,|
with clFlush()
buf_sz = 16 MB data_sz=256 MB
Write/Kernel/Read 2 queues, USE_HOST,
CMD, SUBMIT: START-END TIME
0: W, 0.1: 112.8-117.7 4.9 ms,| X, 0.1: 112.8-117.7 4.9 ms,| W, 0.1: 3.2- 8.1 4.9 ms,| X, 0.4: 5.3- 8.1 2.8 ms,| R, 0.3: 114.9-117.7 2.8 ms,| R, 8.1: 10.8- 15.6 4.9 ms,|
1: W, 0.5: 115.1-117.7 2.6 ms,| X, 0.5: 115.1-117.7 2.6 ms,| W, 8.2: 16.0- 18.5 2.6 ms,| X, 8.3: 15.7- 18.5 2.8 ms,| R, 0.7: 112.6-117.7 5.1 ms,| R, 8.4: 100.6-117.0 16.4 ms,|
2: W, 1.0: 115.3-117.7 2.5 ms,| X, 1.0: 115.3-117.7 2.5 ms,| W, 8.6: 110.8-117.1 6.4 ms,| X, 8.8: 114.3-117.2 2.8 ms,| R, 1.2: 101.0-117.8 16.8 ms,| R, 8.9: 105.6-117.2 11.6 ms,|
3: W, 1.3: 111.4-117.8 6.4 ms,| X, 1.3: 111.4-117.8 6.4 ms,| W, 9.1: 114.7-117.2 2.5 ms,| X, 9.2: 114.4-117.2 2.8 ms,| R, 1.4: 115.2-117.8 2.6 ms,| R, 9.4: 97.4-117.2 19.8 ms,|
4: W, 1.5: 115.3-117.8 2.5 ms,| X, 1.5: 115.3-117.8 2.5 ms,| W, 9.5: 112.4-117.3 4.8 ms,| X, 9.7: 114.4-117.3 2.8 ms,| R, 1.6: 105.9-117.8 11.9 ms,| R, 9.8: 112.5-117.3 4.7 ms,|
5: W, 1.7: 115.4-117.8 2.4 ms,| X, 1.7: 115.4-117.8 2.4 ms,| W, 10.0: 114.7-117.3 2.6 ms,| X, 10.1: 114.5-117.3 2.8 ms,| R, 1.8: 97.6-117.8 20.2 ms,| R, 10.3: 114.7-117.3 2.6 ms,|
6: W, 1.9: 113.0-117.8 4.8 ms,| X, 1.9: 113.0-117.8 4.8 ms,| W, 10.4: 114.9-117.3 2.4 ms,| X, 10.6: 114.5-117.3 2.8 ms,| R, 2.0: 115.1-117.8 2.6 ms,| R, 10.8: 114.7-117.3 2.6 ms,|
7: W, 2.1: 115.2-117.8 2.6 ms,| X, 2.1: 115.2-117.8 2.6 ms,| W, 11.0: 114.9-117.3 2.4 ms,| X, 11.2: 114.5-117.4 2.8 ms,| R, 2.2: 90.8-117.8 27.0 ms,| R, 11.4: 114.3-117.6 3.4 ms,|
Hi I am able to reproduce the same result here. I will forward it to concern people and get back to you. Meanwhile i want you to check the OpenCL spec 1.2, it may help you in sorting out this problem.
1. The program itself contains 2 bugs:
a) line 64 clEnqueueNDRangeKernel(cqs[0],kernel,1,NULL,&gws,&lws,1,NULL,&evt[6*i+1]);
You send 1 for the number of wait events, but the list is NULL. Runtime fails that call
b) From line 79
uint8_t *p(data);
if(use_clflush) {
::memset(tempBuf0,0,buf_sz);
memcpy(tempBuf0,p,buf_sz);
} else {
::memset(tempBuf1,0,buf_sz);
memcpy(tempBuf1,p,buf_sz);
}
The code will copy data without actual wait for the clEnqueueReadBuffer calls. The program has to call clFinish/clWaitForEvents or clEnqueueReadBuffer should be a blocking command.
2. The execution times are correct and match the case without clFlush(), but the timeline is broken. The incorrect timeline reported for one queue will be fixed in the new driver releases. OpenCL runtime has to readjust reported GPU timer values to the CPU time line. There are sync points when calibration occurs. With frequent clFlushes the sync point could get a wrong value for the CPU tick. Basically it was the end of entire execution. That's why you see almost the same value for the end in all commands on one queue.
Hi,
Thank You for pointing out the bugs.
Waiting for new driver release with timeline fix.
Thank You
Regards
-Pramod
Hi Thanks for your reply Andryeyev. Pramodv_a please fix your bugs and try with new driver once released. thanks