Hi, everyone!
I've run into a problem with concurrent kernel execution on multiple GPUs.
There are two AMD Radeon HD 7850 GPUs in my computer, and the OS is Ubuntu 12.04 LTS.
I installed the AMD APP SDK and the GPU driver according to the following steps:
Then, I wrote a testing program:
The program first gets the two devices and creates a separate command queue for each.
One kernel is enqueued on each command queue, and each kernel should execute independently.
However, I found that the two kernels do not execute concurrently; they run serially, even when I use a separate context for each device.
How do I utilize the two GPUs at the same time?
Best regards
Joen
How do you enqueue the kernels? The proper way from a single thread is to enqueue a kernel to each queue, then flush both queues, and only then call clWaitForEvents or clFinish. The AMD implementation doesn't start executing a task until its queue is flushed; blocking operations call flush implicitly.
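The enqueue/flush/finish ordering described above can be sketched roughly as follows. This is an untested illustration with error checking omitted; the queue and kernel handles (`queue0`, `queue1`, `kernel0`, `kernel1`) are assumed to have been created beforehand, one queue per device:

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: launch one kernel per device so both GPUs can run concurrently.
 * Error checking omitted for brevity. */
void launch_on_two_devices(cl_command_queue queue0, cl_command_queue queue1,
                           cl_kernel kernel0, cl_kernel kernel1,
                           size_t global_size, size_t local_size)
{
    /* 1. Enqueue one kernel on each device's queue first... */
    clEnqueueNDRangeKernel(queue0, kernel0, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue1, kernel1, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);

    /* 2. ...then flush both queues so both devices start working... */
    clFlush(queue0);
    clFlush(queue1);

    /* 3. ...and only then block. Calling clFinish(queue0) before queue1
     * has been flushed can serialize the two launches, because clFinish
     * blocks the host thread while the second queue still holds its work. */
    clFinish(queue0);
    clFinish(queue1);
}
```

The key point is that both clFlush calls happen before either blocking call, so neither device sits idle waiting for the host.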
Host program: http://codeviewer.org/view/code:2c31
Kernel Code: http://codeviewer.org/view/code:2c30
Yes, I enqueued multiple kernels, and each device has its own thread that enqueues its kernel.
However, the two kernels still seem to execute serially.
Best regards
Joen
I don't see anything wrong with the code. But a global size of 1024 may be too small, and a single invocation too short, to observe actual concurrent execution. Use a bigger global size, like 10000 or more, and enqueue the kernel several times on each device, because by the time the CPU begins enqueuing to the second device, the first kernel has most likely already finished.
Also, what method do you use to examine kernel execution order? The CodeXL profiler?
I enlarged the work size from 1024 to 10240 and used CodeXL to profile the process.
These are my results:
Modified Host program: http://codeviewer.org/view/code:2c45
Modified Kernel code: http://codeviewer.org/view/code:2c44
Using 1 device and 4 kernels per device: http://www.csie.ntu.edu.tw/~r00922126/experiment/1dev_4kernels.png
Time used: 23.7 s
Using 2 devices and 2 kernels per device: http://www.csie.ntu.edu.tw/~r00922126/experiment/2dev_2kernels.png
Time used: 23.9 s
Using 1 device and 8 kernels per device: http://www.csie.ntu.edu.tw/~r00922126/experiment/1dev_8kernels.png
Time used: 47.3 s
Using 2 devices and 4 kernels per device: http://www.csie.ntu.edu.tw/~r00922126/experiment/2dev_4kernels.png
Time used: 47.5 s
Profiling logs: http://www.csie.ntu.edu.tw/~r00922126/experiment/experiment.zip
Although the pictures show the kernels executing concurrently, it looks like a bug, because the interval between clFinish() and the kernels actually starting to execute is far too long.
Do you see "enqueue kernel %d into device\n" and "kernel %d done\n" printed in parallel, or one after the other?
How come your workgroup size is 0?
Sorry, that may be a bug; I have now set the local work size to 256, but the results are the same as above.
And I do see the "enqueue kernel" messages printed in parallel.
Ah yes, I've seen this already. I don't know if it's a bug in the profiler or something else. Don't rely on the OpenCL profiling information; time the execution yourself instead. That is, measure the execution time when using only one device and when using two.
I used the Unix command "time" to measure the execution time, but the result is the same.
Best regards
Joen
Try looking at the SimpleMultiDevice example from the SDK.
The following output is the result of running SimpleMultiDevice:
----------------------------------------------------------
CPU + GPU Test 1 : Single context Single Thread
----------------------------------------------------------
Total time : 43
Time of CPU : 42.7046
Time of GPU : 2.12667
----------------------------------------------------------
CPU + GPU Test 2 : Multiple context Single Thread
----------------------------------------------------------
Total time : 43
Time of CPU : 58.2919
Time of GPU : 1.27081
----------------------------------------------------------
CPU + GPU Test 3 : Multiple context Multiple Thread
----------------------------------------------------------
Total time : 45
Time of CPU : 44.2755
Time of GPU : 1.27111
----------------------------------------------------------
Multi GPU Test 1 : Single context Single Thread
----------------------------------------------------------
Total time : 4
Time of GPU0 : 1.20148
Time of GPU1 : 1.20104
----------------------------------------------------------
Multi GPU Test 2 : Multiple context Single Thread
----------------------------------------------------------
Total time : 5
Time of GPU0 : 1.26904
Time of GPU1 : 1.26919
----------------------------------------------------------
Multi GPU Test 3 : Multiple context Multiple Thread
----------------------------------------------------------
Total time : 4
Time of GPU0 : 1.2757
Time of GPU1 : 1.20118
The order of execution still seems to be serial. Furthermore, the profiling pictures from CodeXL also show serial execution.
Best regards
Joen
Check http://devgurus.amd.com/message/1285364#1285364.
It seems you can't see any transfer overlap within the profiler. Also check out the flag; it might help.
Finally, I changed the GPU driver to Catalyst 12.6a, and kernel execution is concurrent now!
It seems to be a bug in the latest driver, Catalyst 12.10.