
Archives Discussions

joen
Journeyman III

Concurrent execution on multiple GPUs

Hi, everyone!

I have run into a problem with concurrent kernel execution on multiple GPUs.

There are two AMD Radeon HD 7850 GPUs on my computer, and the OS is Ubuntu 12.04 LTS.

I installed the AMD APP SDK and GPU driver using the following steps:

  1. install the GPU driver, Catalyst 12.10
  2. install the AMD APP SDK 2.7
  3. set COMPUTE=:0
  4. set DISPLAY=:0
  5. aticonfig --adapter=all --initial -f
  6. reboot

Then, I wrote a testing program:

The program first gets the two devices and creates a separate command queue for each.

One kernel is enqueued to each command queue, and the two kernels should execute independently.

However, I found that the two kernels do not execute concurrently; they run serially even when I use a separate context for each device.

How do I utilize the two GPUs at the same time?

Best regards

Joen

0 Likes
12 Replies
nou
Exemplar

How do you enqueue the kernels? The proper way from one thread is to enqueue a kernel to each queue, flush both queues, and then call wait-for-events or finish. The AMD implementation doesn't start executing a task until its queue is flushed; blocking operations call flush implicitly.
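That pattern could be sketched roughly like this (a fragment, not a complete program: it assumes `queue[0]`/`queue[1]` and `kernel[0]`/`kernel[1]` have already been created for the two devices, and error checking is omitted):

```c
/* Sketch only: two command queues and two kernels assumed already built. */
cl_event ev[2];
size_t global_size = 10240;

/* 1. Enqueue to BOTH queues before waiting on anything. */
for (int i = 0; i < 2; i++)
    clEnqueueNDRangeKernel(queue[i], kernel[i], 1, NULL,
                           &global_size, NULL, 0, NULL, &ev[i]);

/* 2. Flush both queues so the runtime actually submits the work;
 *    the AMD implementation may not start a task until its queue is flushed. */
for (int i = 0; i < 2; i++)
    clFlush(queue[i]);

/* 3. Only now block; both kernels can run concurrently in the meantime. */
clWaitForEvents(2, ev);
```

The key point is that any blocking call (a blocking read, `clFinish`, `clWaitForEvents`) before the second enqueue serializes the two devices.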

0 Likes
joen
Journeyman III

Host program: http://codeviewer.org/view/code:2c31

Kernel Code: http://codeviewer.org/view/code:2c30

Yes, I enqueued multiple kernels, and each device has its own thread that enqueues that device's kernel.

However, the two kernels still seem to execute serially.

Best regards

Joen

0 Likes

I don't see anything wrong with the code. But a global size of 1024 may be too small, and a single invocation too short, to observe actual concurrent execution. Use a bigger global size, such as 10000 or more, and enqueue the kernel several times on each device, because by the time the CPU begins enqueuing to the second device, the first kernel has most likely finished.

Also, what method do you use to examine the kernel execution order? The CodeXL profiler?

0 Likes
joen
Journeyman III

I enlarged the work size from 1024 to 10240 and used CodeXL to profile the process.

Here are my results:

Modified Host program: http://codeviewer.org/view/code:2c45

Modified Kernel code: http://codeviewer.org/view/code:2c44

Using 1 device and 4 kernels on that device: http://www.csie.ntu.edu.tw/~r00922126/experiment/1dev_4kernels.png

Time used: 23.7s

Using 2 devices and 2 kernels for each device: http://www.csie.ntu.edu.tw/~r00922126/experiment/2dev_2kernels.png

Time used: 23.9 s

Using 1 device and 8 kernels on that device: http://www.csie.ntu.edu.tw/~r00922126/experiment/1dev_8kernels.png

Time used: 47.3s

Using 2 devices and 4 kernels for each device: http://www.csie.ntu.edu.tw/~r00922126/experiment/2dev_4kernels.png

Time used: 47.5s

Profiling logs: http://www.csie.ntu.edu.tw/~r00922126/experiment/experiment.zip

Although the pictures show the kernels executing concurrently, there seems to be a bug: the interval between clFinish() and the start of kernel execution is far too long.

0 Likes

Do you see "enqueue kernel %d into device\n" and "kernel %d done\n" printed in parallel, or one after the other?

How come your work-group size is 0?

  • _ED(err = clEnqueueNDRangeKernel(queue[id], kernel[id], 1, 0, &work_size, 0, 0, 0, &event[id]));
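Passing 0 (NULL) for the local_work_size argument lets the runtime pick a work-group size; to request 256 explicitly, the call would look like this (hypothetical fragment, reusing the names from the posted code):

```c
size_t local_size = 256;  /* must evenly divide work_size, here 10240 */
_ED(err = clEnqueueNDRangeKernel(queue[id], kernel[id], 1, NULL,
                                 &work_size, &local_size, 0, NULL, &event[id]));
```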

0 Likes
joen
Journeyman III

Sorry, that may have been a bug; I have now assigned a local work-item size of 256. The result is the same as the results above.

And I do see the "enqueue kernel" messages printed in parallel.

0 Likes

Ah yes, I saw that already. I don't know whether it is a bug in the profiler or something else. Don't rely on the OpenCL profiling information; try timing the execution yourself. That is, measure the wall-clock execution time when using only one device and when using two.

0 Likes
joen
Journeyman III

I used the Unix command "time" to measure the execution time, but the result is the same.

Best regards

Joen

0 Likes

Try looking at the SimpleMultiDevice example from the SDK.

0 Likes
joen
Journeyman III

The following output is the result of running SimpleMultiDevice:

----------------------------------------------------------

CPU + GPU Test 1 : Single context Single Thread

----------------------------------------------------------

Total time : 43

Time of CPU : 42.7046

Time of GPU : 2.12667

----------------------------------------------------------

CPU + GPU Test 2 : Multiple context Single Thread

----------------------------------------------------------

Total time : 43

Time of CPU : 58.2919

Time of GPU : 1.27081

----------------------------------------------------------

CPU + GPU Test 3 : Multiple context Multiple Thread

----------------------------------------------------------

Total time : 45

Time of CPU : 44.2755

Time of GPU : 1.27111

----------------------------------------------------------

Multi GPU Test 1 : Single context Single Thread

----------------------------------------------------------

Total time : 4

Time of GPU0 : 1.20148

Time of GPU1 : 1.20104

----------------------------------------------------------

Multi GPU Test 2 : Multiple context Single Thread

----------------------------------------------------------

Total time : 5

Time of GPU0 : 1.26904

Time of GPU1 : 1.26919

----------------------------------------------------------

Multi GPU Test 3 : Multiple context Multiple Thread

----------------------------------------------------------

Total time : 4

Time of GPU0 : 1.2757

Time of GPU1 : 1.20118

The order of execution seems to be serial. Moreover, the CodeXL profiling picture also shows serial execution.

Best regards

Joen

0 Likes

Check http://devgurus.amd.com/message/1285364#1285364

It seems you can't see any transfer overlap in the profiler. Also check out the flag mentioned there; it might help.

0 Likes
joen
Journeyman III

Finally, I changed the GPU driver to Catalyst 12.6a, and kernel execution is concurrent now!

It seems to be a bug in the latest driver, Catalyst 12.10.

0 Likes