
w00
Adept I

Execution of OpenCL Kernel on more than one GPU

Hello,

I want to execute different instances of the same kernel on two AMD Radeon HD 6870 cards at the same time.

Every kernel instance is launched by its own instance of the host program.

I have an AMD Sempron(tm) 140 processor, which has only one core.

1 CPU core --> host program 1 --> kernel instance (GPU 1)
           --> host program 2 --> kernel instance (GPU 2)

The problem is that while one kernel is executing on one GPU, I can't start the other instance of the kernel on the other GPU.

The second GPU has to wait until the first kernel execution has finished.

Can I make it work?

Or could I execute two kernels on two GPUs with a two-core CPU?

I'm running Debian Linux:

Linux node4 2.6.32-5-amd64 #1 SMP Tue May 13 16:34:35 UTC 2014 x86_64 GNU/Linux

AMD graphics driver version:

[fglrx] module loaded - fglrx 14.10.2 [May  5 2014] with 2 minors

NOTE: I'm connecting to a shell via SSH.

greets

8 Replies
nou
Exemplar

Do you run two instances of your program? Are you sure that you run them on separate devices?


Hello,

Thanks for your answer, nou.

I did some tests.

Idle temperatures of the cards:

user@node4:~$ time amdconfig --adapter=all --odgt

Adapter 0 - AMD Radeon HD 6800 Series

            Sensor 0: Temperature - 41.00 C

Adapter 1 - AMD Radeon HD 6800 Series

            Sensor 0: Temperature - 36.00 C

real    0m0.010s

user    0m0.008s

sys     0m0.000s

If I run the host program on GPU no. 1, I get the following temperatures:

user@node4:~$ time amdconfig --adapter=all --odgt

Adapter 0 - AMD Radeon HD 6800 Series

            Sensor 0: Temperature - 53.00 C

Adapter 1 - AMD Radeon HD 6800 Series

            Sensor 0: Temperature - 37.00 C

real    0m4.523s

user    0m0.000s

sys     0m0.008s

As you can see, amdconfig does not return immediately.

ps aux on pts/1 prints:

user      2287  119  3.9 215152 156824 pts/0   R+   20:05   0:02 ./main pci-e:02:00.0 1 10000000000

The same procedure for GPU no. 2 gives me the following:

user@node4:~$ time amdconfig --adapter=all --odgt

Adapter 0 - AMD Radeon HD 6800 Series

            Sensor 0: Temperature - 42.00 C

Adapter 1 - AMD Radeon HD 6800 Series

            Sensor 0: Temperature - 50.00 C

real    0m0.010s

user    0m0.008s

sys     0m0.000s

As you can see, amdconfig returns immediately.

user      2290 13.3  0.8  93004 34756 pts/2    Sl+  20:06   0:02 ./main pci-e:03:00.0 1 10000000000

Now, two instances of the host program at the same time. I started the second host program a few seconds later:

user      2329 24.1  0.8  92980 34752 pts/0    Sl+  20:14   0:02 ./main pci-e:02:00.0 1 10000000000

user      2331 95.7  0.3  80876 13892 pts/2    R+   20:14   0:03 ./main pci-e:03:00.0 1 10000000000

After the first host program returned:

user      2331 48.7  0.8  93020 34752 pts/2    Sl+  20:14   0:39 ./main pci-e:03:00.0 1 10000000000

The temperatures:

user@node4:~$ time amdconfig --adapter=all --odgt

Adapter 0 - AMD Radeon HD 6800 Series

            Sensor 0: Temperature - 53.00 C

Adapter 1 - AMD Radeon HD 6800 Series

            Sensor 0: Temperature - 38.00 C

real    0m35.927s

user    0m0.000s

sys     0m0.008s

It looks like the second card is not running, and the return delay of the amdconfig call is very high.

The 35 seconds appear to be the time until the first kernel returns, after which the driver can complete the amdconfig call.

Now I start both host programs at approximately the same time:

user      2351 30.3  0.8  93028 34756 pts/0    Sl+  20:21   0:02 ./main pci-e:02:00.0 1 10000000000

user      2352 34.5  0.8  93028 34752 pts/2    Sl+  20:21   0:02 ./main pci-e:03:00.0 1 10000000000

The temperatures:

user@node4:~$ time amdconfig --adapter=all --odgt

Adapter 0 - AMD Radeon HD 6800 Series

            Sensor 0: Temperature - 54.00 C

Adapter 1 - AMD Radeon HD 6800 Series

            Sensor 0: Temperature - 49.00 C

real    0m4.524s

user    0m0.004s

sys     0m0.008s

It looks like both cards can be used at the same time; the runtime of each host program is approximately the same.

In the previous case it was not.

Sorry for the long log and thanks for your help.

greets

maxdz8
Elite

You can enqueue (start) K kernels on D devices, even from a single core, as long as you don't call clFlush, clFinish, or another synchronizing call (a blocking Map, Read, ...).

If you want to get serious help, describe your program!


Hello maxdz8,

in my host program, after enqueueing the kernel, I use the following code:

err = clEnqueueReadBuffer(queue, output_buffer, CL_TRUE, 0,
                          sizeof(output_data), output_data, 0, NULL, NULL);
if (err < 0) {
    perror("Couldn't read the buffer");
    client_cleanup(EXIT_FAILURE);
}

clFlush(queue);
clFinish(queue);

It's a blocking read (CL_TRUE). Should I avoid this?

The queue is for one card only.

I call the host program with a variable parameter list.

At the end of the host program I release all allocated CL objects.

greets


You should really, really, really read the manual, w00. You don't get anywhere in programming by stacking function calls without knowing what you're doing. Do you know what you're doing?

That snippet contains two blocking barriers (+1 interesting call):

  1. The read call will indeed block your thread until the buffer is copied back;
  2. clFlush - which is usually not needed here - will correctly dispatch the queue to the associated device, unfortunately...
  3. clFinish will block your thread until the GPU completes execution, as per the specification.

Doing a Flush immediately followed by a Finish is redundant: a Finish already implies a flush. In the end this does not really matter, as the blocking read call is basically a Finish by itself.

CL1.2, p121


If an application needs to wait for completion of a routine from the above list in a callback, please use the non-blocking form of the function, and assign a completion callback to it to do the remainder of your work. Note that when a callback (or other code) enqueues commands to a command-queue, the commands are not required to begin execution until the queue is flushed. In standard usage, blocking enqueue calls serve this role by implicitly flushing the queue. Since blocking calls are not permitted in callbacks, those callbacks that enqueue commands on a command queue should either call clFlush on the queue before returning or arrange for clFlush to be called later on another thread.



CL1.2 p195


clFlush (cl_command_queue command_queue) issues all previously queued OpenCL commands in command_queue to the device associated with command_queue. clFlush only guarantees that all queued commands to command_queue will eventually be submitted to the appropriate device. There is no guarantee that they will be complete after clFlush returns.


...


Any blocking commands queued in a command-queue and clReleaseCommandQueue perform an implicit flush of the command-queue. These blocking commands are clEnqueueReadBuffer, clEnqueueReadBufferRect, clEnqueueReadImage, with blocking_read set to CL_TRUE; clEnqueueWriteBuffer, clEnqueueWriteBufferRect, clEnqueueWriteImage with blocking_write set to CL_TRUE; clEnqueueMapBuffer, clEnqueueMapImage with blocking_map set to CL_TRUE; or clWaitForEvents.


...


clFinish (cl_command_queue command_queue) blocks until all previously queued OpenCL commands in command_queue are issued to the associated device and have completed. clFinish does not return until all previously queued commands in command_queue have been processed and completed. clFinish is also a synchronization point.



You're creating a sequential constraint across different queues, thereby causing them to work serially.
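For reference, the non-blocking alternative the quoted spec passage hints at would look roughly like this (a sketch only: `queue`, `output_buffer`, and `output_data` are the names from the snippet above, error checking is omitted, and the callback body is hypothetical):

```c
/* Sketch: a non-blocking read plus completion callback instead of a
   blocking read. Assumes queue, output_buffer and output_data exist
   as in the snippet above; error checking omitted. */
void CL_CALLBACK read_done(cl_event ev, cl_int status, void *user_data)
{
    /* output_data is valid here; do the remainder of the work,
       but do NOT make blocking CL calls inside a callback. */
}

/* ... later, in place of the blocking clEnqueueReadBuffer: */
cl_event ev;
clEnqueueReadBuffer(queue, output_buffer, CL_FALSE, 0,
                    sizeof(output_data), output_data, 0, NULL, &ev);
clSetEventCallback(ev, CL_COMPLETE, read_done, NULL);
clFlush(queue); /* callbacks don't imply a flush; submit explicitly */
```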


Hello maxdz8,

thanks for your reply.

I read your message and I know about the blocking nature of these calls. I don't understand why any of these calls should block another host program from enqueueing commands in the queue of the second GPU.

Host program no. 1 only handles GPU no. 1, and host program no. 2 only handles GPU no. 2.

If I reduce my host program to:

...

/* Create a command queue */
queue = clCreateCommandQueue(context, device, 0, &err);
if (err < 0) {
    perror("Couldn't create a command queue");
    client_cleanup(EXIT_FAILURE);
}

/* Create a kernel */
kernel = clCreateKernel(program, KERNEL_FUNC, &err);
if (err < 0) {
    perror("Couldn't create a kernel");
    client_cleanup(EXIT_FAILURE);
}

/* Set kernel arguments */
err = clSetKernelArg(kernel, 0, sizeof(input_buffer), &input_buffer);
err |= clSetKernelArg(kernel, 1, sizeof(output_buffer), &output_buffer);
if (err < 0) {
    perror("Couldn't set a kernel argument");
    client_cleanup(EXIT_FAILURE);
}

debug("enqueueing kernel\n");

/* Enqueue kernel */
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                             &local_size, 0, NULL, NULL);
if (err < 0) {
    perror("Couldn't enqueue the kernel");
    client_cleanup(EXIT_FAILURE);
}

sleep(60);

/* Deallocate resources */
clReleaseKernel(kernel);
clReleaseMemObject(output_buffer);
clReleaseMemObject(input_buffer);
clReleaseCommandQueue(queue);
clReleaseProgram(program);
clReleaseContext(context);

client_cleanup(EXIT_SUCCESS);

-----------------------------------------------------------------

Now I should be able to call a clXXX function (for example clGetPlatformIDs()) from host program no. 2.

But every clXXX call from host program no. 2 blocks until the kernel from host program no. 1 has finished.

At the moment I don't really know where the problem could be.

greets


Try running on the second GPU first. When a GPU is busy with kernel execution, it can't update the display.

The proper way to handle multiple GPUs from a single thread is to enqueue kernels on all GPUs, then call clFlush() on each queue, and lastly call some blocking sync function like clFinish() or clWaitForEvents(). But you are running two instances of the program, so they should run in parallel anyway. There is a multi-GPU example in the SDK; try running it and see how it behaves.
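That enqueue, then flush, then sync ordering could be sketched like this (a minimal sketch under assumptions: `ctx`, `dev[]`, `kernel[]`, and the work sizes are presumed to be set up already; error checking omitted):

```c
/* Sketch: driving two GPUs from one thread without serializing them.
   ctx, dev[0], dev[1], kernel[0], kernel[1], global_size and
   local_size are assumed to be set up already. */
cl_int err;
cl_command_queue q[2];
for (int i = 0; i < 2; i++)
    q[i] = clCreateCommandQueue(ctx, dev[i], 0, &err);

/* 1. Enqueue work on ALL queues before any blocking call. */
for (int i = 0; i < 2; i++)
    clEnqueueNDRangeKernel(q[i], kernel[i], 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);

/* 2. Flush each queue so both devices start working now. */
for (int i = 0; i < 2; i++)
    clFlush(q[i]);

/* 3. Only now block; the GPUs have been running concurrently. */
for (int i = 0; i < 2; i++)
    clFinish(q[i]);
```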

w00
Adept I

Hello,

I found a solution to my problem.

The host program now processes chunks of data in a loop.

After every iteration it waits a short time to sync with the other host programs running in parallel on the single-core processor.

This way I can use both cards at the same time. Furthermore, I can now interrupt the host program and the kernel.

Here is a link to the host program.

I set these constants to get the best performance on my hardware:

#define PACKET_COUNT    8192

#define GROUP_SIZE    32

#define NUM_PER_THREAD    100000UL /* 64 bit for me */

#define CHUNK_SIZE    (PACKET_COUNT * NUM_PER_THREAD)

#define WAIT_SEC    0

#define WAIT_NSEC    100000000

Execution example:

./main pci-e:02:00.0 '1;819201337'

begin: 1

end: 819201337

getocldev(): opencl capable devices found: 2

ocldevid(): preparing for search

ocldevid(): cleaning up device list

local_size: 32

global_size: 8192

chunk_count: 2

NUM_PER_THREAD: 100000

CHUNK_SIZE: 819200000

current_chunk: 0

current_chunk.first: 1

current_chunk.last: 819200000

packet_size: 100000

last_packet: 100000

last_packet_begin: 819100001

last_packet_end: 819200000

enqueueing kernel

[1,100000]: success

current_chunk: 1

current_chunk.first: 819200001

current_chunk.last: 819201337

packet_size: 0

last_packet: 1337

packet_size = 0

filling up 8191 packets with dummy data

last_packet_begin: 819200001

last_packet_end: 819201337

enqueueing kernel

greets