Adept II

OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).

Pavel Bogdanov, Institute of System Research Russian Academy of Sciences (NIISI),


Heterogeneous computing is becoming more and more popular. In November 2011 three of the top five supercomputers had a hybrid architecture. In the latest version of this list (June 2012) they were pushed out by computers with the new IBM BlueGene/Q architecture, built with standard multicore processors. But according to experts, it is impossible to construct an exaflop-class supercomputer using only multicore CPUs, so such machines may well have a hybrid architecture.

Modern hybrid machines are built with massively parallel accelerators such as graphics cards (GPGPU), Intel MIC and others. How can these devices be programmed? At the moment there are two stable programming models: CUDA and OpenCL. CUDA is a proprietary NVidia product and can be used only with NVidia hardware. OpenCL is an open model supported by a number of hardware vendors, including AMD, Intel, IBM and NVidia, and code written using this model is highly portable between these devices.

Now the next question: how can these powerful massively parallel devices be used to accelerate existing MPI/OpenMP codes? The simple approach, using numerical libraries optimized for these devices, will not work: the amount of changes to the existing code would be large, and it is as difficult as rewriting the code entirely.

One possible way out of this problem is to create a special infrastructure (possibly made as a system-level service), which would on the one hand control the devices and regulate access to them, and on the other provide a comfortable API for the developer. The main concept of such an infrastructure is to minimize the changes that must be made to an existing MPI/OpenMP code so that it works with the OpenCL model.

Our work has two main purposes: to create such an infrastructure for heterogeneous computing and to accumulate numerical libraries with standard interfaces written with it. The priorities are: the best performance, scalability to all devices in a hybrid machine, easy coding and automatic transfer overlap when possible.

Below we describe the first version of the infrastructure and three model methods of linear algebra: DGEMM (general dense matrix multiplication), DTRSM (a solver for triangular systems of linear equations) and DGETRF (LU decomposition).



We introduce two definitions: register and instruction.

A register is an arbitrary linear region in device global memory. The global memory is thus divided into several registers; their number and size are bounded only by the total amount of memory and some OpenCL restrictions (such as the maximum allocation size).

An instruction is an arbitrary compute kernel on a device. The whole device can be interpreted as a virtual machine which runs instructions on registers.

The devices are operated by a special scheduler. The scheduler initializes the available resources and gives the developer the following abilities: to control the registers, to transfer data between host and devices, and to launch compute kernels on devices. These abilities are used by means of control commands. Commands and the dependencies between them form a dependency graph, which is a program for the scheduler.

At the moment, the scheduler works statically: we first create the whole program, then execute it.


There are three types of commands: data transfer, kernel execution and auxiliary commands.

The LD command reads a rectangular region from host memory and sends it into a register. At the moment, it copies the lines sequentially into a small buffer and sends that buffer to the device via an OpenCL call. If the device is connected to a PCI-Express slot, memcpy and transfer are performed in parallel. The LD command takes a register and the region parameters as arguments.

The ST command stores data from a register into a rectangular region of host memory. The algorithm and arguments are the same as for the LD command.
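
The row-by-row staging that LD and ST perform can be sketched as follows. This is only an illustration in Python, assuming a row-major host array stored as a flat list; the names pack_rect and unpack_rect and the ld (leading dimension) parameter are hypothetical and are not part of the scheduler's API.

```python
def pack_rect(host, ld, row0, col0, nrows, ncols):
    """Copy a rectangle out of a flat row-major array into a linear staging list.

    ld is the row length of the full host array (an assumed layout).
    """
    staging = []
    for r in range(row0, row0 + nrows):
        start = r * ld + col0
        staging.extend(host[start:start + ncols])  # one "memcpy" per line
    return staging


def unpack_rect(staging, host, ld, row0, col0, nrows, ncols):
    """Inverse operation: scatter a linear staging list back into the rectangle."""
    for i in range(nrows):
        start = (row0 + i) * ld + col0
        host[start:start + ncols] = staging[i * ncols:(i + 1) * ncols]


host = list(range(12))                  # a 3 x 4 matrix, row-major
tile = pack_rect(host, 4, 1, 1, 2, 2)   # rows 1..2, columns 1..2
restored = [0] * 12
unpack_rect(tile, restored, 4, 1, 1, 2, 2)
```

In the real scheduler the staging buffer would be handed to an OpenCL transfer call; here it is just a Python list.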

The EXEC command (an instruction in our terminology) executes an arbitrary compute kernel on a device. It takes registers and scalar control parameters as arguments.

The auxiliary commands are MARK (marker) and BAR (barrier), used for synchronization. MARK is an empty command from which dependencies can be created, and BAR simply waits for all commands it depends on.


The scheduler consists of three queues: a queue of LD commands, a queue of ST commands and a queue of EXEC commands. Auxiliary commands can be included in any queue. All three queues work in parallel, each in a separate thread; thread synchronization is performed with standard system-level objects and calls. The EXEC queue works strictly in order: the next command is not launched until the previous one has finished. The LD and ST queues work asynchronously: commands are launched in the order they were added, but the next command may start before the previous one finishes.


Tests were run at the Institute of System Research, Russian Academy of Sciences (NIISI), on the following test stand:

CPU                  2 x AMD Opteron 6176
Memory               64 GB DDR3
GPU                  3 x AMD Radeon 6990
Operating system     Ubuntu Linux, Windows 7
Video driver         AMD Catalyst 12.3



Calculate C = alpha*A*B + beta*C, where A, B and C are general dense matrices of arbitrary size.


We use a standard blocking algorithm with two levels of blocking. In host memory the matrices are cut into big blocks (in the regular case 5184x5184 elements in double precision, which takes approx. 205 MBytes of memory); these are transferred to a device, where they are cut into smaller blocks. One thread computes a 6x6 block of the result matrix. The blocks on the host are processed sequentially.

When there are several devices, we divide the rows of the resulting matrix equally between them (each device multiplies its part of the rows of A by the whole matrix B), so the algorithm is data-parallel.

In the irregular case we use padding: we fill the matrices with additional zeros to get appropriate sizes.

Pics clBLAS A4 - 3.jpg

Scheduler program

Algorithm: first we divide matrix C into blocks, then write programs that calculate each block, and then join them into one program. If a block of C fits into one register, the scheduler program that computes it looks like this:

LD     C → rc
FOR (k = 0; k < nblocks; k++)
    LD     Ak → ra
    LD     Bk → rb
    EXEC   pad ra → rpa
    EXEC   pad rb → rpb
    EXEC   dgemm rpa, rpb, rc
END FOR
ST     rc → C

Padding depends on the LDs, and gemm depends on padding. Dependencies on register access are set automatically, and the EXEC queue is executed sequentially, so no additional dependencies are needed.

If the OpenCL driver and the device support transfer overlap, the LD commands are hidden behind EXEC dgemm. To hide LD C and ST C we use double buffering.

So the algorithm requires five registers to work and six to use double buffering.
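
To make the idea of automatically set register dependencies more concrete, here is a small Python sketch. It is not the authors' scheduler: the Command class, the emit helper and the read/write bookkeeping are assumptions chosen for illustration. It builds the program for one block of C and derives a dependency for every pair of commands that touch the same register when at least one of them writes it.

```python
from dataclasses import dataclass, field


@dataclass
class Command:
    kind: str                      # "LD", "ST" or "EXEC"
    name: str
    reads: tuple = ()
    writes: tuple = ()
    deps: list = field(default_factory=list)


def build_dgemm_program(nblocks):
    """Static program for one C block, mirroring the listing above."""
    prog = []

    def emit(kind, name, reads=(), writes=()):
        prog.append(Command(kind, name, tuple(reads), tuple(writes)))

    emit("LD", "C", writes=["rc"])
    for k in range(nblocks):
        emit("LD", f"A{k}", writes=["ra"])
        emit("LD", f"B{k}", writes=["rb"])
        emit("EXEC", "pad", reads=["ra"], writes=["rpa"])
        emit("EXEC", "pad", reads=["rb"], writes=["rpb"])
        emit("EXEC", "dgemm", reads=["rpa", "rpb", "rc"], writes=["rc"])
    emit("ST", "C", reads=["rc"])
    return prog


def add_register_dependencies(prog):
    """Fill cmd.deps with indices of earlier commands that must finish first."""
    last_writer = {}   # register -> index of the last command that wrote it
    readers = {}       # register -> readers since that last write
    for i, cmd in enumerate(prog):
        deps = set()
        for reg in cmd.reads:                  # read after write
            if reg in last_writer:
                deps.add(last_writer[reg])
        for reg in cmd.writes:
            if reg in last_writer:             # write after write
                deps.add(last_writer[reg])
            deps.update(readers.get(reg, ()))  # write after read
        cmd.deps = sorted(deps)
        for reg in cmd.reads:
            readers.setdefault(reg, []).append(i)
        for reg in cmd.writes:
            last_writer[reg] = i
            readers[reg] = []
    return prog


program = add_register_dependencies(build_dgemm_program(2))
```

With two blocks, the second LD of A (command 6) ends up depending on the first LD of A and on the pad that read ra, which is why the next load can be issued early but cannot overwrite a register that is still in use.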

Pics clBLAS A4 - 2.jpg

Computational kernel

One thread computes a 6x6 block of C (it multiplies six rows by six columns): in a loop we load a 6x2 block from A and a 2x6 block from B into registers and multiply them. LDS is not used. Performance: per iteration we have to load 24 doubles and perform 72 MADs on them, so MADs are at most 72 of 96 operations and the theoretical peak is 75% of the card peak. We achieve 63% of the card peak on the kernel.
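
The 75% figure can be reproduced with a simple count, assuming (as a simplification that is not stated explicitly in the post) that every load of a double costs one issue slot, the same as one MAD. The function name mad_fraction is hypothetical.

```python
def mad_fraction(tile_rows, tile_cols, depth):
    """Share of issue slots spent on MADs for one inner-loop iteration.

    Assumes one slot per loaded double and one slot per MAD.
    """
    loads = (tile_rows + tile_cols) * depth   # 6x2 from A plus 2x6 from B
    mads = tile_rows * tile_cols * depth      # rank-`depth` update of the tile
    return mads / (mads + loads)


peak_6x6 = mad_fraction(6, 6, 2)   # 72 MADs, 24 loads
peak_4x4 = mad_fraction(4, 4, 2)   # 32 MADs, 16 loads
```

Under this model the 6x6 tile gives 0.75, while a 4x4 tile would give about 0.67, which suggests why a larger per-thread tile is attractive as long as it fits in registers.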

Pics clBLAS A4 - 1.jpg


The question is how to split matrix C into blocks. The goal was to get as smooth a performance curve as possible. Assume that the matrix has m rows and a regular block has r rows. We make m/r - 1 rows of blocks (each containing r rows), and divide the last (r + m mod r) rows into two block rows of equal size. The columns are split in the same way. This partition gives a stable dependence of performance on size.
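
A minimal Python sketch of this partition rule follows. The function name split_rows is hypothetical, and three details are assumptions rather than facts from the post: a matrix no taller than one block is kept as a single block, exact multiples of r are cut into regular blocks only, and an odd tail is split into halves that differ by one row.

```python
def split_rows(m, r):
    """Return the heights of the block rows for a matrix with m rows.

    m // r - 1 regular blocks of r rows, then the last r + m % r rows
    split into two (nearly) equal block rows.
    """
    if m <= r:
        return [m]                 # assumption: a small matrix is one block
    if m % r == 0:
        return [r] * (m // r)      # assumption: exact multiples stay regular
    regular = m // r - 1
    tail = r + m % r
    first = (tail + 1) // 2
    return [r] * regular + [first, tail - first]


blocks_a = split_rows(12096, 6912)
blocks_b = split_rows(48384, 6912)
blocks_c = split_rows(20000, 6912)
```

For non-multiples the two tail blocks are always between r/2 and r rows high, so there is never a very thin last block, which is presumably what keeps the performance curve smooth.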


Pics clBLAS A4 - 4.jpg



DTRSM solves one of the triangular systems op(A)*X = B or X*op(A) = B. The result X is placed into B.


We use the parallel form of the back substitution part of Gaussian elimination, applied to the blocked matrix. The algorithm is the following:

FOR (k = 0; k < nblocks; k++)
    Invert Akk
    Calculate Xk. For example, if DTRSM parameters are LUNN, Xk = Akk^-1 * Bk
    if (k != nblocks-1)
        Update trailing matrix: Bkk -= Ak*Xk
END FOR


So we see that DTRSM performance is asymptotically equal to DGEMM performance.

The triangular matrix Akk is inverted blockwise. First we use a standard algorithm to invert the diagonal 32x32 blocks in place, and then apply the same method to invert the block matrix.

When we have several cards, we split matrix B equally between the cards (data-parallel). All cards do the same matrix inversion, but it is not resource-consuming, so it does not affect the resulting performance.
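
The loop above (invert the diagonal block, multiply, update the trailing part) can be tried out on small matrices with the following pure-Python sketch. It is an illustration only: it handles the left-side, lower-triangular case, where the blocks are naturally visited in increasing order (the LUNN case from the post is upper-triangular and would go through the blocks in reverse), and all function names are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def invert_lower(T):
    """Invert a small lower-triangular matrix by forward substitution."""
    n = len(T)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):                       # solve T * x = e_j
        for i in range(j, n):
            s = (1.0 if i == j else 0.0) - sum(T[i][p] * inv[p][j] for p in range(j, i))
            inv[i][j] = s / T[i][i]
    return inv


def block(M, r0, r1, c0, c1):
    return [row[c0:c1] for row in M[r0:r1]]


def blocked_trsm_lower(A, B, bs):
    """Solve A * X = B for lower-triangular A with block size bs; returns X."""
    n, ncols = len(A), len(B[0])
    X = [row[:] for row in B]
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        inv_kk = invert_lower(block(A, k0, k1, k0, k1))      # "Invert Akk"
        Xk = matmul(inv_kk, block(X, k0, k1, 0, ncols))      # Xk = Akk^-1 * Bk
        X[k0:k1] = Xk
        if k1 < n:                                           # update trailing rows
            upd = matmul(block(A, k1, n, k0, k1), Xk)
            for i in range(k1, n):
                for j in range(ncols):
                    X[i][j] -= upd[i - k1][j]
    return X


A = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, 5.0, 6.0]]
X_true = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
X = blocked_trsm_lower(A, matmul(A, X_true), bs=2)
```

Both the multiplication by the inverted block and the trailing update are plain matrix products, which is why the method approaches DGEMM speed on large matrices.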


Scheduler program

Akk is inverted on the card, so the original matrix is not changed. The answer Xkl is written in place of Bkl. In the first loop we get the inverted block Akk, in the second a part of the answer Xk.

LD     Akk → rpakk
EXEC   inverse_diag_blocks rpakk
FOR (l = 0; l < nblocks; l++)
    EXEC   dgemm_step1     rpakk, l
    EXEC   dgemm_step2     rpakk, l
END FOR
FOR (l = 0; l < npages; l++)
    LD     Bl → rbl
    EXEC   pad rbl → rpbl
    EXEC   dgemm rpakk, rpbl, rc
    ST     rc → Bl
END FOR


Additional dependencies are not necessary here either: all required dependencies are set automatically.

If Akk is not the last block, we have to update the trailing matrix via DGEMM.

Pics clBLAS A4 - 5.jpg

Computational kernel

Diagonal 32x32 blocks are inverted by a kernel with a standard algorithm (netlib). We use LDS and a «one thread, one row of 32 elements» principle. The other kernels are our dense dgemm.

For triangular matrix multiplication we use dgemm kernels optimized for this matrix structure. Asymptotically it does not matter which kernel is used, but the optimized kernels achieve better performance on smaller matrices.


Pics clBLAS A4 - 7.jpg



DGETRF performs an LU decomposition of matrix A with partial pivoting: A = PLU.


As a model task we implemented the simplest case of LU decomposition: decomposition without pivoting. Our goal was to investigate whether our approach can be applied to such problems. This method can be used only with a narrow class of matrices, but its theoretical performance is almost the same as that of the general method.

The algorithm can be written as a sequence of standard calls:

FOR (k = 0; k < nblocks; k++)
    CALL   CPU_DGETRF (Akk)
    IF (k != nblocks - 1)
        CALL   GPU_DTRSM(L, L, N, U, Akk, A')
        CALL   GPU_DTRSM(R, U, N, N, Akk, A1)
        CALL   GPU_DGEMM(A', A1, Ak)
END FOR



GPU calls can run in parallel on several devices, while the CPU code works sequentially. The overall loss from this use of the CPU grows with the number of GPUs. Asymptotically, the performance of the call is equal to DGEMM on stripes, but the CPU code makes it converge slowly.

There is a way to hide the CPU code behind GPU calculation, but it makes the algorithm more complex and is not required for our goal.
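
For readers who want to trace the four calls on a small example, here is a pure-Python sketch of blocked LU without pivoting. It runs everything on the CPU and works in place on a list of lists; the function names are hypothetical, and the mapping of each loop nest to CPU_DGETRF, the two GPU_DTRSM calls and GPU_DGEMM is our reading of the listing above, not the authors' code.

```python
def lu_diag_block(A, r0, r1):
    """Unblocked LU without pivoting of A[r0:r1, r0:r1], in place (CPU_DGETRF)."""
    for j in range(r0, r1):
        for i in range(j + 1, r1):
            A[i][j] /= A[j][j]
            for c in range(j + 1, r1):
                A[i][c] -= A[i][j] * A[j][c]


def blocked_lu_nopivot(A, bs):
    """Overwrite A with its L (unit diagonal, below) and U (on and above) factors."""
    n = len(A)
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        lu_diag_block(A, k0, k1)
        if k1 == n:
            break
        # Row panel: solve L_kk * U_row = A_row (left, lower, unit diagonal).
        for j in range(k1, n):
            for i in range(k0, k1):
                for p in range(k0, i):
                    A[i][j] -= A[i][p] * A[p][j]
        # Column panel: solve L_col * U_kk = A_col (right, upper, non-unit).
        for i in range(k1, n):
            for j in range(k0, k1):
                for p in range(k0, j):
                    A[i][j] -= A[i][p] * A[p][j]
                A[i][j] /= A[j][j]
        # Trailing update: A_trail -= L_col * U_row (the DGEMM step).
        for i in range(k1, n):
            for j in range(k1, n):
                A[i][j] -= sum(A[i][p] * A[p][j] for p in range(k0, k1))
    return A


# A = L * U with L = [[1,0,0],[2,1,0],[3,4,1]] and U = [[4,3,2],[0,1,5],[0,0,6]]
M = [[4.0, 3.0, 2.0], [8.0, 7.0, 9.0], [12.0, 13.0, 32.0]]
blocked_lu_nopivot(M, bs=2)
expected = [[4.0, 3.0, 2.0], [2.0, 1.0, 5.0], [3.0, 4.0, 6.0]]
```

Only the small diagonal factorization is inherently sequential; the two panel solves and the trailing update are the parts that the post offloads to the GPUs.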

Pics clBLAS A4 - 8.jpg

Algorithm with pivoting

An algorithm with row pivoting has two major differences from the simple method: DGETRF_CPU is called not on a square region but on a whole block column, and we have to swap rows according to the pivot array. Asymptotically, these operations are much cheaper than DGEMM, so they can be hidden behind it.


Pics clBLAS A4 - 9.jpg


In short, we can draw several conclusions from our work: firstly, massively parallel accelerators can be applied effectively to such tasks in mathematical modelling; secondly, the infrastructure meets its goals and can be used to program hybrid nodes.

In the future we plan to develop this infrastructure on the one hand, and to accumulate numerical libraries written with it on the other. We are now working on operations with sparse matrices: SpMV and numerical methods using it.

We send our best regards to AMD, and a few requests:

- Please allow us to allocate 100% of GPU space under linux!

- Please make OpenCL driver work with multi-GPU properly with any number of GPUs (at least, 8 or 16)!

- Please make WriteBufferRect work with transfer overlap!

- Please provide full and correct support of OpenCL 1.2!

From Russia with love,

Pavel, Anton.

Message was edited by: anton efremov

43 Replies

Re: OpenCL programming infrastructure for heterogeneous computing and its applications: multi-GPU DGEMM, DTRSM, DGETRF

Thank you for your feedback. I've passed it on to the right people at AMD.



Adept II

Re: OpenCL programming infrastructure for heterogeneous computing and its applications: multi-GPU DGEMM, DTRSM, DGETRF

Hi everybody, a little update. We managed to launch the codes on Radeon 7970 (at the moment, we have three of them), and here are the first results, without additional optimizations.

Pics clBLAS A4 - 10.jpg

Pics clBLAS A4 - 11(1).jpg

Pics clBLAS A4 - 12(1).jpg

A few global remarks about performance:

1. To cover all sizes and transposition cases, we use padding procedures, and this costs ~1.5% of performance.

2. It seems that transfer overlap is not free of charge. We lose ~3.5% of kernel performance when using transfer overlap compared to sequential code. Synthetic tests show the same performance loss.

3. When we launch kernels back to back, the average kernel performance is significantly lower than in single launches. We are trying to work out why this happens. An expensive NDRange call? Or does the card just need to relax for some time after 0.6 s of hard work?

4. When using several devices, the overhead (transferring the first and last blocks via PCI-Express, which cannot be hidden) increases in proportion to the number of devices.

There is a simple recipe for increasing overall performance: increase the gemm kernel performance. A good approach can be found here:

About scaling. Three 7970s give us a 2.77 TFlop/s peak in double precision, which is close to five 6970 cards, and the reasons why we don't see good scaling on three devices are the same as there: we don't have an appropriate amount of work for such computational power (plus multi-device overhead).

From Russia with love,

Pavel, Anton.

Adept II

Re: OpenCL programming infrastructure for heterogeneous computing and its applications: multi-GPU DGEMM, DTRSM, DGETRF

Hi everybody, we're here again with greetings from windy and cold Mother Russia!

Agenda: some fresh results with 8 GPUs and a humble request for proper data transfer routines in AMD OpenCL drivers.



We see that scaling is good up to 7 devices. Now a few words about the curves (why they are not as smooth as they should be) and scaling: we can't achieve an acceptable data transfer speed. We face serious problems with transferring data from host memory (allocated in DDR via operators like malloc, new, etc.) to global memory on a device. Now some details.

Data transfer.


// Pseudocode: time one 200 MB host-to-device transfer.
double *src = new double [ 200 MB ];
cl_mem buf = clCreateBuffer( flags, 200 MB );
double t1 = gettime();
[data transfer]
double time = gettime() - t1;
speed = 200MB / time  = ~1.5 GB/sec

[data transfer] can be any of scenarios 1-5 from the Programming Guide (we also tried several "custom" scenarios, such as splitting the buffer into smaller pieces), all with the same results.

Once again, we tried APP SDK 2.6, APP SDK 2.7 and AMD Catalyst from 12.1 to 12.9, in all possible configurations, over the last year.

More than that, the AMD SDK BufferBandwidth example shows the same results. For example, assume we map a buffer, memcpy and unmap. We see the following picture:

memcpy        2.5 GB/sec

map/unmap    5.7 GB/sec

Now, average speed = size / time = size / (memcpytime + maptime) = size / ( size/memcpyspeed + size/mapspeed) = 1 / (1/memcpyspeed + 1/mapspeed) = mapspeed*memcpyspeed / (mapspeed + memcpyspeed)

Let's apply it: 2.5*5.7 / (2.5 + 5.7) = 1.7 GB/sec on average.
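
The same formula generalizes to any number of stages that run one after another on the same data: the effective speed is the reciprocal of the sum of reciprocals. A short Python check, with a hypothetical helper name:

```python
def serial_bandwidth(*stage_speeds):
    """Effective bandwidth when the same bytes pass through stages sequentially."""
    return 1.0 / sum(1.0 / s for s in stage_speeds)


effective = serial_bandwidth(2.5, 5.7)   # memcpy at 2.5 GB/s, then map/unmap at 5.7 GB/s
```

For the two measured speeds this gives roughly 1.74 GB/s, which the post rounds to 1.7. The result is always below the slowest stage, so a slow host memcpy caps the whole path no matter how fast the PCIe part is.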

Transfer Overlap.

Transfer overlap (kernel execution and PCIe data transfer at the same time) can be achieved in all drivers only with the clEnqueueWriteBuffer OpenCL API call. So, if we want to transfer a rect, we have to merge it into a linear region (lots of memcpy's) and then transfer it via WriteBuffer. In this case, the average speed is significantly lower than the already slow linear transfer.

If we use other calls (such as clEnqueueReadBuffer), the total performance of transfer and kernel execution is equal to that of sequential launches, so there is no transfer overlap effect.

The transfer overlap example from the SDK does three things: memcpy, PCIe transfer and kernel execution. We tried all launch cases, and we see only memcpy & kernel execution working in parallel. PCIe transfer and kernel execution run sequentially.


We are grateful for all advice, but reading the Programming Guide and installing the latest drivers is not the advice we expected to get.

Once again, there are two requests:

1) FULL PCIe bandwidth on standard OpenCL calls, or, at least, on  scenarios from Programming Guide.

// Pseudocode: the same measurement with the speed we would like to see.
double *src = new double [ 200 MB ];
cl_mem buf = clCreateBuffer( flags, 200 MB );
double t1 = gettime();
clEnqueueWriteBuffer( flags, 200MB );
double time = gettime() - t1;
speed = 200MB / time ~ 5.7 GB/sec     (!!)

2) Transfer overlap working properly(!) with clEnqueueWriteBuffer, clEnqueueReadBuffer, clEnqueueWriteBufferRect, clEnqueueReadBufferRect and/or clEnqueueMapBuffer & clEnqueueUnmapMemObject at full bandwidth. "Properly" means "hide the whole PCIe transfer behind kernel execution".

The ideal situation is the following: the GPU executes kernels one after another without gaps, and data transfer is completely hidden behind this execution. We are working hard to provide such a system, but AMD drivers at the moment are not good enough to support it.

And, as the proper bug report should end with machine configuration, here is ours:

2 x AMD Opteron 6176 SE

SuperMicro MNL-H8DG6-F

64 Gb DDR3

8 x AMD Radeon 7970

Ubuntu 12.04 LTS

From Russia with love,

Pavel, Anton.

Adept II

Re: OpenCL 8 GPU DGEMM. Programming infrastructure for heterogeneous computing and its applications.

Hello from Russia!

We continue our uphill battle with AMD OpenCL drivers! We struggle to prove that AMD GPU chips can be used in HPC, and there is some light at the end of the tunnel.

Tonight in the show:

- 4,4 TFlops double precision on 8 GPU's: new dgemm results, blocksize and CPU influence

- clEnqueueWriteBuffer to GPU performance in context with CPU device

- some thoughts about new Tesla K20


Let's start with the new dgemm results. There are some points I'd like to mention. In short, we use the following algorithm. We cut the matrices in host memory into squares (say, 5760 x 5760), memcpy them into linear pieces, send them to the GPU via clEnqueueWriteBuffer, pad them with zeros to the proper size, do dgemm, and read the results via MapBuffer and memcpy's from the linear piece on the device to the square on the host.

Everything could be much simpler if clEnqueueWriteBufferRect and clEnqueueReadBufferRect worked asynchronously with kernels, even at low speed...

So, the first thing is that if we put the CPU (AMD Opteron 6176 SE) into performance mode ("performance" > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor), we get significantly faster results. This is a slightly strange effect, because the CPU does only memcpy's and some AMD OpenCL driver calls.

The second thing is about block size. We found out that kernels run 2x slower at size 6144x6144. Why this is bad: imagine we do 48384 (7x7 blocks of 6912x6912) on 4 devices. One device gets 48384 / 4 = 12096 rows of the matrix. The program cuts them into two block rows of 6048, which are then padded to a multiple of 192 (to fit the workgroup size), resulting in 6144. And there are lots of other examples where this size occurs.
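
The arithmetic of this example is easy to check in a few lines of Python. The helper name pad_to_multiple is hypothetical; the numbers are the ones from the paragraph above.

```python
def pad_to_multiple(n, multiple):
    """Round n up to the nearest multiple (ceiling division, then scale back)."""
    return -(-n // multiple) * multiple


rows_per_device = 48384 // 4            # rows of C handled by one of 4 devices
block_rows = rows_per_device // 2       # the device's rows cut into two block rows
padded_rows = pad_to_multiple(block_rows, 192)
regular_padded = pad_to_multiple(6912, 192)
```

A regular 6912-row block is already a multiple of 192 and stays unpadded, whereas the 6048-row block lands exactly on the slow 6144 size.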




Once again (see previous post), the average PCIe speed is low, and to reduce this effect we have to use a bigger block size (n^3 operations vs n^2 data transfer in dgemm), limited only by the 3 GB of device memory. We see that enqueueWriteBuffer calls are overlapped with kernels, but sometimes data is not transferred in time. To illustrate this effect, we provide some pictures from the profiler, for one device (all good) and 8 devices (transfer is not fast enough to feed 8 GPUs); the latter takes 2 slides, as it didn't fit on one screen. Our simple algorithm doesn't work well on 8 devices, but we see that it can be tuned and optimized.


We accidentally found out why the transfers of 200 MB were so slow (see previous post). It is the impact of a CPU device in the context. If the context contains 8 GPUs, everything works perfectly. If the context contains one GPU and one CPU, we get a very strange transfer rate loss when transferring a 192-255 MB buffer to the GPU.

A few words about the new NVidia chip. This week the specifications of the new K20 compute processor were released. What do we see: 1.2 TFlop/s double precision with 200 GB/s memory bandwidth. In fact, the year-old AMD Radeon 7970 GHz Edition has the same flops and better bandwidth. In the codes we work on (CFD, quantum chemistry, cryptography, number theory, etc.) performance depends entirely on memory subsystem speed. So AMD is one generation ahead of NVidia in flops and bandwidth.

In conclusion. At the moment, we do dgemm just for fun in our spare time (to test the framework we apply in real tasks) and for general GPGPU research purposes. The algorithms are very simple, but they demonstrate that these devices can show brilliant performance if properly supported by good software (drivers etc.).

Make good drivers, AMD Team, and the HPC sector will be yours!

From Russia with love,

Pavel, Anton.

Adept II

Re: OpenCL 8 GPU DGEMM (4,4 TFlop/s double precision). Programming infrastructure for heterogeneous computing and its applications.

We send our greetings to everyone reading this topic.

Contents for today:

- Catalyst GPU drivers comparison: transfer speed & overlap, max memalloc and command queues

- Facts about cl_event management

- Overlap and multiple command queues

This post was inspired by the following 8 AMD FirePro S10000s machine:

That topic is about a 16-GPU cluster and the 8.2 TFlop/s DGEMM launched on it. To achieve such a result, one has to solve two major problems: hardware and software. First of all, the hardware should meet rather strict conditions: it should be able to boot an operating system with 16 correctly initialized GPU devices, and it should provide enough PCIe speed to feed the GPUs fast enough to hide all transfers behind calculations. Secondly, the AMD driver (the weakest link in the chain) should be good enough to use all this power and utilize the PCIe bus effectively (speed + overlap). - on this photo you can see the specification of the host side: 2 x Intel Xeon Processor 5500/5600 Series, which gives us 6 memory channels and ~64 GB/s theoretical peak host memory bandwidth. In reality, ~80% of this bandwidth can be achieved on linear read/write, which is ~51 GB/s. When transferring data to a device, temporary buffers are used, so we have to both read and write host memory => peak memcpy bandwidth is 51/2 = 25.5 GB/s. So, with 16 GPUs there is 25.5/16 = 1.59 GB/s of host bandwidth per card. This is an ideal and unreachable situation with linear-only reads/writes and peaceful GPUs (not fighting for host memory channels).


At the moment, the full transfer path (see below) from DDR cannot be faster than host memcpy, and if you transfer rectangles to several devices simultaneously, it is very difficult to reach even this speed.

On our system with 8 GPUs (2 x Opteron 6176 SE, 8 host memory channels, 80 GB/s peak bandwidth), the peak bandwidth is 5 GB/s per GPU. When launched simultaneously, the real bandwidth is ~1.6 GB/s per GPU. So we draw the following conclusion: when we do 16 parallel memcpy's, the peak bandwidth will be less than 1.59/2 = 800 MB/s; our tests confirm it.

The peak maxalloc on a device is 2.7 GB on all drivers. To get full overlap, double buffering is required, which comes to 450 MB per buffer. This gives a square block of 7680x7680, whose dgemm takes 1.723 s (on one of 16 GPUs, each giving 0.7*750 = 525 GFlop/s). The data transfer is 3*450 = 1350 MB (sometimes with an extra 450 MB store), which takes 1.685 s at 800 MB/s.

These numbers (1.68 & 1.72) are almost equal in the ideal scenario, when all involved OpenCL API calls give peak performance, overlap, etc. Dear developers (DEVELOPERS, DEVELOPERS, Steve Ballmer), draw your own conclusions.
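
The balance between compute time and transfer time can be recomputed with a few lines of Python. This is a rough sketch: it assumes the usual 2*n^3 flop count for DGEMM and treats 1 MB as 2^20 bytes, neither of which is spelled out in the post, and the function names are hypothetical. The results come out close to, but not exactly at, the 1.72 s and 1.68 s quoted above.

```python
def dgemm_seconds(n, gflops):
    """Time for an n x n x n DGEMM at a sustained rate, assuming 2*n^3 flops."""
    return 2.0 * n ** 3 / (gflops * 1e9)


def transfer_seconds(megabytes, mb_per_s):
    return megabytes / mb_per_s


n = 7680
block_mb = n * n * 8 / 2 ** 20                    # one double-precision block, in MB
compute = dgemm_seconds(n, 0.7 * 750)             # 70% of a 750 GFlop/s peak
transfer = transfer_seconds(3 * block_mb, 800)    # blocks of A, B and C at 800 MB/s
```

Because compute grows as n^3 and transfer as n^2, a larger block would tip the balance towards compute, but the 2.7 GB allocation limit with double buffering is what fixes n at 7680 here.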

We decided to clarify which drivers are capable of doing such hard work. We have a program written in good old C++ that runs some tests on an OpenCL device, and we have a Radeon 7970. We have an src array in host memory, 200 MB in size, allocated via malloc, possibly aligned (at these sizes we find no difference between aligned and unaligned memory). The task is to transfer it to the device as a linear region or as a rectangle (the rect is n x n elements in double precision, where n is the maximum possible n: n*n*sizeof(double) <= 200 MB), and to read the data from the device back to src. Rectangular scenarios were built in a similar way to the linear ones (the 1st uses EnqueueWriteBufferRect, the 2nd uses EnqueueCopyBufferRect, the 3rd and 4th use Maps and lots of memcpys to copy the rect as an array of rows). Speed was measured as the speed of the full path, including pinning, memcpy (where required), etc. Memcpy is single-threaded.

A few words about optimal device usage. The maximum performance is achieved in the following case: we load the first block of data, then the device launches compute kernels back to back without gaps, data transfers are totally hidden, and then we store the results of the last kernel. See the diagram below.


Therefore, data transfer with the 5th scenario is not good for optimal usage with huge transfers, because we spend too much kernel time on it.

So, here is the table.






What can we see? The 4th scenario passes verification with non-standard usage: we map the buffer on a device ONCE in the beginning and unmap it ONCE in the end. An illustration:

mapped_mem = clEnqueueMapBuffer(persistent_buffer)
for (i = 0; i < niters; i++) {
    memcpy (mapped_mem, src, 200Mb)
    // after this memcpy call data is somehow ready on a device...
}
clEnqueueUnmapMemObject(persistent_buffer, mapped_mem)

This is the simplest usage case, and it shows brilliant results (great speed and full overlap), but it has two problems: it is non-standard, and the amount of persistent memory is very small (the real amount we can use never exceeds 130 MB, even if it is possible to allocate more). If the memory is not AMD_PERSISTENT, verification fails in this case. Mapping and unmapping AMD_PERSISTENT memory on each iteration shows the same performance as the 3rd scenario.

Full overlap is achieved on 3rd scenario in both ways (on low speed) and on 1st scenario on linear regions on high speed.


The final test we'd like to discuss is the maximum number of available command queues. The AMD driver does not allow more than 50 command queues in total! Why this is important: when the device is busy with calculations, the AMD OpenCL driver manages cl_events in a strange way (see illustration). For example, if we write a small buffer and execute a huge compute kernel simultaneously, the cl_event associated with the write becomes CL_COMPLETE only when the kernel computation completes. So, if we put two transfers in one queue and wish to overlap them with a kernel, only one transfer will actually be overlapped. The second transfer will be done after the kernel finishes.

[diag. cl_event during kernel]

For example, in the toughest case of a big DGEMM C <- a*A*B + b*C, we have to do three loads (the next parts of matrices A, B, C) and one read (part of the resulting C). So, to achieve full overlap we have to use 5 command queues (4 transfers and one kernel execution at the same time), or merge parts of different host memory and transfer them as one piece, etc.

In the most general case, one has to use at least two queues per device for transfer overlap (or three if both reading and writing are done). With our DGEMM algorithm (5 queues per device), we use 5*8=40 queues for 8 devices.

If one uses 16 devices, at most three queues per device can be allocated, and it is really not obvious how to completely overlap several transfers...
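
The queue budget can be written down as a tiny Python check. The limit of 50 is the figure observed in the post; the constant and function names are hypothetical.

```python
QUEUE_LIMIT = 50   # total command queues the driver allowed in these tests


def queues_per_device(ndevices, limit=QUEUE_LIMIT):
    """Largest equal number of queues each device can get under the limit."""
    return limit // ndevices


def fits(ndevices, queues_each, limit=QUEUE_LIMIT):
    return ndevices * queues_each <= limit


eight_ok = fits(8, 5)               # 5 queues on each of 8 devices
sixteen_ok = fits(16, 5)            # the same scheme on 16 devices
per_device_16 = queues_per_device(16)
```

With 8 devices the five-queue scheme fits (40 of 50), but with 16 devices it would need 80 queues, leaving only three per device.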

As a conclusion, a small remark. We see that new drivers work worse and worse. A year ago, before the Radeon 7xxx cards were introduced, we had the same situation with drivers for the Radeon 6970 & 6990. And there is an opinion (as we say in Russia) that AMD possibly wants to split graphics and computation: Radeons for graphics only, FirePro for GPGPU. It's the dark side of the Force, and what we see at NVidia confirms that.

So, farewell and may the Force be with you.

From Russia with love,

Pavel, Anton.


Re: OpenCL 8 GPU DGEMM (4,4 TFlop/s double precision). Programming infrastructure for heterogeneous computing and its applications.

Try experimenting with the GPU_MAX_COMMAND_QUEUES environment variable. It is totally undocumented; I just dug it up from the AMD OpenCL runtime.


Re: OpenCL 8 GPU DGEMM (4,4 TFlop/s double precision). Programming infrastructure for heterogeneous computing and its applications.


Congratulations on the big driver tests you did; I'm sure it took days with all those reinstalls.

I also did a smaller driver test and found out that my favorite and most reliable choice is 11.12_win7 and 12.2_linux32 + CAL (which is über-deprecated now).

My test was much simpler but it also produced that overlapping weirdness.

One job is like this: map, write, unmap, enqueueKernel, map, read, unmap    

The reads/writes were very small: 10KB, kernel took 250 millisecs.

To keep the ALUs fed with jobs at all times I used two queues per GPU. Both of them did the above job, alternating. When one job was about to finish I launched the other. I did this update in a 20 ms timer function, so the CPU usage was around 0% all the time.

I've measured that with later drivers, when you use more queues per GPU, clEnqueue calls become blocking, and it seems they wait for the other queue.

And if I try to update 2 GPUs with 4 'twin' queues it gets worse: one queue halts the other one. Anyway, I have already given it up and returned to old CAL, but for your project that's not a good option.

I remember that with 12.4(?) I somehow managed to get 99%+ dual-GPU efficiency compared to single-GPU (with 0% CPU), but I cannot reproduce what voodoo magic I did then. 😕  That was better dual-GPU efficiency than the first CAL driver I'm using and trusting now. With my newer tests I cannot reproduce it.

Adept II

Re: OpenCL 8 GPU DGEMM (4,4 TFlop/s double precision). Programming infrastructure for heterogeneous computing and its applications.

Nice work, go on with your research. I have just a word on memcpy; my experience is from Windows, but maybe it's the same on Linux. memcpy is a blocking call, which means that if it is called from different threads it is executed serially. I got better performance with simple copy loops that run on every core (I have a dual octo-core (16) Xeon machine); they run in parallel. Memcpy was only fast with one thread feeding one GPU, but I have only 5 GPUs to feed. Maybe it works for you too.

Adept II

Re: OpenCL 8 GPU DGEMM (4,4 TFlop/s double precision). Programming infrastructure for heterogeneous computing and its applications.

At the moment, we have the 12.11 beta 1 drivers installed, and the `strings /usr/lib/ | grep GPU` command gives us a list of flags, including GPU_MAX_COMMAND_QUEUES.

$ strings /usr/lib/ | grep GPU

        and HD6XXX series GPU's only).

Generate 64-bit ELF binary for GPU (default: 32-bit)

Enable/disable float f/c ==> f * (1.0f/c) for GPU (default : on)

Disabling (-fno-inline) GPU inlining for testing


Virtual GPU List Ops Lock

GPU heap lock

Virtual GPU execution lock



Instruction writes to GPU memory.


But simply setting the environment variable has no effect with these drivers. Even GPU_ASYNC_MEM_COPY (on 6xxx Radeons it was important to set it to achieve overlap): overlap seems to work well when it is not set. If you have any idea what these flags actually mean and how to make them work, we'll be glad to use any information provided.