Tahiti unclear PCIe bandwidth limitation


It turns out that copy operations between the host and the Tahiti device are a significant bottleneck (measured with OpenCL profiling). I ran a short test and got rather slow copy rates that I cannot fully explain, hence my question: where is the remaining time lost?

I create a C++ / OpenCL buffer of size n megabytes (1 MB here is 1024^2 char), and in a loop of q iterations copy data to or from the device, e.g.:

int memSize = n * 1024 * 1024;

for (int iter = 0; iter < q; ++iter)
   commandQu.enqueueWriteBuffer(buf, CL_TRUE, 0, memSize, mem, 0, 0);


and measure the total loop time for 200 iterations (so q = 200). For a 7970 on PCIe 2.0 x16 with DDR3-1333 RAM I get:

n = 32 (32 MB): 1.49 s

n = 64 (64 MB): 2.89 s

n = 128 (128 MB): 5.65 s

n = 256 (256 MB): 11.56 s

n = 512 (512 MB): 22.98 s

So the run time is almost a linear function of the buffer size, and the launch overhead is therefore relatively small. However, I had expected shorter run times: PCIe 2.0 at x16 has a bandwidth of 8 GB/s; in the last case (n = 512) we transfer 200 * 512 MB, so under optimal conditions the copying should take only 12.5 to 13 seconds, but I measure almost double that figure.
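
To make the gap concrete, the timings above can be converted into an effective transfer rate. The following is a small sketch of that arithmetic only; the function names are made up for illustration, and it assumes decimal GB (1 GB = 1e9 bytes) for the result while keeping 1 MB = 1024^2 bytes for the buffer size, as defined earlier.

```cpp
#include <cassert>
#include <cstdio>

// Effective transfer rate in decimal GB/s (1 GB = 1e9 bytes).
// n_mb:       buffer size in MB, where 1 MB = 1024 * 1024 bytes (as in the post)
// iterations: number of timed copies (q in the post)
// seconds:    total wall-clock time for all copies
double effective_gb_per_s(int n_mb, int iterations, double seconds) {
    assert(seconds > 0.0);
    const double bytes =
        static_cast<double>(n_mb) * 1024.0 * 1024.0 * iterations;
    return bytes / seconds / 1e9;
}

// Print the effective rate for each measurement quoted above (q = 200).
void print_measured_rates() {
    const int    sizes[]   = {32, 64, 128, 256, 512};
    const double seconds[] = {1.49, 2.89, 5.65, 11.56, 22.98};
    for (int i = 0; i < 5; ++i) {
        std::printf("n = %3d MB: %.2f GB/s\n", sizes[i],
                    effective_gb_per_s(sizes[i], 200, seconds[i]));
    }
}
```

Calling print_measured_rates() from main shows every row landing between roughly 4.5 and 4.8 GB/s, i.e. a bit under 60% of the nominal 8 GB/s, independent of the buffer size.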

Now the question: where does the difference come from? Would I achieve 8 GB/s only if multiple host threads read from / write to multiple buffers in parallel? If not, what else could make the difference? Is there some practical way to get closer to the theoretical 8 GB/s limit?

BTW, I also noted that if the OpenCL device is a CPU (here an Intel SB), copying times are closer to the theoretical limit (which should be around 10.6 GB/s IIRC) but still quite far off, e.g. 14.70 s for the 512 MB iteration.

Any hints are much appreciated, and thanks!

Re: Tahiti unclear PCIe bandwidth limitation

First off, it would be better if you could share your code.

Secondly, it looks like you are not using pinned buffers for data transfer. Try something like:


Re: Tahiti unclear PCIe bandwidth limitation

Also, please take a look at the BufferBandwidth sample in the AMD APP SDK.