I turns out that copy operations between the host and Tahiti device are quite some bottleneck (profiled using OpenCL profiling). I made a short test and got results which are not fully clear to me (rather slow copy rates) and I hence my question where the residual time is lost:
I create a C++ / OpenCL buffer of size n megabytes (1 MB here is 1024^2 char), and in a loop of q iterations copy data to or from the device, eg:
int memSize = n * 1024 * 1024;
for (int iter = 0; iter < q; ++iter)
commandQu.enqueueWriteBuffer(buf, CL_TRUE, 0, memSize, mem, 0, 0);
and measure the total looping time for 200 iterations (so q = 200). For a 7970 at PCIe 2.0, x16 and RAM DDR3-1333 I get:
n = 32 (32 MB): 1.49 s
n = 64 (64 MB): 2.89 s
n = 128 (128 MB): 5.65 s
n = 256 (256 MB): 11.56 s
n = 512 (512 MB): 22.98 s
So the run-time is obviously almost a linear function of the buffer size, and the launch-overhead thus relatively small. However I had expected shorter runtimes: PCIe 2.0 at x16 has a bandwidth of 8 GB/s; for the last case (n = 512) we get 200 * 512 MB and hence under optimal conditions the copying should take only 12.5 - 13 seconds -> but I am almost double that figure.
Now the question: Where does the difference come from? Would I achieve 8 GB/s only if multiple host threads reading / writing to multiple buffers in parallel are used? If not, what else could make the difference? Is there some practical way to get closer to the theoretical 8 GB/s limit?
BTW, I also noted that if the OpenCL device is a CPU (here an Intel SB) copying times are closer to the theroretical limit (which should be around 10.6 GB/s IIRC) but still quite off, e.g. 14.70 s for the 512 MB iteration.
any hints much appreciated and thanks!