This test routine is more like a memory bandwidth test: per element it consists of one single-precision (SP) read, one SP write, and one SP multiply. On modern CPUs all of those can be done four times per clock cycle (with SSE and proper instruction overlapping).
Try the test with smaller buffer sizes, so that the L1 and L2 caches can help. Even then it is memory-intensive code: a single CPU core can already run it optimally.
"My problem is that the total throughput does not scale with the number of threads, even though I have dual 6-core Opterons:" -> You need to put about 10 times more math into the algorithm and try to work locally in memory; that way you can use all those CPU cores more effectively.
"This surely can't be memory bandwidth limited. What am I missing?"
What you describe is a textbook worst case of a memory-bandwidth-limited algorithm.
So yes, it surely can be ...
It could be worse, I suppose. He could have 2 threads, each pinned to one of 2 sockets, iterating with a stride of 64 bytes, with the second socket offset by 32 bytes, immediately after calling malloc on the buffer, invoking the wrath of the cache coherency protocol while simultaneously getting nothing but cache misses, TLB misses, and page faults.
But in all seriousness, you're doing 1 load and 1 store for every floating point operation, which is most certainly memory bound.
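To put a rough number on that, here is a minimal roofline-style sketch in Python. It assumes each element costs one 4-byte load, one 4-byte store, and one multiply, as described above. The function names and the peak and bandwidth figures are hypothetical placeholders for illustration, not measurements from this thread.

```python
def arithmetic_intensity(flops_per_element, loads_per_element=1,
                         stores_per_element=1, bytes_per_value=4):
    """FLOPs performed per byte moved to or from memory."""
    bytes_moved = (loads_per_element + stores_per_element) * bytes_per_value
    return flops_per_element / bytes_moved


def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity):
    """Roofline bound: the lower of the compute limit and the memory limit."""
    return min(peak_gflops, bandwidth_gb_s * intensity)


# Hypothetical machine, for illustration only (not a measured system).
PEAK_GFLOPS = 100.0
BANDWIDTH_GB_S = 10.0

scale_kernel = arithmetic_intensity(flops_per_element=1)     # 1 multiply per element
heavier_kernel = arithmetic_intensity(flops_per_element=10)  # 10x more math per element

print(attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, scale_kernel))
print(attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, heavier_kernel))
```

Under these assumed figures the one-multiply kernel is capped at 1.25 GFLOP/s no matter how many cores run it, while ten times more math per element raises the cap to 12.5 GFLOP/s. Adding threads only helps once the memory limit stops being the smaller of the two.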
I ran the same kernel (except for buffer initialization) on a Radeon 5870. I empirically found the sweet spot for the global work items to be 64k (higher numbers produced errors in the output). The result was 15 GB/s. In my program, I was also transferring the data back and forth between buffers over a sustained period, but I was only including the byte count in one direction, so the numbers should be doubled. If I take into account that each multiply requires both a load and a store, the numbers should be doubled again to determine memory bandwidth. The final numbers for the memory bandwidth used are:
CPU: 5.2 GB/s
GPU: 60 GB/s
Both of these numbers are within a factor of two of the benchmark memory bandwidth for these devices.
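The two doublings for the GPU figure can be written out as a small Python sketch. The function and parameter names are hypothetical; the only input taken from the post is the 15 GB/s one-direction rate.

```python
def memory_bandwidth_used(reported_gb_s, directions_counted=1,
                          directions_actual=2, accesses_per_element=2):
    """Scale a one-direction byte-count rate up to total memory traffic.

    directions_*: the data moved back and forth, but only one direction
    was counted in the reported rate.
    accesses_per_element: each multiply needs one load and one store.
    """
    direction_factor = directions_actual / directions_counted
    return reported_gb_s * direction_factor * accesses_per_element


print(memory_bandwidth_used(15.0))  # 15 * 2 * 2 = 60.0 GB/s
```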
This is just to say that we benchmarked a number of GPUs for the case you are considering. If you like, you can find the results here
In any case, I am pasting the relevant results below:
Platform  CPU                        GPU                 OS / GCC version
1         Intel Core 2 Duo E6600     NVIDIA GTX 260      10.04 / 4.4
2         AMD Phenom Quad core 9950  NVIDIA GTX 280      10.10 / 4.4
3         Intel Core 2 Quad Q9550    NVIDIA GTX 280      10.04 / 4.4
4         Intel Core i7 920          NVIDIA GTX 285      9.04 / 4.3
5         Intel Core i7 920          ATI Radeon HD 5870  10.04 / 4.4
Table VI. Memory bandwidth for CPU and graphical processing unit on the benchmark platforms.
RAM bandwidth (GB/s)
Platform  CPU (Add)  GPU (Add)  GPU (Copy)
1         2.11       61.86      95.14
2         7.62       71.16      118.16
3         5.03       71.44      119.49
4         13.50      80.77      130.95
5         16.53      80.62      117.60
Hope this is useful.