
sadrian
Adept I

Performance of a simple kernel does not scale on CPU

I have a kernel that simply multiplies elements of a large (512MB) buffer by a constant and puts the result into a new buffer. The kernel has 4 arguments: a pointer to the input buffer, a pointer to the output buffer, a float constant, and an integer number of reps. When I invoke clEnqueueNDRangeKernel, I set the global work size to be the number of threads that I want, and I set the reps kernel argument accordingly. So for a global work size of 1, I set the reps arg to the buffer size in floats. For a global work size of 2, I set the reps arg to half the buffer size in floats.
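The kernel itself is not shown in the post, so the following is only a rough model of the partitioning described above, written in plain Python rather than OpenCL C. It assumes each work-item handles one contiguous chunk of `reps` elements starting at `gid * reps`; the names `scale_chunk` and `run` are made up for illustration.

```python
def scale_chunk(src, dst, factor, reps, gid):
    """Model of one work-item: scale `reps` contiguous elements.

    Assumption: work-item `gid` owns the chunk starting at gid * reps.
    """
    start = gid * reps
    for i in range(start, start + reps):
        dst[i] = src[i] * factor


def run(src, factor, global_size):
    """Emulate the launch: reps is set so global_size * reps covers the buffer."""
    if len(src) % global_size != 0:
        raise ValueError("this sketch needs the buffer to divide evenly")
    reps = len(src) // global_size
    dst = [0.0] * len(src)
    # On a real device the work-items run concurrently; here they run in turn.
    for gid in range(global_size):
        scale_chunk(src, dst, factor, reps, gid)
    return dst
```

Whatever the global work size, every element is read once and written once, so the total memory traffic stays the same and only the number of threads issuing it changes.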

Running top in Linux, I can see that the number of threads does match the global work size that I set, each at 99% utilization. For a global work size of two, one thread operates on the first half of the buffer, while the second thread simultaneously (?) operates on the second half of the buffer. My problem is that the total throughput does not scale with the number of threads, even though I have dual 6-core Opterons:

1 thread: 900 MB/s

2 threads: 1100 MB/s

4 threads: 1300 MB/s

8 threads: 1300 MB/s

This surely can't be memory bandwidth limited. What am I missing?

realhet
Miniboss

Hi,

This test routine is really a memory bandwidth test: for each element it does 2 single-precision memory accesses (one read, one write) and 1 single-precision multiply. On modern CPUs, all of that can be done 4 times in a single clock (with SSE and proper instruction overlapping).

Try the test with smaller buffer sizes, so that the L1 and L2 caches can help. Even then it is memory-intensive code, and a single CPU core can already run it optimally.

"My problem is that the total throughput does not scale with the number of threads, even though I have dual 6-core Opterons:" -> You need to put about 10 times more math into the algorithm and try to work locally in memory; that way you can use all those CPU cores more effectively.

notzed
Challenger

"This surely can't be memory bandwidth limited. What am I missing?"

Your description is a textbook case of a worst-case, memory-bandwidth-limited algorithm.

So yes, it surely can be ...


It could be worse I suppose. He could have 2 threads pinned each to one of 2 sockets iterating with a stride of 64 bytes, but the second socket offset by 32 bytes, immediately after calling malloc on the buffer invoking the wrath of the cache coherency protocol while simultaneously getting nothing but cache misses, TLB misses, and page faults.

But in all seriousness, you're doing 1 load and 1 store for every floating-point operation, which is most certainly memory bound.
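To put a rough number on that, here is a small sketch of the arithmetic intensity (floating-point operations per byte moved) and the ceiling it implies. It assumes 4-byte single-precision floats and ignores caches; the function names and the 10 GB/s figure are hypothetical, chosen only for illustration.

```python
BYTES_PER_FLOAT = 4  # single precision


def arithmetic_intensity(loads, stores, flops):
    """Floating-point operations per byte of memory traffic."""
    return flops / ((loads + stores) * BYTES_PER_FLOAT)


def bandwidth_bound_gflops(intensity, bandwidth_gb_s):
    """Upper limit on GFLOP/s when memory bandwidth is the only limit."""
    return intensity * bandwidth_gb_s


# One load, one store, and one multiply per element.
intensity = arithmetic_intensity(loads=1, stores=1, flops=1)
print(intensity)                                # 0.125 flop per byte
print(bandwidth_bound_gflops(intensity, 10.0))  # 1.25 GFLOP/s at a hypothetical 10 GB/s
```

At one eighth of a floating-point operation per byte, extra cores have nothing to do once the memory bus is saturated.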

sadrian
Adept I

I ran the same kernel (except for buffer initialization) on a Radeon 5870. I empirically found the sweet spot for the global work items to be 64k (higher numbers produced errors in the output). The result was 15 GB/s. In my program, I was also transferring the data back and forth between buffers over a sustained period, but I was only including the byte count in one direction, so the numbers should be doubled. If I take into account that each multiply requires both a load and a store, the numbers should be doubled again to determine memory bandwidth. The final numbers for the memory bandwidth used are:

CPU: 5.2 GB/s

GPU: 60 GB/s

Both of these numbers are within a factor of two of the benchmark memory bandwidth for these devices.

www.cs.virginia.edu/stream/peecee/Bandwidth.html

www.sisoftware.co.uk
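The correction above is two doublings of the measured rate, which can be sketched as follows. The helper name `effective_bandwidth` is made up for illustration; the inputs are the one-direction figures reported in this thread (1300 MB/s on the CPU, 15 GB/s on the GPU).

```python
def effective_bandwidth(measured_one_way):
    """Apply the two corrections from the post to a one-direction rate.

    x2 because only one transfer direction was counted,
    x2 because each multiply needs both a load and a store.
    """
    both_directions = measured_one_way * 2
    load_and_store = both_directions * 2
    return load_and_store


cpu_gb_s = effective_bandwidth(1.3)   # 1300 MB/s measured on the CPU
gpu_gb_s = effective_bandwidth(15.0)  # 15 GB/s measured on the GPU
print(round(cpu_gb_s, 1), round(gpu_gb_s, 1))
```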


Hi,

This is just to say that we benchmarked a number of GPUs for the case you are considering. If you like, you can find the results here:

http://onlinelibrary.wiley.com/doi/10.1002/nme.3302/abstract

In any case, I paste the relevant results here:

| Platform | CPU | GPU | OS / GCC version |
|---|---|---|---|
| 1 | Intel Core 2 Duo E6600 | NVIDIA GTX 260 | 10.04 / 4.4 |
| 2 | AMD Phenom Quad core 9950 | NVIDIA GTX 280 | 10.10 / 4.4 |
| 3 | Intel Core 2 Quad Q9550 | NVIDIA GTX 280 | 10.04 / 4.4 |
| 4 | Intel Core i7 920 | NVIDIA GTX 285 | 9.04 / 4.3 |
| 5 | Intel Core i7 920 | ATI Radeon HD 5870 | 10.04 / 4.4 |

Table VI. Memory bandwidth for CPU and graphical processing unit on the benchmark platforms. RAM bandwidth in GB/s.

| Platform | CPU (Add) | GPU (Add) | GPU (Copy) |
|---|---|---|---|
| 1 | 2.11 | 61.86 | 95.14 |
| 2 | 7.62 | 71.16 | 118.16 |
| 3 | 5.03 | 71.44 | 119.49 |
| 4 | 13.50 | 80.77 | 130.95 |
| 5 | 16.53 | 80.62 | 117.60 |

Hope this is useful.

Riccardo
