I have a kernel that simply multiplies each element of a large (512 MB) buffer by a constant and writes the result into a new buffer. The kernel takes four arguments: a pointer to the input buffer, a pointer to the output buffer, a float constant, and an integer number of reps. When I invoke clEnqueueNDRangeKernel, I set the global work size to the number of threads I want, and I set the reps kernel argument to match. So for a global work size of 1, I set reps to the buffer size in floats; for a global work size of 2, I set reps to half the buffer size in floats.
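Roughly, each work-item does the following. This is a plain-C sketch rather than the actual kernel source, so it runs without an OpenCL runtime. The function names are made up for illustration, and it assumes that work-item gid handles the contiguous chunk starting at gid * reps and that the buffer size divides evenly by the global work size.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one work-item. In a real OpenCL kernel, gid would come from
 * get_global_id(0) and the pointers would be __global. The names here
 * are illustrative, not taken from the actual kernel. */
void scale_work_item(size_t gid, const float *in, float *out,
                     float k, size_t reps)
{
    size_t start = gid * reps;   /* first element of this item's chunk */
    for (size_t i = 0; i < reps; ++i)
        out[start + i] = in[start + i] * k;
}

/* Host-side stand-in for clEnqueueNDRangeKernel: runs the work-items
 * one after another instead of in parallel. */
void run_scale(const float *in, float *out, float k,
               size_t n_floats, size_t global_size)
{
    /* Assumption: the buffer splits evenly across work-items. */
    assert(global_size > 0 && n_floats % global_size == 0);
    size_t reps = n_floats / global_size;   /* the "reps" kernel argument */
    for (size_t gid = 0; gid < global_size; ++gid)
        scale_work_item(gid, in, out, k, reps);
}
```

Calling run_scale(in, out, 2.0f, n, 2) mimics a global work size of 2: work-item 0 covers the first half of the buffer and work-item 1 covers the second half.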
Running top on Linux, I can see that the number of threads matches the global work size I set, with each thread at 99% CPU utilization. For a global work size of 2, one thread operates on the first half of the buffer while the second thread simultaneously (?) operates on the second half. My problem is that the total throughput does not scale with the number of threads, even though I have dual 6-core Opterons:
1 thread: 900 MB/s
2 threads: 1100 MB/s
4 threads: 1300 MB/s
8 threads: 1300 MB/s
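To put numbers on the scaling, here is a small sketch that turns the figures above into speedup and parallel efficiency. Defining efficiency as speedup divided by thread count is my own framing, and the function names are illustrative.

```c
#include <assert.h>

/* Speedup of a measured throughput relative to the 1-thread baseline. */
double speedup(double throughput_mb_s, double baseline_mb_s)
{
    assert(baseline_mb_s > 0.0);
    return throughput_mb_s / baseline_mb_s;
}

/* Parallel efficiency: speedup divided by thread count (1.0 is ideal). */
double efficiency(double throughput_mb_s, double baseline_mb_s, int threads)
{
    assert(threads > 0);
    return speedup(throughput_mb_s, baseline_mb_s) / threads;
}
```

With the 900 MB/s single-thread run as the baseline, 2 threads give about 1.22x (61% efficiency), 4 threads about 1.44x (36%), and 8 threads still about 1.44x (18%).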
Throughput this low surely can't be memory-bandwidth limited. What am I missing?