
How to get consistent results with AMD APU and GPU hardware vs others?

Question asked by mfried on Jun 26, 2012
Latest reply on Jun 28, 2012 by yurtesen

There's a lot of data presented below, but the fundamental question I'm trying to ask is this:

Is there any good, consistent way to measure performance with AMD GPUs in OpenCL? Are there any recommendations regarding buffer sizes, amount of memory, memory type, and so on? Perhaps I'm doing something wrong... but here are some measurements of a simple kernel on a variety of devices.

 

I'm writing a number of benchmarking kernels and running them on multiple GPUs and CPUs. One of them is a global memory write bandwidth test. Here is the kernel used in the tests below:

__kernel void fillBuffer4( __global float4* __restrict buffer )
{
   // Each work-item writes one float4 (16 bytes) of zeroes.
   int i = get_global_id( 0 );
   buffer[ i ] = (float4)( 0.0f );
}

 

The kernel is executed with a work-group size of 256x1x1 on all devices. I allocate a series of buffers, each the smaller of the device's maximum allocation size and 256 MiB, and the kernel fills them with zeroes on the device. My devices include a range of GPUs from NVIDIA and AMD as well as CPUs from Intel and AMD, sometimes running both the AMD and Intel OpenCL implementations on the same CPU.

 

On my local workstation under Windows 7 x64 I have 3 GPUs, listed in OpenCL enumeration order; my results look like this:

Device: Tesla C2075
Vendor: NVIDIA Corporation VendorId: 0x10DE
Version: OpenCL 1.1 CUDA
Memory: 5.25 GiB
Device bandwidth: 117.56 GB/s on chunk 1/20 in 0.002283392s
... 13 other values in the same area ...
Device bandwidth: 118.3 GB/s on chunk 15/20 in 0.00226912s
Failed to allocate all buffers.
Allocated 3.75 GiB memory.
Aggregate Device bandwidth: 118.77GB/s

(apparently not all of the reported 5.25 GiB of ECC memory can actually be allocated through OpenCL here, but I consistently get about 118 GB/s)

 

Device: Tesla C1060
Vendor: NVIDIA Corporation VendorId: 0x10DE
Version: OpenCL 1.0 CUDA
Memory: 4.00 GiB
Device bandwidth: 75.42 GB/s on chunk 1/15 in 0.00355936s
...
Device bandwidth: 75.61 GB/s on chunk 15/15 in 0.003550112s
All buffers allocated successfully.
Allocated 3.75 GiB memory.
Aggregate Device bandwidth: 75.75GB/s

(In this case the device actually reports slightly fewer bytes than 4 GiB, just few enough that a 16th 256 MiB buffer wouldn't fit, but the value rounds up to 4.00 GiB in the display.)

 

Cypress here = Radeon HD 5870:

Device: Cypress
Vendor: Advanced Micro Devices, Inc. VendorId: 0x1002
Version: OpenCL 1.1 AMD-APP (831.4)
Memory: 1 GiB
Device bandwidth: 30.43 GB/s on chunk 1/4 in 0.008821111s
Device bandwidth: .17 GB/s on chunk 2/4 in 1.575923667s
Device bandwidth: .17 GB/s on chunk 3/4 in 1.578026888s
Device bandwidth: .17 GB/s on chunk 4/4 in 1.561497444s

 

Hrm... That's odd... Let's run that again:

Device bandwidth: 68.03 GB/s on chunk 1/4 in 0.003945555s
Device bandwidth: .18 GB/s on chunk 2/4 in 1.495408555s
Device bandwidth: .18 GB/s on chunk 3/4 in 1.483596334s
Device bandwidth: 68.31 GB/s on chunk 4/4 in 0.003929889s

 

and again:

Device bandwidth: 68.59 GB/s on chunk 1/4 in 0.003913778s
Device bandwidth: .18 GB/s on chunk 2/4 in 1.491942777s
Device bandwidth: .18 GB/s on chunk 3/4 in 1.488972111s
Device bandwidth: .18 GB/s on chunk 4/4 in 1.488551667s

 

(most other runs behaved like the second one, with two values near 68 GB/s and the other two near 0.18 GB/s)

 

Ok, my next thought was that perhaps this was specific to the Radeon HD 5870 or to Windows, so I ran the test on other GPUs in my Linux boxes (this slightly different build of the code reports timings in nanoseconds instead of seconds):

Device: GeForce GTX 580
Vendor: NVIDIA Corporation VendorId: 0x10DE
Version: OpenCL 1.1 CUDA
Memory: 3.00 GiB
Device bandwidth: 176.42 GB/s on chunk 1/11 in 1521536 NS
... 9 other values in the same area ...
Device bandwidth: 177.62 GB/s on chunk 11/11 in 1511328 NS
All buffers allocated successfully.
Allocated 2.75 GiB memory.
Aggregate Device bandwidth: 177.53GB/s

 

And...

Device: GeForce GTX 560 Ti
Vendor: NVIDIA Corporation VendorId: 0x10DE
Version: OpenCL 1.1 CUDA
Memory: 2.00 GiB
Device bandwidth: 117.94 GB/s on chunk 1/7 in 2276000 NS
... 5 other values in the same area ...
Device bandwidth: 117.79 GB/s on chunk 7/7 in 2279008 NS
All buffers allocated successfully.
Allocated 1.75 GiB memory.
Aggregate Device bandwidth: 117.72GB/s

 

Cayman here = Radeon HD 6970:

Device: Cayman
Vendor: Advanced Micro Devices, Inc. VendorId: 0x1002
Version: OpenCL 1.2 AMD-APP (923.1)
Memory: 1 GiB
Device bandwidth: 19.1 GB/s on chunk 1/4 in 14057111 NS
Device bandwidth: 112.88 GB/s on chunk 2/4 in 2378111 NS
Device bandwidth: 133.91 GB/s on chunk 3/4 in 2004666 NS
Device bandwidth: 134.25 GB/s on chunk 4/4 in 1999445 NS
All buffers allocated successfully.
Allocated 1 GiB memory.
Aggregate Device bandwidth: 52.53GB/s

 

And running it again:

Device bandwidth: 23.98 GB/s on chunk 1/4 in 11195112 NS
Device bandwidth: 113.03 GB/s on chunk 2/4 in 2374889 NS
Device bandwidth: 135.37 GB/s on chunk 3/4 in 1983000 NS
Device bandwidth: 132.6 GB/s on chunk 4/4 in 2024334 NS

 

And again...

Device bandwidth: 24.91 GB/s on chunk 1/4 in 10776333 NS
Device bandwidth: 112.99 GB/s on chunk 2/4 in 2375667 NS
Device bandwidth: 133.24 GB/s on chunk 3/4 in 2014667 NS
Device bandwidth: 105.85 GB/s on chunk 4/4 in 2536000 NS

 

So my observation from running this program repeatedly on different devices is that the NVIDIA devices give consistent performance, while the AMD devices give a mix of good and bad values. The _good_ news is that on the Radeon HD 7970 I see only one outlier, and it's not too bad of an outlier (perhaps associated with device warm-up or IL finalization on first launch):

Device: Tahiti
Vendor: Advanced Micro Devices, Inc. VendorId: 0x1002
Version: OpenCL 1.2 AMD-APP (923.1)
Profile: FULL_PROFILE
Memory: 2 GiB
Device bandwidth: 125.29 GB/s on chunk 1/8 in 2142519 NS
Device bandwidth: 152.23 GB/s on chunk 2/8 in 1763407 NS
Device bandwidth: 152.11 GB/s on chunk 3/8 in 1764741 NS
Device bandwidth: 150.11 GB/s on chunk 4/8 in 1788296 NS
Device bandwidth: 151.64 GB/s on chunk 5/8 in 1770222 NS
Device bandwidth: 151.47 GB/s on chunk 6/8 in 1772148 NS
Device bandwidth: 150.39 GB/s on chunk 7/8 in 1784889 NS
Device bandwidth: 143.58 GB/s on chunk 8/8 in 1869629 NS
All buffers allocated successfully.
Allocated 2 GiB memory.
Aggregate Device bandwidth: 146.53GB/s

 

A different run:

Device bandwidth: 118.2 GB/s on chunk 1/8 in 2271111 NS
Device bandwidth: 151.17 GB/s on chunk 2/8 in 1775704 NS
Device bandwidth: 147.67 GB/s on chunk 3/8 in 1817778 NS
Device bandwidth: 151.41 GB/s on chunk 4/8 in 1772888 NS
Device bandwidth: 151.56 GB/s on chunk 5/8 in 1771111 NS
Device bandwidth: 151.44 GB/s on chunk 6/8 in 1772593 NS
Device bandwidth: 151.72 GB/s on chunk 7/8 in 1769333 NS
Device bandwidth: 142.28 GB/s on chunk 8/8 in 1886667 NS

 

CPU bandwidths for 2x Intel Xeon E5-2643 are about 9.3 GB/s under AMD's OpenCL and 18.8 GB/s under Intel's OpenCL. The same test on 2x Opteron 6274 with AMD's OpenCL comes in around 1.6 GB/s. On my Windows 7 x64 workstation with 2x Intel Xeon 5570 CPUs, the values drop from 2.7 GB/s for the first 2 GB to 1.9 GB/s for the last 2 GB under Intel's OpenCL, and from 2.9 GB/s to 0.55 GB/s under AMD's. On the A8-3870K APU, the result is 1.3 GB/s when treating the device as a CPU. Treating it as a GPU, the maximum allocation size is only 128 MiB out of the 256 MiB of dedicated DDR3, and the two chunks come out around 9.6 GB/s and 31 GB/s (multiple runs show different behavior, but most show roughly these two values for the two halves of device memory). My workstation has 12 GB of DDR3 RAM; all the other CPU hosts had at least 32 GB, and all CPU device tests were limited to 4 GB (primarily to avoid the OS swapping to disk).

 

So... I'm trying to measure all of these devices fairly and consistently as compute resources. I have other benchmarks, and mechanisms to run them on all the devices, but the real question is: is there something special I need to do to get consistent results? Ideally, people could run their benchmarks against all these nodes and expect more consistency in their results than I have been getting so far with this simple kernel.
