

mfried
Adept II

How to get consistent results with AMD APU and GPU hardware vs others?

There's a lot of data presented in this question below, but the fundamental question I'm trying to ask is this:

Is there a good, consistent way to measure performance of AMD GPUs in OpenCL? Can anyone recommend settings for buffer sizes, amount of memory, type of memory, and so on? Perhaps I'm doing something wrong... but here are measurements from a variety of devices for a simple kernel...

I'm writing a number of benchmarking kernels and running them on multiple GPUs and CPUs. One of them is a global-memory write-bandwidth test. Here is the kernel used in the tests below:

__kernel void fillBuffer4( __global float4* __restrict buffer )
{
   int i = get_global_id( 0 );
   buffer[ i ] = 0;
}

The kernel is executed with a work-group size of 256x1x1 on all devices. I allocate a series of buffers, each of the device's maximum allocation size or 256 MiB (whichever is smaller), and the kernel fills them with zeroes on the device. My devices include a number of different GPUs from NVIDIA and AMD as well as CPUs from Intel and AMD, sometimes running both the AMD and Intel OpenCL implementations on the same CPU.
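Roughly, the host side looks like this (a simplified sketch, not my exact code: wall_clock_seconds() is a placeholder for whatever host timer the build uses, and names like run_fill_benchmark and CHUNK_BYTES are just illustrative):

#include <CL/cl.h>
#include <stdio.h>

#define CHUNK_BYTES (256u * 1024u * 1024u)   /* 256 MiB per chunk (or max alloc size, if smaller) */
#define MAX_CHUNKS  64

extern double wall_clock_seconds( void );    /* placeholder host timer */

static void run_fill_benchmark( cl_context ctx, cl_command_queue q,
                                cl_kernel fill_kernel, cl_ulong device_mem_bytes )
{
   cl_mem  bufs[ MAX_CHUNKS ];
   cl_uint num_chunks = (cl_uint)( device_mem_bytes / CHUNK_BYTES );  /* real code derives this from the device's reported memory */
   cl_uint allocated = 0;
   size_t  local  = 256;                                /* 256x1x1 work-group */
   size_t  global = CHUNK_BYTES / sizeof( cl_float4 );  /* one work-item per float4 */

   for( cl_uint c = 0; c < num_chunks && c < MAX_CHUNKS; ++c )
   {
      cl_int err;
      bufs[ c ] = clCreateBuffer( ctx, CL_MEM_READ_WRITE, CHUNK_BYTES, NULL, &err );
      if( err != CL_SUCCESS ) { printf( "Failed to allocate all buffers.\n" ); break; }
      ++allocated;

      clSetKernelArg( fill_kernel, 0, sizeof( cl_mem ), &bufs[ c ] );

      double t0 = wall_clock_seconds();
      clEnqueueNDRangeKernel( q, fill_kernel, 1, NULL, &global, &local, 0, NULL, NULL );
      clFinish( q );                                    /* wait for the fill to complete */
      double secs = wall_clock_seconds() - t0;

      printf( "Device bandwidth: %.2f GB/s on chunk %u/%u in %.9fs\n",
              (double)CHUNK_BYTES / secs * 1e-9, c + 1, num_chunks, secs );
   }

   for( cl_uint c = 0; c < allocated; ++c )
      clReleaseMemObject( bufs[ c ] );
}

The real program also handles the "allocated N GiB" and aggregate bandwidth reporting, but that's the shape of it.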

On my local workstation under Windows 7 x64 I have three GPUs, listed here in OpenCL enumeration order; my results look like this:

Device: Tesla C2075

Vendor: NVIDIA Corporation VendorId: 0x10DE

Version: OpenCL 1.1 CUDA

Memory: 5.25 GiB

Device bandwidth: 117.56 GB/s on chunk 1/20 in 0.002283392s

... 13 other values in the same area ...

Device bandwidth: 118.3 GB/s on chunk 15/20 in 0.00226912s

Failed to allocate all buffers.

Allocated 3.75 GiB memory.

Aggregate Device bandwidth: 118.77GB/s

(Apparently not all of the 5.25 GiB of ECC memory is allocatable through OpenCL here, but I get about 118 GB/s consistently.)
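(For reference, each "Device bandwidth" line is simply the chunk size divided by the measured time for that chunk, in decimal GB: 268,435,456 bytes / 0.002283392 s ≈ 117.56 GB/s for chunk 1 above.)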

Device: Tesla C1060

Vendor: NVIDIA Corporation VendorId: 0x10DE

Version: OpenCL 1.0 CUDA

Memory: 4.00 GiB

Device bandwidth: 75.42 GB/s on chunk 1/15 in 0.00355936s

...

Device bandwidth: 75.61 GB/s on chunk 15/15 in 0.003550112s

All buffers allocated successfully.

Allocated 3.75 GiB memory.

Aggregate Device bandwidth: 75.75GB/s

(In this case the device has just enough less than 4.00 GiB that the last 256 MiB chunk couldn't be fit, but the displayed value rounds up to 4.00.)

Cypress here = Radeon HD 5870:

Device: Cypress

Vendor: Advanced Micro Devices, Inc. VendorId: 0x1002

Version: OpenCL 1.1 AMD-APP (831.4)

Memory: 1 GiB

Device bandwidth: 30.43 GB/s on chunk 1/4 in 0.008821111s

Device bandwidth: .17 GB/s on chunk 2/4 in 1.575923667s

Device bandwidth: .17 GB/s on chunk 3/4 in 1.578026888s

Device bandwidth: .17 GB/s on chunk 4/4 in 1.561497444s

Hrm... That's odd... Let's run that again:

Device bandwidth: 68.03 GB/s on chunk 1/4 in 0.003945555s

Device bandwidth: .18 GB/s on chunk 2/4 in 1.495408555s

Device bandwidth: .18 GB/s on chunk 3/4 in 1.483596334s

Device bandwidth: 68.31 GB/s on chunk 4/4 in 0.003929889s

and again:

Device bandwidth: 68.59 GB/s on chunk 1/4 in 0.003913778s

Device bandwidth: .18 GB/s on chunk 2/4 in 1.491942777s

Device bandwidth: .18 GB/s on chunk 3/4 in 1.488972111s

Device bandwidth: .18 GB/s on chunk 4/4 in 1.488551667s

(Most other runs behaved like the second run, with two values near 68 GB/s and the other two near 0.18 GB/s.)

OK, my next thought was that perhaps this was just the Radeon HD 5870 or Windows, so I ran the test on other GPUs on my Linux boxes (the slightly different build of the code there reports timings in nanoseconds instead of seconds):

Device: GeForce GTX 580

Vendor: NVIDIA Corporation VendorId: 0x10DE

Version: OpenCL 1.1 CUDA

Memory: 3.00 GiB

Device bandwidth: 176.42 GB/s on chunk 1/11 in 1521536 NS

... 9 other values in the same area ...

Device bandwidth: 177.62 GB/s on chunk 11/11 in 1511328 NS

All buffers allocated successfully.

Allocated 2.75 GiB memory.

Aggregate Device bandwidth: 177.53GB/s

And...

Device: GeForce GTX 560 Ti

Vendor: NVIDIA Corporation VendorId: 0x10DE

Version: OpenCL 1.1 CUDA

Memory: 2.00 GiB

Device bandwidth: 117.94 GB/s on chunk 1/7 in 2276000 NS

... 5 other values in the same area ...

Device bandwidth: 117.79 GB/s on chunk 7/7 in 2279008 NS

All buffers allocated successfully.

Allocated 1.75 GiB memory.

Aggregate Device bandwidth: 117.72GB/s

Cayman here = Radeon HD 6970:

Device: Cayman

Vendor: Advanced Micro Devices, Inc. VendorId: 0x1002

Version: OpenCL 1.2 AMD-APP (923.1)

Memory: 1 GiB

Device bandwidth: 19.1 GB/s on chunk 1/4 in 14057111 NS

Device bandwidth: 112.88 GB/s on chunk 2/4 in 2378111 NS

Device bandwidth: 133.91 GB/s on chunk 3/4 in 2004666 NS

Device bandwidth: 134.25 GB/s on chunk 4/4 in 1999445 NS

All buffers allocated successfully.

Allocated 1 GiB memory.

Aggregate Device bandwidth: 52.53GB/s

And running it again:

Device bandwidth: 23.98 GB/s on chunk 1/4 in 11195112 NS

Device bandwidth: 113.03 GB/s on chunk 2/4 in 2374889 NS

Device bandwidth: 135.37 GB/s on chunk 3/4 in 1983000 NS

Device bandwidth: 132.6 GB/s on chunk 4/4 in 2024334 NS

And again...

Device bandwidth: 24.91 GB/s on chunk 1/4 in 10776333 NS

Device bandwidth: 112.99 GB/s on chunk 2/4 in 2375667 NS

Device bandwidth: 133.24 GB/s on chunk 3/4 in 2014667 NS

Device bandwidth: 105.85 GB/s on chunk 4/4 in 2536000 NS

So my observation from running this program over and over on different devices is that NVIDIA devices give me consistent performance, while AMD devices give me a mix of good and bad values. The _good_ news is that on the Radeon HD 7970 I only see one outlier, and it's not too bad an outlier, perhaps associated with device warm-up or IL finalization on the first launch or something (see the warm-up sketch after the Tahiti numbers below):

Device: Tahiti

Vendor: Advanced Micro Devices, Inc. VendorId: 0x1002

Version: OpenCL 1.2 AMD-APP (923.1)

Profile: FULL_PROFILE

Memory: 2 GiB

Device bandwidth: 125.29 GB/s on chunk 1/8 in 2142519 NS

Device bandwidth: 152.23 GB/s on chunk 2/8 in 1763407 NS

Device bandwidth: 152.11 GB/s on chunk 3/8 in 1764741 NS

Device bandwidth: 150.11 GB/s on chunk 4/8 in 1788296 NS

Device bandwidth: 151.64 GB/s on chunk 5/8 in 1770222 NS

Device bandwidth: 151.47 GB/s on chunk 6/8 in 1772148 NS

Device bandwidth: 150.39 GB/s on chunk 7/8 in 1784889 NS

Device bandwidth: 143.58 GB/s on chunk 8/8 in 1869629 NS

All buffers allocated successfully.

Allocated 2 GiB memory.

Aggregate Device bandwidth: 146.53GB/s

A different run:

Device bandwidth: 118.2 GB/s on chunk 1/8 in 2271111 NS

Device bandwidth: 151.17 GB/s on chunk 2/8 in 1775704 NS

Device bandwidth: 147.67 GB/s on chunk 3/8 in 1817778 NS

Device bandwidth: 151.41 GB/s on chunk 4/8 in 1772888 NS

Device bandwidth: 151.56 GB/s on chunk 5/8 in 1771111 NS

Device bandwidth: 151.44 GB/s on chunk 6/8 in 1772593 NS

Device bandwidth: 151.72 GB/s on chunk 7/8 in 1769333 NS

Device bandwidth: 142.28 GB/s on chunk 8/8 in 1886667 NS
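If that first-chunk dip really is warm-up or first-launch finalization cost, the obvious workaround I can think of is an untimed throwaway launch before the measured loop. A minimal sketch, reusing the illustrative names from the host-loop sketch near the top of the post:

   /* Untimed warm-up launch so first-run finalization cost doesn't land in chunk 1/N. */
   cl_int err;
   cl_mem warmup = clCreateBuffer( ctx, CL_MEM_READ_WRITE, CHUNK_BYTES, NULL, &err );
   clSetKernelArg( fill_kernel, 0, sizeof( cl_mem ), &warmup );
   clEnqueueNDRangeKernel( q, fill_kernel, 1, NULL, &global, &local, 0, NULL, NULL );
   clFinish( q );                       /* make sure the untimed launch has completed */
   clReleaseMemObject( warmup );
   /* ...then run the timed loop as before... */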

For the CPUs: 2x Intel Xeon E5-2643 measures about 9.3 GB/s under AMD's OpenCL and 18.8 GB/s under Intel's OpenCL. 2x Opteron 6274 under AMD's OpenCL comes in around 1.6 GB/s on the same test. On my Windows 7 x64 workstation with 2x Intel Xeon 5570, the values drop from 2.7 GB/s for the first 2 GB to 1.9 GB/s for the last 2 GB under Intel's OpenCL, and from 2.9 GB/s to 0.55 GB/s under AMD's OpenCL. On the A8-3870K APU, I get 1.3 GB/s when treating the device as a CPU; treating it as a GPU, where the maximum allocation size is only 128 MiB out of the 256 MiB of dedicated DDR3, I get roughly 9.6 GB/s and 31 GB/s for the two 128 MiB buffers (multiple runs show different behavior, but most settle near these two values). My workstation has 12 GB of DDR3 RAM and the other CPU hosts have at least 32 GB, and all the CPU device tests were limited to 4 GB, primarily to avoid OS swapping to disk.

So... I'm trying to measure all of these devices fairly and consistently as compute resources. I have other benchmarks, and mechanisms to run them on all the devices, but the real question is: is there something special I need to do to get consistent results? In an ideal world, I'd like people to be able to run their benchmarks against all of these nodes and expect more consistency in their results than I've been getting so far with this simple kernel.

3 Replies
notzed
Challenger

You don't specify how you're timing it.

TBH I get pretty variable results when timing on AMD as well, whereas I tended to see almost deterministic results on NVIDIA back when I used their cards. But the differences were only small - a few percentage points. One thing I've noticed is that the AMD cards I've had recently are more dependent on the order of execution, which suggests they are more aggressive about maintaining the global memory cache (and not, e.g., flushing it at each kernel invocation). The way AMD queues up jobs for execution also seems much more dependent on what's going on in the system (interrupts, context switches and so on).

Micro-benchmarks like this are very tricky things to get right, but these wild differences look like you're timing it incorrectly.
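If you're timing with host-side timers around the enqueue, it's worth at least comparing against the event profiling numbers. An illustrative fragment (assuming your queue was created with CL_QUEUE_PROFILING_ENABLE, and that 'queue', 'kernel', 'global' and 'local' are your existing objects and sizes):

   cl_event ev;
   clEnqueueNDRangeKernel( queue, kernel, 1, NULL, &global, &local, 0, NULL, &ev );
   clWaitForEvents( 1, &ev );

   cl_ulong t_start, t_end;
   clGetEventProfilingInfo( ev, CL_PROFILING_COMMAND_START, sizeof( t_start ), &t_start, NULL );
   clGetEventProfilingInfo( ev, CL_PROFILING_COMMAND_END,   sizeof( t_end ),   &t_end,   NULL );
   printf( "kernel execution: %llu ns\n", (unsigned long long)( t_end - t_start ) );
   clReleaseEvent( ev );

That measures just the kernel execution on the device and leaves out host-side queueing and dispatch jitter.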

Wenju
Elite

Hi mfried,

The more work-items you launch, the more accurate the bandwidth measurement will be (someone told me this). The type of memory depends on your needs. You should also look at the benchmark samples in the AMD APP SDK; you'll find ConstantBandwidth, LDSBandwidth and so on. The results differ between devices: they may be consistent on the HD 7970, but on an A10 they're not.

Thank you.

yurtesen
Miniboss

Can you attach your test program for download? (I can run Linux versions...)

Did you check whether your cards are switching to their maximum clock speeds when you run the benchmarks? (With GPU-Z on Windows, for example.)

Could the type of memory objects be relevant? What options are you using?

As far as I remember, I get consistent results from the SDK examples on Cypress and Tahiti...
