I’ve been experiencing performance issues with OpenCL on AMD GPUs. Because the OpenCL implementations seemed to be faster than serial CPU code, I didn't think much about performance until the Developer Preview of Visual Studio 11 came out. The runtimes of OpenCL on AMD GPUs are terrible compared to C++ AMP. I'm trying to understand why.
I wrote a program to quantify the runtime of OpenCL and C++ AMP, which is available here (requires MS Visual Studio 11 Developer Preview). It performs matrix multiplication on all platforms and devices in OpenCL, and on all accelerators in C++ AMP, using the same algorithms. CUDA is not included in this implementation because it is not hardware independent, whereas OpenCL and C++ AMP are. The problem is specific to AMD GPUs; I do not see it with NVIDIA cards, where OpenCL and C++ AMP perform roughly equivalently.
For the comparison, I decided to use an AMD Llano machine (AMD A8-3850, Gigabyte GA-A75-UD4H M.B., 8GB memory, Windows 8 Developer Preview). This machine does not include NVIDIA software, only MS and the AMD APP SDK 2 version 1.2 (12/19/2011).
The output from the program is here:
has disp 1
dev AMD Radeon HD 6550D (Engineering Sample)
Starting serial... 1093.99 ms.
Starting serial... 1101.95 ms.
Starting serial... 1094.82 ms.
Starting simple... 70.6149 ms.
Starting simple... 60.582 ms.
Starting simple... 60.9937 ms.
Starting explicit... 58.338 ms.
Starting explicit... 58.1812 ms.
Starting explicit... 57.2286 ms.
Starting tile... 43.9958 ms.
Starting tile... 33.2602 ms.
Starting tile... 33.2874 ms.
has disp 0
dev Microsoft Basic Render Driver
Starting serial... 1082.92 ms.
Starting serial... 1083.16 ms.
Starting serial... 1083.2 ms.
Starting simple... 1418.89 ms.
Starting simple... 1420 ms.
Starting simple... 1412.86 ms.
Starting explicit... 1062.79 ms.
Starting explicit... 1056.32 ms.
Starting explicit... 1064.37 ms.
Starting tile... 883.104 ms.
Starting tile... 881.774 ms.
Starting tile... 863.205 ms.
has disp 1
dev Software Adapter
has disp 0
dev CPU accelerator
Number of platforms = 1
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.1 AMD-APP (851.6)
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
Platform extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_
devices = 2
type = CL_DEVICE_TYPE_GPU
name = BeaverCreek
Starting serial... 1085.57 ms.
Starting serial... 1085.6 ms.
Starting serial... 1085.19 ms.
Starting explicit simple... 602.773 ms.
Starting explicit simple... 673.661 ms.
Starting explicit simple... 832.318 ms.
Starting tile... 659.472 ms.
Starting tile... 554.391 ms.
Starting tile... 581.086 ms.
type = CL_DEVICE_TYPE_CPU
name = AMD A8-3850 APU with Radeon(tm) HD Graphics
Starting serial... 1085.84 ms.
Starting serial... 1085.71 ms.
Starting serial... 1085.37 ms.
Starting explicit simple... 240.423 ms.
Starting explicit simple... 241.033 ms.
Starting explicit simple... 239.877 ms.
Starting tile... 1538.41 ms.
Starting tile... 1388.03 ms.
Starting tile... 2050.87 ms.
The results indicate that C++ AMP typically multiplies the single-precision floating-point input matrices A[450 rows, 640 cols] and B[640 rows, 960 cols] in about 58 ms (about 33 ms with the tiled kernel). In comparison, the OpenCL implementation needs at best about 550 to 600 ms on the GPU. A CPU OpenCL device is also enumerated by the program, and surprisingly does better than the GPU, running in 240 ms.
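To put those runtimes on a common scale, they can be converted to throughput. The following is a minimal sketch, not part of the benchmark program; the function names `matmul_flops` and `gflops` are made up for illustration, and it assumes the usual count of one multiply and one add per inner-product term.

```cpp
// Floating-point operations for C = A * B with A[m x k] and B[k x n]:
// one multiply and one add per term of each inner product.
constexpr double matmul_flops(double m, double k, double n) {
    return 2.0 * m * k * n;
}

// Convert an operation count and a runtime in milliseconds to GFLOP/s.
constexpr double gflops(double flops, double ms) {
    return flops / (ms * 1.0e-3) / 1.0e9;
}
```

For A[450 x 640] and B[640 x 960], `matmul_flops(450, 640, 960)` is 552,960,000 operations. That works out to roughly 9.5 GFLOP/s at 58 ms, 2.3 GFLOP/s at 240 ms, and 0.92 GFLOP/s at 600 ms.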
This doesn’t make sense because almost all variables have been eliminated:
- the size of the problem is the same;
- the size of the tiling (16 by 16) is the same;
- the allocation of device memory is the same;
- the copies between the CPU memory space and the GPU memory space are the same;
- the kernel algorithms are the same (there are two: “explicit simple” is a non-shared-memory implementation, “tile” is a shared-memory implementation);
- the device should be the same, since it is the one that appears as a GPU in both OpenCL and C++ AMP.
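For readers who have not seen the two algorithms, here is a CPU-side sketch of the index arithmetic behind them. It is not the kernel code from the program: the names `Matrix`, `multiply_simple`, and `multiply_tiled` are hypothetical, the matrices are assumed to be row-major, and on a GPU the tile slices would be staged in local/shared memory rather than read straight from the arrays. One detail worth checking in any tiled kernel is edge handling, because 450 rows is not a multiple of 16.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix stored in a flat vector (illustrative type alias).
using Matrix = std::vector<float>;

// "Simple" form: each output element walks a full row of A and column of B.
Matrix multiply_simple(const Matrix& a, const Matrix& b,
                       std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    Matrix c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = sum;
        }
    return c;
}

// "Tiled" form: visit the output in tile x tile blocks and accumulate one
// tile-wide slice of the inner dimension at a time.
Matrix multiply_tiled(const Matrix& a, const Matrix& b,
                      std::size_t m, std::size_t k, std::size_t n,
                      std::size_t tile = 16) {
    assert(a.size() == m * k && b.size() == k * n && tile > 0);
    Matrix c(m * n, 0.0f);
    for (std::size_t i0 = 0; i0 < m; i0 += tile)
        for (std::size_t j0 = 0; j0 < n; j0 += tile)
            for (std::size_t p0 = 0; p0 < k; p0 += tile) {
                // Clamp block edges so partial tiles stay in bounds.
                const std::size_t i1 = std::min(i0 + tile, m);
                const std::size_t j1 = std::min(j0 + tile, n);
                const std::size_t p1 = std::min(p0 + tile, k);
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t j = j0; j < j1; ++j) {
                        float sum = c[i * n + j];  // partial sum so far
                        for (std::size_t p = p0; p < p1; ++p)
                            sum += a[i * k + p] * b[p * n + j];
                        c[i * n + j] = sum;
                    }
            }
    return c;
}
```

With the sizes used here, the call would be `multiply_tiled(a, b, 450, 640, 960)`, and both functions should produce the same product, which makes the simple version a convenient reference for checking a tiled kernel.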
In case clCreateProgramWithSource / clBuildProgram compilation is the culprit, I exclude those two steps from the measured runtime, but that does not help.
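The measurement pattern described above can be sketched as follows. This is an assumed harness, not the timing code from the program: `best_time_ms` is a hypothetical helper that runs a one-time setup step outside the timed region and reports the fastest of several timed runs. For OpenCL, the timed callable would also have to wait for the device to finish (for example with clFinish), because kernel enqueues are asynchronous.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Runs `setup` once without timing it (e.g. program build), then times
// `work` `runs` times and returns the fastest run in milliseconds.
double best_time_ms(const std::function<void()>& setup,
                    const std::function<void()>& work, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    setup();  // deliberately outside the timed region
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < runs; ++r) {
        const auto t0 = clock::now();
        work();  // must block until the computation has completed
        const auto t1 = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        best = std::min(best, ms);
    }
    return best;
}
```

Taking the fastest run rather than the first also filters out one-time warm-up costs, which may explain why the first "simple" and "tile" runs in the output above are slower than the later ones.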
On another machine which has an NVIDIA card, there is negligible difference between OpenCL, C++ AMP, and CUDA (both runtime and driver API implementations) on the NVIDIA GPU. However, that machine also has an AMD graphics card, which exhibits the same problem as the Llano machine: OpenCL on AMD GPU targets performs poorly compared to C++ AMP implementations that target the same GPU.
I suspect that the reason for the poor performance is that C++ AMP targets DirectX 11, which is implemented by the card itself, whereas OpenCL is translated into VLIW code. But I don't know any details of how OpenCL is implemented for AMD GPUs. Or it could be something simple that I've overlooked in the installation of the AMD APP SDK. I just don't know.
Comments would be appreciated.