mz24cn
Adept II

Why are Hawaii/Spectre (R9 290X/A10-7870K R7) five to ten times slower than Intel/NVIDIA on atomic add operations?

The VS2013 project with the code is attached.

AMD driver version is 15.2 WHQL (OpenCL 2.0/1.2 1800.3).

The results are below:

//NUM_PARALLEL = 1024;
Platform: AMD Accelerated Parallel Processing
Hawaii
27382 microseconds.
858994483       858996531       5120    7168
Spectre
25421 microseconds.
858994483       858996531       5120    7168
Cypress
24200 microseconds.
858994483       858996531       5120    7168

//NUM_PARALLEL = 1024000000;
Platform: AMD Accelerated Parallel Processing
Hawaii
7784847 microseconds.
1882993459      3930993459      825032704       2873032704
Spectre
64862811 microseconds.
1882993459      3930993459      825032704       2873032704
Cypress
144374737 microseconds.
1882993459      3930993459      825032704       2873032704

//NUM_PARALLEL = 1024;
Platform: NVIDIA CUDA
GeForce GTX 850M
2006 microseconds.
858994483       858996531       5120    7168

Platform: Intel(R) OpenCL
Intel(R) HD Graphics 4600
3788 microseconds.
858994483       858996531       5120    7168

//NUM_PARALLEL = 1024000000;
Platform: NVIDIA CUDA
GeForce GTX 850M
568615 microseconds.
1882993459      3930993459      825032704       2873032704

Platform: Intel(R) OpenCL
Intel(R) HD Graphics 4600
12791289 microseconds.
1882993459      3930993459      825032704       2873032704

GTX 850M only has 640 shaders while Hawaii has 2816.


Hi,

In your program, all of the millions of work items perform atomic operations on the same 4 DWORDs, so their work is serialized. In this kind of serialization scenario it makes no difference how wide the GPU is; it is bottlenecked on atomic memory access to those 4 DWORDs. It is like trying to breathe through a straw while running a marathon.

If you allocate more DWORDs for the atomic operations, performance will improve rapidly.
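The idea of spreading the atomics over more DWORDs can be sketched on the CPU side. This is a minimal C++ analogue, not the OpenCL kernel from the attached project: `ShardedCounter`, its shard count, and the 64-byte padding are illustrative assumptions. Each caller adds to the slot selected by its id, and the partial sums are merged once at the end.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One counter per shard. The 64-byte alignment is an assumed cache-line size,
// chosen so that neighbouring shards do not share a line.
struct alignas(64) Shard {
    std::atomic<std::uint32_t> value{0};
};

// Hypothetical helper: spreads atomic adds over several addresses
// instead of funnelling every add through a single DWORD.
class ShardedCounter {
public:
    explicit ShardedCounter(std::size_t num_shards) : shards_(num_shards) {
        assert(num_shards > 0);
    }

    // work_item_id stands in for get_global_id(0) in a kernel:
    // consecutive ids land on different shards.
    void add(std::size_t work_item_id, std::uint32_t amount) {
        shards_[work_item_id % shards_.size()].value.fetch_add(
            amount, std::memory_order_relaxed);
    }

    // Merge the partial sums. Call only after all adders have finished.
    // The sum wraps modulo 2^32, like a 32-bit device counter.
    std::uint32_t total() const {
        std::uint32_t sum = 0;
        for (const Shard& s : shards_) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::vector<Shard> shards_;
};
```

With one shard this degenerates to the fully serialized case described above; with more shards, concurrent adders collide less often, at the cost of a final merge pass.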

A couple of side notes:

  • Your NDRange size is determined based on 'CL_DEVICE_MAX_WORK_ITEM_SIZES'  parameter which may differ from one platform to another, hence you may be running different workloads on different platforms.
  • Your time measurement scope includes memory copies. Integrated graphics copy memory much faster than discrete GPUs because discrete GPUs have to go through the relatively slow PCI-e bus.

Tzachi

Hi Tzachi,

Thank you for the official reply from AMD.

The code I posted is only meant to compare the efficiency of atomic operations. I am optimizing some code by changing the device memory scheme from a pre-allocated region per work item to dynamic allocation by each work item itself, which requires an atomic add on a shared pointer to the free space. The code becomes more flexible at the cost of the atomic operations. The test code is only meant to find the tradeoffs.
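The allocation scheme described above (an atomic add on a shared pointer to free space) can be sketched as a bump allocator. This is a CPU-side C++ illustration under stated assumptions, not the poster's kernel code: `BumpAllocator`, `kAllocFailed`, and `allocate_batch` are hypothetical names, offsets are counted in elements, and the pool is assumed to be sized so that the 32-bit cursor never wraps.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel offset returned when a request does not fit (hypothetical convention).
constexpr std::uint32_t kAllocFailed = 0xFFFFFFFFu;

class BumpAllocator {
public:
    explicit BumpAllocator(std::uint32_t capacity) : next_(0), capacity_(capacity) {
        assert(capacity < kAllocFailed);
    }

    // Reserve `count` elements with a single atomic add and return the start offset.
    // Caveat: a failed request still advances the cursor, so this sketch assumes
    // the total demand never wraps the 32-bit cursor.
    std::uint32_t allocate(std::uint32_t count) {
        const std::uint32_t start = next_.fetch_add(count, std::memory_order_relaxed);
        if (start > capacity_ || count > capacity_ - start) return kAllocFailed;
        return start;
    }

    // One atomic for a whole group of requests: reserve the summed demand,
    // then hand out sub-ranges with plain arithmetic.
    std::vector<std::uint32_t> allocate_batch(const std::vector<std::uint32_t>& counts) {
        std::uint32_t total = 0;
        for (std::uint32_t c : counts) total += c;
        std::vector<std::uint32_t> offsets(counts.size(), kAllocFailed);
        const std::uint32_t base = allocate(total);
        if (base == kAllocFailed) return offsets;
        std::uint32_t running = base;
        for (std::size_t i = 0; i < counts.size(); ++i) {
            offsets[i] = running;
            running += counts[i];
        }
        return offsets;
    }

private:
    std::atomic<std::uint32_t> next_;
    const std::uint32_t capacity_;
};
```

The batch variant illustrates one possible tradeoff: issuing one atomic per group of requests instead of one per request reduces traffic on the shared cursor, at the cost of first collecting the group's sizes.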

I would like to discuss a few points in your reply.

You say: "In this kind of serialization scenario it makes no difference how wide the GPU is".

Do you mean that more stream processors do not help atomic operations, so Hawaii's 2816 stream processors are no better than the NVIDIA GTX 850M's 640? Even so, it is hard to understand why Hawaii is ten times slower than the GTX 850M. The GTX 850M is a mobile GPU, also discrete, with DDR3 memory, while Hawaii has the faster GDDR5. Is it related to the CPU? The GTX 850M is installed in a laptop with an Intel i7-4710MQ CPU and DDR3-1600; Hawaii is installed in a desktop with an A10-7870K, the A88 chipset, and DDR3-1866.

As you say "Your time measurement scope includes memory copies. Integrated graphics copy memory much faster than discrete GPUs because discrete GPUs have to go through the relatively slow PCI-e bus."

Yes, I know. That is exactly the strength of APUs and Intel integrated GPUs. However, I am comparing Hawaii with the GTX 850M, and Spectre (R7 series) with the Intel HD 4600, and both AMD devices perform worse. I use memory map/unmap operations to display the addition results, which prevents the compiler from optimizing the code away. For large NUM_PARALLEL, the overhead of the map/unmap operations is negligible.

As for "Your NDRange size is determined based on 'CL_DEVICE_MAX_WORK_ITEM_SIZES'  parameter", I think the code maximizes performance for each GPU. Do you mean it is unfair to Hawaii? Could you please tell me how to set the NDRange size to make Hawaii run faster?

I know AMD and NVIDIA/Intel devices are based on different hardware architectures, so performance differences in different areas are normal. But Hawaii is AMD's fastest discrete GPU (except for Fury) and the A10-7870K has AMD's fastest integrated GPU, yet both are much slower than competitors that are not top of the line, which is really unexpected.

Hi Tzachi,

A member (elavram, who has not been whitelisted yet) sent me a PM saying it is a driver issue that only occurs in the x86_64 version; the x86 version is okay.

Could you please verify it?


I also compiled a 32-bit version; unfortunately, the performance is the same as the 64-bit version. I use driver 1800.3.

elavram told me he observed the 32-bit version running 10 times faster. He uses version 1642.5.


If you need atomic access to 4 DWORDs, you can use the 'cl_ext_atomic_counters_32' extension. It uses on-chip memory and is an order of magnitude faster than global memory.

Edit - Did you verify you are launching the same amount of work items on all platforms?


No, I cannot use atomic_inc with the 'cl_ext_atomic_counters_32' extension. It does not meet my needs.

As for "Edit - Did you verify you are launching the same amount of work items on all platforms?":

Yes, my code displays the computed results, which depend on the total number of work items. All platforms show the same results for the same NUM_PARALLEL (number of work items) parameter.
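The printed results can serve as a checksum on the number of work items. The sketch below is an inference from the posted output, not from the attached source: it assumes the four counters start at 0x33333333, 0x33333333, 0, 0 and that each work item adds 1, 3, 5, 7 respectively, with 32-bit wraparound. Under those assumptions the formula reproduces the posted numbers for both NUM_PARALLEL values. The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Expected final value of one 32-bit counter after `work_items` atomic adds of `step`.
// The cast truncates modulo 2^32, matching the wraparound of a 32-bit unsigned add.
std::uint32_t expected_counter(std::uint32_t initial, std::uint32_t step,
                               std::uint64_t work_items) {
    return static_cast<std::uint32_t>(initial + step * work_items);
}

// Compare four printed results against the expectation for a given work-item count.
bool results_match(std::uint64_t work_items, const std::uint32_t printed[4]) {
    // Initial values and steps are inferred from the posted output (assumption).
    const std::uint32_t initial[4] = {0x33333333u, 0x33333333u, 0u, 0u};
    const std::uint32_t step[4] = {1u, 3u, 5u, 7u};
    for (std::size_t i = 0; i < 4; ++i) {
        if (expected_counter(initial[i], step[i], work_items) != printed[i]) {
            return false;
        }
    }
    return true;
}
```

For example, 5 × 1024000000 = 5120000000, which wraps to 825032704 modulo 2^32, the third value printed for the large run. A device that launched a different number of work items would print different values.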

BTW, is the on-chip memory you referred to the GPU cache?


When evaluating the severity of this performance issue, it should be considered together with a bug: Re: Bug report: Hawaii returns no results when printf removed  tzachi.cohen

elavram
Adept I

Hello!

I made some measurements using two different drivers. I ran your executables and saw some differences. Maybe this only happens on my system. Runtimes are in milliseconds. I made another post describing the same behavior in my kernels. GPU: FirePro W8100 (Hawaii). Note the small increase in runtime, VGPRs and SGPRs.

   

Columns for each of x86 and x64: runtime, VGPRs, SGPRs, waves, occupancy

Driver version              x86               x64
13.352.1014 / 1.2-1411.4    9256035134100%    9111445134100%
14.502.1019 / 2.0-1642.5    9657336134100%    11006719134100%