In your program, all one million work items are performing atomic operations on the same 4 DWORDs, so their work is bound to be serialized. In this kind of serialization scenario it makes no difference how wide the GPU is; it is bottlenecked on atomic memory access to 4 DWORDs. It is like trying to breathe through a straw while running a marathon.
If you allocate more DWORDs for the atomic operations, performance will improve rapidly.
A couple of side notes:
- Your NDRange size is determined based on 'CL_DEVICE_MAX_WORK_ITEM_SIZES' parameter which may differ from one platform to another, hence you may be running different workloads on different platforms.
- Your time measurement scope includes memory copies. Integrated graphics copy memory much faster than discrete GPUs because discrete GPUs have to go through the relatively slow PCI-e bus.
Thank you for the official reply from AMD.
The code I demonstrated is just for comparing atomic-operation efficiency. I'm optimizing some code by changing from pre-allocating device memory for each work item to dynamic allocation by each work item itself, which requires atomic addition on a shared pointer to free space. The code becomes more flexible at the cost of atomic contention; the test code is just for finding the trade-off.
I would like to discuss a few things on your reply.
As you say, "in this kind of serialization scenario it makes no difference how wide the GPU is":
Do you mean that more stream processors do not help with atomic operations? So Hawaii's 2816 stream processors are no better than the NVIDIA GTX 850M's 640? But it is still hard to understand why Hawaii is ten times slower than the GTX 850M, since the GTX 850M is a mobile GPU, also a discrete GPU, with DDR3 memory, while Hawaii's memory is the faster GDDR5. Is it related to the CPU? The GTX 850M is installed in a laptop with an Intel i7-4710MQ CPU and DDR3-1600; Hawaii is installed in a desktop with an A10-7870K, an A88 chipset, and DDR3-1866.
As you say "Your time measurement scope includes memory copies. Integrated graphics copy memory much faster than discrete GPUs because discrete GPUs have to go through the relatively slow PCI-e bus."
Yes, I know. That is exactly the strength of the APU / Intel integrated GPU. However, I compared Hawaii with the GTX 850M, and Spectre (R7 series) with the Intel HD 4600, and both AMD devices got worse performance. I use memory map/unmap operations to read back the addition results, which prevents the compiler from deleting the code during optimization. For large NUM_PARALLEL, the overhead of the map/unmap operations is negligible.
As for "Your NDRange size is determined based on 'CL_DEVICE_MAX_WORK_ITEM_SIZES' parameter": I think the code maximizes performance for each GPU. Do you mean this is unfair to Hawaii? Could you please tell me how to set an NDRange size that makes Hawaii run faster?
I know AMD and NVIDIA/Intel devices are based on different hardware architectures, so it is normal to see performance differences in different areas. But considering that Hawaii is AMD's fastest discrete GPU (except for Fury) and the A10-7870K is AMD's fastest integrated GPU, the fact that both are much slower than competitors that are not even top-of-the-line is really unexpected.
A member (elavram, who has not been whitelisted yet) PMed me that it is a driver issue which only occurs in the x86_64 version; the x86 version is okay.
Could you please verify it?
I also compiled a 32-bit version; unfortunately, the performance is the same as the 64-bit version. I use driver 1800.3.
elavram told me he observed it running 10 times faster in the 32-bit version. He uses driver version 1642.5.
If you need atomic access to 4 DWORDs you can use the 'cl_ext_atomic_counters_32' extension. It uses on-chip memory and is an order of magnitude faster than global memory.
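For reference, a minimal kernel sketch using the extension (assuming a device that exposes cl_ext_atomic_counters_32; the counter is bound as a kernel argument of type counter32_t and supports atomic_inc/atomic_dec):

```c
#pragma OPENCL EXTENSION cl_ext_atomic_counters_32 : enable

/* Each work item claims a unique output slot by bumping the
 * on-chip counter. */
kernel void fill(counter32_t cnt, global uint *out)
{
    uint slot = atomic_inc(cnt);   /* returns the pre-increment value */
    out[slot] = (uint)get_global_id(0);
}
```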
Edit - Did you verify you are launching the same amount of work items on all platforms?
No, I cannot use atomic_inc with the 'cl_ext_atomic_counters_32' extension; it cannot satisfy my needs.
As for "Edit - Did you verify you are launching the same amount of work items on all platforms?":
Yes, my code displays the calculated results, which depend on the total number of work items. All platforms show the same results for the same NUM_PARALLEL (number of work items) parameter.
BTW, is the on-chip memory you referred to the GPU cache memory?
I made some measurements using two different drivers. I ran your executables and saw some differences; maybe it is only the case on my system. Runtimes are in milliseconds. I made another post describing the same thing with my own kernels. GPU: FirePro W8100 (Hawaii). See the small increase in runtime, VGPRs, and SGPRs.
Driver       Version     | x86: runtime  VGPRs  SGPRs  waves  occupancy | x64: runtime  VGPRs  SGPRs  waves  occupancy
13.352.1014  1.2-1411.4  |      925603   5      13     4      100%      |      911144   5      13     4      100%
14.502.1019  2.0-1642.5  |      965733   6      13     4      100%      |      1100671  9      13     4      100%