I have a very simple program. The data is transferred into GPU memory at the beginning of the
program, and the main program just queues kernel runs with different starting offsets (waiting for
the previous run to finish before queuing another). When the whole range has been executed, it reads
the results back from the card.
It is sad, because AMD's GCN and even older cards beat their Nvidia counterparts greatly in the performance of these calculations, yet something simply does not function properly, so we are forced to use Nvidia hardware.
Problem 1:
The program enqueues a kernel run, waits for it to finish, and then enqueues another. There is significant overhead when starting kernel runs (on Tahiti), even though there shouldn't be any data transfer at all.
A case with a global size of 430932 (rounded up to 431104) takes 36.7 seconds
to run when the kernel is enqueued once. If the kernel is instead enqueued with a global size of 50000 (rounded up to 50176) using offsets and run in 9 pieces, the total runtime is 44.2 seconds. That is 7.5 extra seconds for 8 extra enqueues, so the overhead is almost a second per kernel enqueue on Tahiti. On Cypress the difference is only 42.5 vs 44 seconds.
Note: in the program, the block size is set in defs.h via the WORKSIZE variable.
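For clarity, the rounding mentioned above (430932 -> 431104 and 50000 -> 50176 with a workgroup size of 256) is just the usual round-up to the next multiple of WORKSIZE. A minimal sketch, assuming WORKSIZE carries the value from defs.h:

#define WORKSIZE 256  /* assumed value; the real one comes from defs.h */

/* Round a problem size up to the next multiple of the workgroup size.
   With WORKSIZE = 256: 430932 -> 431104, 50000 -> 50176. */
static size_t round_up_global(size_t n)
{
    return ((n + WORKSIZE - 1) / WORKSIZE) * WORKSIZE;
}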
Problem 2:
This works fine if the whole range is executed all at once, say 0 to 4000000 in a single enqueue.
But if I try 0-50000, 50000-100000, 100000-150000, etc., then after 8-10 enqueues it gets stuck, and the only
way to recover is rebooting the box (yes, the global size was rounded up to a multiple of the worksize). A sketch of this launch pattern is below.
This happens only with Tahiti and NOT with, for example, Cypress (5870) or Nvidia Tesla cards.
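To make the failing pattern concrete, below is a minimal sketch of the chunked launch loop as described above. It is not the program's actual code: queue, kernel, total, and chunk_size are placeholder names, WORKSIZE stands in for the value in defs.h, and error handling is reduced to early returns.

#include <CL/cl.h>

/* Launch the kernel over [0, total) in chunks, waiting for each chunk
   to finish before enqueuing the next. On Tahiti this loop gets stuck
   after 8-10 iterations; on Cypress and Tesla it runs to completion. */
static cl_int run_in_chunks(cl_command_queue queue, cl_kernel kernel,
                            size_t total, size_t chunk_size)
{
    size_t local = WORKSIZE;
    /* Round the chunk up to a multiple of the workgroup size,
       e.g. 50000 -> 50176. */
    size_t chunk = ((chunk_size + WORKSIZE - 1) / WORKSIZE) * WORKSIZE;

    for (size_t offset = 0; offset < total; offset += chunk) {
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1,
                                            &offset,  /* global work offset */
                                            &chunk,   /* global work size   */
                                            &local,   /* local work size    */
                                            0, NULL, NULL);
        if (err != CL_SUCCESS)
            return err;
        err = clFinish(queue);  /* wait before enqueuing the next chunk */
        if (err != CL_SUCCESS)
            return err;
    }
    return CL_SUCCESS;
}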
I am providing the source code if anybody wants to have a look:
Program:
http://users.abo.fi/eyurtese/amd/galaxyz.tgz
Data Files:
http://users.abo.fi/eyurtese/amd/galaxy_data.tgz
The program includes a small Makefile, so it should be easy to build and run (it might require editing). If you
have any problems, please let me know.
Variables for selecting the card etc. are stored in defs.h
The program is run using the following command line (use the correct paths to the data files):
430k test cases:
./reference ../data/m.txt ../data/m_r.txt out.txt
or
./reference ../data/m_s.txt ../data/m_r_s.txt out.txt
4300k test cases:
./reference ../data/m_huge.txt ../data/m_huge_r.txt out.txt
or
./reference ../data/m_huge_s.txt ../data/m_huge_r_s.txt out.txt
The difference between the normal files and the _s files is that the data in the _s files is shuffled, which improves performance slightly (not relevant to the problems). But
you can use any test you like. There are also 50K-sized test files, which I use for very quick runs only.
An example run output is below:
------------------------------------------------------------------------------------------------------------------------------------------------
1 platform found:
-------------------------------------------------------------------------------
platform 0*:
name: AMD Accelerated Parallel Processing
profile: FULL_PROFILE
version: OpenCL 1.2 AMD-APP (923.1)
vendor: Advanced Micro Devices, Inc.
extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
-------------------------------------------------------------------------------
* - Selected
Devices of type GPU :
-------------------------------------------------------------------------------
0* Cypress
1 Tahiti
-------------------------------------------------------------------------------
* - Selected
Device 0 log:
"/tmp/OCLyYW4br.cl", line 150: warning: null (zero) character in input line
ignored
^
Warning: galaxyz kernel has register spilling. Lower performance is expected.
../data/50k.txt contains 50000 lines
first item: 52.660000 10.900000
last item: 10.620000 40.070000
../data/50k_r.txt contains 50000 lines
first item: 76.089050 32.209370
last item: 80.482910 22.944120
Total time for GalaXYZ input data MPI_Bcast = 0.0 seconds
Real 50000 Sim 50000 Hist 257
Getting total number of worker threads
Total number of worker threads 1
Slave node 0 thread 0 sending 0
Master node 0 waiting
Master node 0 received id 0 thread 0
Master node sending 0 25000 to node 0 thread 0
Slave node 0 thread 0 waiting
Slave node 0 thread 0 received 0 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 0 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 25000, remaining lines 0
Master node 0 waiting
Slave node 0 thread 0 offset 0 length 25000 events 1 time 0.36 seconds
Slave node 0 thread 0 finished 0 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 25000 25000 to node 0 thread 0
Master finished. Starting exit procedure...
Slave node 0 thread 0 received 25000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 25000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 50000, remaining lines 0
Slave node 0 thread 0 offset 25000 length 25000 events 1 time 0.23 seconds
Slave node 0 thread 0 finished 25000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Sending exit message to node 0 thread 0
Slave node 0 thread 0 received -1 -1
WALL time for GalaXYZ kernel = 0.6 seconds
MPI WALL time for GalaXYZ kernel = 0.6 seconds
CPU time for GalaXYZ kernel = 0.6 seconds
Doubling DD angle histogram..., histogram count = 1422090528
Calculated = 711020264
>=256 = 538954736
Total = 1249975000
DR angle histogram count = 194504329
Calculated = 194504329
>=256 = 2305495671
Total = 2500000000
Doubling RR angle histogram..., histogram count = 18528234
Calculated = 9239117
>=256 = 1240735883
Total = 1249975000
------------------------------------------------------------------------------------------------------------------------------------------------
Hi yurtsen,
I have seen many people complaining about a performance drop with the 13.1 driver. I already have this issue with a particular test case, but it would certainly help if I could attach more test cases here so that a more appropriate solution can be found.
Could you please attach a test case which shows the performance drop (probably the hsaigol.zip file is appropriate; please confirm)? Also let me know the system details (OS, 32/64-bit, GPUs present, performance observed with the 12.10 and 13.1 drivers).
Thanks for your support.
himanshu, the issue in this thread was already solved in a previous driver release (it's an old thread). Thanks for asking about the performance issue. However, I am not able to attach the problem code to a public forum at this point. Do you know whether AMD developers have already found the reason for it, or do they need more information? (Maybe there is no need to attach the code at all?)
By the way, is AMD interested in looking at OpenCL CPU performance problems? For example, if I had a program which performs much better with the Intel SDK?
Thanks,
Evren
Do you know whether AMD developers have already found the reason for it, or do they need more information? (Maybe there is no need to attach the code at all?)
I just thought you might have another test case showing the performance drop with 13.1. I am not aware whether a bug has already been filed for this issue (and whether it has been fixed).
By the way, is AMD interested in looking at OpenCL CPU performance problems? For example, if I had a program which performs much better with the Intel SDK?
I will ask someone about it and let you know.
Hi yurtsen,
Even though OpenCL is primarily for GPUs, CPU performance issues are also very relevant to AMD. They will be taken care of on a case-by-case basis.