I have a very simple program. The data is transferred into GPU memory at the beginning of the
program, and the main program just queues kernel runs with different starting offsets (waiting for
the previous run to finish before queuing another). When the whole range has been executed, it reads
the results back from the card.
It is sad, because AMD's GCN and even older cards beat their Nvidia counterparts greatly in the performance of these calculations, yet something simply does not function properly, so we are forced to use Nvidia hardware.
Problem 1:
The program enqueues a kernel run, waits for it to finish, and then enqueues another. There is significant overhead when starting kernel runs (on Tahiti), even though there shouldn't be any data transfer at all.
A case with a global size of 430932 (rounded up to 431104) takes 36.7 seconds
to run when the kernel is enqueued once. If the kernel is instead enqueued with a global size of 50000 (rounded up to 50176) using offsets and run in 9 pieces, the total runtime is 44.2 seconds. That is 7.5 extra seconds for 8 extra enqueues, so the overhead is almost a second per kernel enqueue on Tahiti. On Cypress the difference is only 42.5 vs 44 seconds.
Note: in the program, the block size is set in defs.h via the WORKSIZE variable.
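For clarity, the rounding mentioned above (430932 -> 431104 and 50000 -> 50176 with a workgroup size of 256) is just the usual round-up to the next multiple of WORKSIZE. A minimal sketch, assuming WORKSIZE carries the value from defs.h:

#define WORKSIZE 256  /* assumed value; the real one comes from defs.h */

/* Round a problem size up to the next multiple of the workgroup size.
   With WORKSIZE = 256: 430932 -> 431104, 50000 -> 50176. */
static size_t round_up_global(size_t n)
{
    return ((n + WORKSIZE - 1) / WORKSIZE) * WORKSIZE;
}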
Problem 2:
This works fine if the whole range is executed all at once, say 0 to 4000000 in a single enqueue.
But if I try 0-50000, 50000-100000, 100000-150000, etc., then after 8-10 enqueues it gets stuck, and the only
way to recover is rebooting the box (yes, the global size was rounded up to a multiple of the worksize). A sketch of this launch pattern is below.
This happens only with Tahiti and NOT with, for example, Cypress (5870) or Nvidia Tesla cards.
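To make the failing pattern concrete, below is a minimal sketch of the chunked launch loop as described above. It is not the program's actual code: queue, kernel, total, and chunk_size are placeholder names, WORKSIZE stands in for the value in defs.h, and error handling is reduced to early returns.

#include <CL/cl.h>

/* Launch the kernel over [0, total) in chunks, waiting for each chunk
   to finish before enqueuing the next. On Tahiti this loop gets stuck
   after 8-10 iterations; on Cypress and Tesla it runs to completion. */
static cl_int run_in_chunks(cl_command_queue queue, cl_kernel kernel,
                            size_t total, size_t chunk_size)
{
    size_t local = WORKSIZE;
    /* Round the chunk up to a multiple of the workgroup size,
       e.g. 50000 -> 50176. */
    size_t chunk = ((chunk_size + WORKSIZE - 1) / WORKSIZE) * WORKSIZE;

    for (size_t offset = 0; offset < total; offset += chunk) {
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1,
                                            &offset,  /* global work offset */
                                            &chunk,   /* global work size   */
                                            &local,   /* local work size    */
                                            0, NULL, NULL);
        if (err != CL_SUCCESS)
            return err;
        err = clFinish(queue);  /* wait before enqueuing the next chunk */
        if (err != CL_SUCCESS)
            return err;
    }
    return CL_SUCCESS;
}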
I am providing the source code if anybody wants to have a look:
Program:
http://users.abo.fi/eyurtese/amd/galaxyz.tgz
Data Files:
http://users.abo.fi/eyurtese/amd/galaxy_data.tgz
The program includes a small Makefile, so it should be easy to build and run (it might require editing). If you
have any problems, please let me know.
Variables for selecting the card etc. are stored in defs.h
The program is run using the following command line (use the correct paths to the data files):
430k test cases:
./reference ../data/m.txt ../data/m_r.txt out.txt
or
./reference ../data/m_s.txt ../data/m_r_s.txt out.txt
4300k test cases:
./reference ../data/m_huge.txt ../data/m_huge_r.txt out.txt
or
./reference ../data/m_huge_s.txt ../data/m_huge_r_s.txt out.txt
The difference between the normal files and the _s files is that the data in the _s files is shuffled, which improves performance slightly (not relevant to the problems). But
you can use any test you like. There are also 50K-sized test files, which I use for very quick runs only.
An example run output is below:
------------------------------------------------------------------------------------------------------------------------------------------------
1 platform found:
-------------------------------------------------------------------------------
platform 0*:
name: AMD Accelerated Parallel Processing
profile: FULL_PROFILE
version: OpenCL 1.2 AMD-APP (923.1)
vendor: Advanced Micro Devices, Inc.
extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
-------------------------------------------------------------------------------
* - Selected
Devices of type GPU :
-------------------------------------------------------------------------------
0* Cypress
1 Tahiti
-------------------------------------------------------------------------------
* - Selected
Device 0 log:
"/tmp/OCLyYW4br.cl", line 150: warning: null (zero) character in input line
ignored
^
Warning: galaxyz kernel has register spilling. Lower performance is expected.
../data/50k.txt contains 50000 lines
first item: 52.660000 10.900000
last item: 10.620000 40.070000
../data/50k_r.txt contains 50000 lines
first item: 76.089050 32.209370
last item: 80.482910 22.944120
Total time for GalaXYZ input data MPI_Bcast = 0.0 seconds
Real 50000 Sim 50000 Hist 257
Getting total number of worker threads
Total number of worker threads 1
Slave node 0 thread 0 sending 0
Master node 0 waiting
Master node 0 received id 0 thread 0
Master node sending 0 25000 to node 0 thread 0
Slave node 0 thread 0 waiting
Slave node 0 thread 0 received 0 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 0 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 25000, remaining lines 0
Master node 0 waiting
Slave node 0 thread 0 offset 0 length 25000 events 1 time 0.36 seconds
Slave node 0 thread 0 finished 0 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 25000 25000 to node 0 thread 0
Master finished. Starting exit procedure...
Slave node 0 thread 0 received 25000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 25000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 50000, remaining lines 0
Slave node 0 thread 0 offset 25000 length 25000 events 1 time 0.23 seconds
Slave node 0 thread 0 finished 25000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Sending exit message to node 0 thread 0
Slave node 0 thread 0 received -1 -1
WALL time for GalaXYZ kernel = 0.6 seconds
MPI WALL time for GalaXYZ kernel = 0.6 seconds
CPU time for GalaXYZ kernel = 0.6 seconds
Doubling DD angle histogram..., histogram count = 1422090528
Calculated = 711020264
>=256 = 538954736
Total = 1249975000
DR angle histogram count = 194504329
Calculated = 194504329
>=256 = 2305495671
Total = 2500000000
Doubling RR angle histogram..., histogram count = 18528234
Calculated = 9239117
>=256 = 1240735883
Total = 1249975000
------------------------------------------------------------------------------------------------------------------------------------------------
Hi yurtsen,
I have seen many people complaining about a performance drop with the 13.1 driver. I already have this issue with a particular test case, but it would certainly help if I could attach more test cases here so that a more appropriate solution can be found.
Could you please attach a test case which shows the performance drop (probably the hsaigol.zip file is appropriate; please confirm)? Also let me know the system details (OS, 32/64-bit, GPUs present, performance observed with the 12.10 and 13.1 drivers).
Thanks for your support.
himanshu, the issue in this thread was already solved in a previous driver release (it's an old thread). Thanks for asking about the performance issue. However, I am not able to attach the problem code to a public forum at this point. Do you know whether AMD developers have already found the reason for it, or do they need more information? (Maybe there is no need to attach the code at all?)
By the way, is AMD interested in looking at OpenCL CPU performance problems? For example, if I had a program which performs much better with the Intel SDK?
Thanks,
Evren
Do you know whether AMD developers have already found the reason for it, or do they need more information? (Maybe there is no need to attach the code at all?)
I just thought you might have another test case showing the performance drop with 13.1. I am not aware whether a bug has already been filed for this issue (and whether it has been fixed).
By the way, is AMD interested in looking at OpenCL CPU performance problems? For example, if I had a program which performs much better with the Intel SDK?
I will ask someone about it and let you know.
Hi yurtsen,
Even though OpenCL is primarily for GPUs, CPU performance issues are also very relevant to AMD. They will be taken care of on a case-by-case basis.