
Tahiti 7970 lockup no problem in 5870 or Nvidia devices...

Question asked by yurtesen on Aug 11, 2012
Latest reply on Mar 20, 2013 by himanshu.gautam

I have a very simple program. The data is transferred into GPU memory at the beginning of the program, and the main program just queues kernel runs with different starting offsets (it waits for the previous run to finish before queuing another). When the whole range has been executed, it reads the results back from the card.

 

It is sad because AMD's GCN and even older cards greatly outperform their Nvidia counterparts in these calculations, yet something simply does not function properly, so we are forced to use Nvidia hardware.

 

Problem 1:

The program enqueues a kernel run, waits for it to finish, and enqueues another. Starting each kernel run takes a significant amount of time on Tahiti, even though there shouldn't be any data transfer at all.

 

A case with a global size of 430932 (rounded up to 431104) takes 36.7 seconds to run when the kernel is enqueued once. If the kernel is enqueued with a global size of 50000 (rounded up to 50176) using offsets and run in 9 pieces, the total runtime is 44.2 seconds. The overhead is almost a second per kernel enqueue on Tahiti. On Cypress the difference is only 42.5 vs 44 seconds.
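The figures above can be reproduced with a little arithmetic. The following is only a sketch: the helper names are hypothetical and are not taken from the program, and the work-group size of 256 is assumed from the "Using workgroup size 256" line in the run log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of `multiple`
 * (for example the work-group size). */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Hypothetical helper: number of pieces needed to cover `total` items
 * when at most `chunk` items are processed per enqueue. */
size_t num_pieces(size_t total, size_t chunk)
{
    assert(chunk > 0);
    return (total + chunk - 1) / chunk;
}

/* Hypothetical helper: average extra wall time per enqueue, comparing the
 * split run against the single-enqueue run. */
double overhead_per_enqueue(double split_seconds, double single_seconds,
                            size_t pieces)
{
    assert(pieces > 0);
    return (split_seconds - single_seconds) / (double)pieces;
}
```

With a work-group size of 256, round_up(430932, 256) gives 431104 and round_up(50000, 256) gives 50176, matching the numbers above. num_pieces(430932, 50000) gives 9, and overhead_per_enqueue(44.2, 36.7, 9) comes to roughly 0.83 seconds on Tahiti, against roughly 0.17 seconds for the Cypress timings.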

 

Note: In the program, the block size is set in the defs.h file using the variable WORKSIZE.

 

 

Problem 2:

The program works fine if the whole range is executed at once, say 0 to 4000000. But if I try 0-50000, 50000-100000, 100000-150000, etc., then after 8-10 enqueues it gets stuck and the only way to recover is to reboot the box. (Yes, the global size was rounded up to a multiple of the work size.)
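To make the enqueue pattern concrete, here is a minimal sketch of how the per-enqueue offsets and global sizes could be planned. It is not the code from galaxyz.tgz: chunk_t and plan_chunks are hypothetical names, and the idea that the kernel ignores work-items at or beyond `limit` is an assumption suggested by the "localmax" value in the run log. Each entry holds the values that would be passed as global_work_offset and global_work_size to clEnqueueNDRangeKernel.

```c
#include <assert.h>
#include <stddef.h>

/* One planned kernel launch (hypothetical structure). */
typedef struct {
    size_t offset;  /* would be global_work_offset[0] */
    size_t global;  /* would be global_work_size[0], a multiple of worksize */
    size_t limit;   /* first index past the valid items of this chunk */
} chunk_t;

/* Split [0, total) into chunks of at most `chunk` items and round each
 * global size up to a multiple of `worksize`. Writes at most `max_out`
 * entries to `out` and returns the number written. */
size_t plan_chunks(size_t total, size_t chunk, size_t worksize,
                   chunk_t *out, size_t max_out)
{
    assert(chunk > 0 && worksize > 0);
    size_t n = 0;
    for (size_t offset = 0; offset < total && n < max_out; offset += chunk) {
        size_t remaining = total - offset;
        size_t count = remaining < chunk ? remaining : chunk;
        out[n].offset = offset;
        out[n].global = ((count + worksize - 1) / worksize) * worksize;
        out[n].limit = offset + count;
        n++;
    }
    return n;
}
```

For the 50000-line test in the log, plan_chunks(50000, 25000, 256, plan, 4) yields two launches, at offsets 0 and 25000, each with a global size of 25088. Because the rounded global size exceeds the chunk length, the padding work-items of every launch fall outside the chunk, so the kernel would need a bounds check against `limit`.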

 

This happens only with Tahiti and NOT with for example Cypress (5870) or Nvidia Tesla cards.

 

I am providing the source code if anybody wants to have a look:

 

Program:

http://users.abo.fi/eyurtese/amd/galaxyz.tgz

Data Files:

http://users.abo.fi/eyurtese/amd/galaxy_data.tgz

 

 

The program includes a small Makefile, so it should be easy to run (the Makefile might require editing). If you have any problems, please let me know.

 

Variables for selecting the card etc. are stored in defs.h

 

The program is run using the following command line (use the correct paths to the data files):

 

430k test cases:

./reference ../data/m.txt ../data/m_r.txt out.txt

or

./reference ../data/m_s.txt ../data/m_r_s.txt out.txt

 

4300k test cases:

./reference ../data/m_huge.txt ../data/m_huge_r.txt out.txt

or

./reference ../data/m_huge_s.txt ../data/m_huge_r_s.txt out.txt

 

The difference between the normal files and the _s files is that the data is shuffled in the _s files, which improves performance slightly (not relevant to the problems). But you can use any test you like. There are also 50K-sized test files, which I use for very quick runs only.

 

An example run output is below (the problems are described above):

 

------------------------------------------------------------------------------------------------------------------------------------------------

1 platform found:

-------------------------------------------------------------------------------

platform 0*:

        name: AMD Accelerated Parallel Processing

        profile: FULL_PROFILE

        version: OpenCL 1.2 AMD-APP (923.1)

        vendor: Advanced Micro Devices, Inc.

        extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

-------------------------------------------------------------------------------

* - Selected

 

Devices of type GPU :

-------------------------------------------------------------------------------

0*      Cypress

1       Tahiti

-------------------------------------------------------------------------------

* - Selected

Device 0 log:

"/tmp/OCLyYW4br.cl", line 150: warning: null (zero) character in input line

          ignored

 

  ^

 

Warning: galaxyz kernel has register spilling. Lower performance is expected.

 

 

../data/50k.txt contains 50000 lines

     first item: 52.660000 10.900000

      last item: 10.620000 40.070000

../data/50k_r.txt contains 50000 lines

     first item: 76.089050 32.209370

      last item: 80.482910 22.944120

 

Total time for GalaXYZ input data MPI_Bcast =    0.0 seconds

Real 50000 Sim 50000 Hist 257

Getting total number of worker threads

Total number of worker threads 1

Slave node 0 thread 0 sending 0

Master node 0 waiting

Master node 0 received id 0 thread 0

Master node sending 0 25000 to node 0 thread 0

Slave node 0 thread 0 waiting

Slave node 0 thread 0 received 0 25000

 

Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

Node/Dev 0/0: Maximum workgroup size 256

Node/Dev 0/0: Using workgroup size 256

Node/Dev 0/0: Using offset 0 global 25088 with vector size 1

Node/Dev 0/0: First kernel processes 25000 lines with localmax 25000, remaining lines 0

Master node 0 waiting

Slave node 0 thread 0 offset 0 length 25000 events 1 time 0.36 seconds

Slave node 0 thread 0 finished 0 25000

Slave node 0 thread 0 sending 0

Slave node 0 thread 0 waiting

Master node 0 received id 0 thread 0

Master node sending 25000 25000 to node 0 thread 0

Master finished. Starting exit procedure...

Slave node 0 thread 0 received 25000 25000

 

Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

Node/Dev 0/0: Maximum workgroup size 256

Node/Dev 0/0: Using workgroup size 256

Node/Dev 0/0: Using offset 25000 global 25088 with vector size 1

Node/Dev 0/0: First kernel processes 25000 lines with localmax 50000, remaining lines 0

Slave node 0 thread 0 offset 25000 length 25000 events 1 time 0.23 seconds

Slave node 0 thread 0 finished 25000 25000

Slave node 0 thread 0 sending 0

Slave node 0 thread 0 waiting

Sending exit message to node 0 thread 0

Slave node 0 thread 0 received -1 -1

 

WALL time for GalaXYZ kernel =    0.6 seconds

MPI WALL time for GalaXYZ kernel =    0.6 seconds

CPU time for GalaXYZ kernel  =    0.6 seconds

 

Doubling DD angle histogram...,  histogram count = 1422090528

                                 Calculated      = 711020264

                                 >=256           = 538954736

                                 Total           = 1249975000

 

DR angle                         histogram count = 194504329

                                 Calculated      = 194504329

                                 >=256           = 2305495671

                                 Total           = 2500000000

 

Doubling RR angle histogram...,  histogram count = 18528234

                                 Calculated      = 9239117

                                 >=256           = 1240735883

                                 Total           = 1249975000

 

------------------------------------------------------------------------------------------------------------------------------------------------
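As a quick sanity check on the histogram totals in the log, the sketch below recomputes them from the 50000 lines in each input file. Reading "Total" as a pair count is an inference from the numbers, not something the program states, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Unordered pairs within one set of n points: n * (n - 1) / 2. */
uint64_t pairs_within(uint64_t n)
{
    assert(n >= 1);  /* avoids unsigned wrap-around for n == 0 */
    return n * (n - 1) / 2;
}

/* Pairs between two different sets of n and m points: n * m. */
uint64_t pairs_between(uint64_t n, uint64_t m)
{
    return n * m;
}

/* Non-zero when the in-range ("Calculated") and out-of-range (">=256")
 * counts add up to the expected total. */
int counts_consistent(uint64_t calculated, uint64_t out_of_range,
                      uint64_t total)
{
    return calculated + out_of_range == total;
}
```

With n = 50000, pairs_within gives 1249975000, the DD and RR totals, and pairs_between(50000, 50000) gives 2500000000, the DR total. In all three cases "Calculated" plus ">=256" adds up to "Total". The doubled DD and RR histogram counts also equal twice the "Calculated" value plus 50000, which would be consistent with each point additionally being counted once against itself.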
