I have a very simple program. The data is transferred into GPU memory at the beginning of the
program, and the main program just queues kernel runs with different starting offsets (waiting for
the previous run to finish before queuing another run). When the whole range has been executed, it reads the
results back from the card.
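In outline, the host side does something like the following (a simplified sketch only; the identifiers are placeholders rather than the actual names used in the program):

```c
/* Minimal sketch of the enqueue pattern described above (names are
 * placeholders, not the actual identifiers used in the program). */
#include <CL/cl.h>

void run_in_pieces(cl_command_queue queue, cl_kernel kernel,
                   size_t total_lines, size_t piece, size_t worksize)
{
    for (size_t offset = 0; offset < total_lines; offset += piece) {
        size_t length = (total_lines - offset < piece) ? total_lines - offset : piece;

        /* round the global size up to a multiple of the workgroup size,
         * e.g. 430932 -> 431104 and 50000 -> 50176 for WORKSIZE = 256 */
        size_t global = ((length + worksize - 1) / worksize) * worksize;
        size_t local  = worksize;

        /* (per-piece kernel arguments such as the upper limit would be
         *  updated here with clSetKernelArg before enqueuing) */
        clEnqueueNDRangeKernel(queue, kernel, 1,
                               &offset,          /* global work offset */
                               &global, &local,
                               0, NULL, NULL);
        clFinish(queue);   /* wait before enqueuing the next piece */
    }
    /* results are read back once, after the whole range has been processed */
}
```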
It is sad, because AMD's GCN and even older cards beat their Nvidia counterparts greatly in these calculations, yet something simply does not function properly, so we are forced to use Nvidia hardware.
Problem 1:
The program enqueues a kernel run, waits for it to finish, then enqueues another. There is significant overhead when starting kernel runs (on Tahiti), even though there should not be any data transfer at all.
A case with a global size of 430932 (rounded up to 431104) takes 36.7 seconds
to run when the kernel is enqueued once. If the kernel is enqueued with a global size of 50000 using offsets (rounded up to 50176) and run in 9 pieces, the total runtime is 44.2 seconds. That is 7.5 extra seconds spread over 8 extra enqueues, i.e. almost a second of overhead per kernel enqueue on Tahiti. On Cypress the difference is only 42.5 vs 44 seconds.
Note: in the program, the block size is set in the defs.h file via the WORKSIZE variable.
Problem 2:
This works fine if, for example, the whole range is executed all at once, let's say 0 to 4000000 in one go.
But if I try 0-50000, 50000-100000, 150000-200000 etc., then after 8-10 enqueues it gets stuck and the only
way to recover is rebooting the box (yes, the global size was rounded up to a multiple of the worksize).
This happens only with Tahiti and NOT with, for example, Cypress (5870) or Nvidia Tesla cards.
I am providing the source code if anybody wants to have a look:
Program:
http://users.abo.fi/eyurtese/amd/galaxyz.tgz
Data Files:
http://users.abo.fi/eyurtese/amd/galaxy_data.tgz
The program includes a small Makefile; it should be easy to run (might require editing). If you
have any problems, please let me know.
Variables for selecting the card etc. are stored in defs.h
The program is run using the following command line (use the correct paths to the data files):
430k test cases:
./reference ../data/m.txt ../data/m_r.txt out.txt
or
./reference ../data/m_s.txt ../data/m_r_s.txt out.txt
4300k test cases:
./reference ../data/m_huge.txt ../data/m_huge_r.txt out.txt
or
./reference ../data/m_huge_s.txt ../data/m_huge_r_s.txt out.txt
The difference between the normal files and the _s files is that the data in the _s files is shuffled, which improves performance slightly (not relevant to the problems). But
you can use any test you like. There are also 50K-sized test files, which I use for very quick runs only.
An example run output is below (the problems themselves are described above):
------------------------------------------------------------------------------------------------------------------------------------------------
1 platform found:
-------------------------------------------------------------------------------
platform 0*:
name: AMD Accelerated Parallel Processing
profile: FULL_PROFILE
version: OpenCL 1.2 AMD-APP (923.1)
vendor: Advanced Micro Devices, Inc.
extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
-------------------------------------------------------------------------------
* - Selected
Devices of type GPU :
-------------------------------------------------------------------------------
0* Cypress
1 Tahiti
-------------------------------------------------------------------------------
* - Selected
Device 0 log:
"/tmp/OCLyYW4br.cl", line 150: warning: null (zero) character in input line
ignored
^
Warning: galaxyz kernel has register spilling. Lower performance is expected.
../data/50k.txt contains 50000 lines
first item: 52.660000 10.900000
last item: 10.620000 40.070000
../data/50k_r.txt contains 50000 lines
first item: 76.089050 32.209370
last item: 80.482910 22.944120
Total time for GalaXYZ input data MPI_Bcast = 0.0 seconds
Real 50000 Sim 50000 Hist 257
Getting total number of worker threads
Total number of worker threads 1
Slave node 0 thread 0 sending 0
Master node 0 waiting
Master node 0 received id 0 thread 0
Master node sending 0 25000 to node 0 thread 0
Slave node 0 thread 0 waiting
Slave node 0 thread 0 received 0 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 0 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 25000, remaining lines 0
Master node 0 waiting
Slave node 0 thread 0 offset 0 length 25000 events 1 time 0.36 seconds
Slave node 0 thread 0 finished 0 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 25000 25000 to node 0 thread 0
Master finished. Starting exit procedure...
Slave node 0 thread 0 received 25000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 25000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 50000, remaining lines 0
Slave node 0 thread 0 offset 25000 length 25000 events 1 time 0.23 seconds
Slave node 0 thread 0 finished 25000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Sending exit message to node 0 thread 0
Slave node 0 thread 0 received -1 -1
WALL time for GalaXYZ kernel = 0.6 seconds
MPI WALL time for GalaXYZ kernel = 0.6 seconds
CPU time for GalaXYZ kernel = 0.6 seconds
Doubling DD angle histogram..., histogram count = 1422090528
Calculated = 711020264
>=256 = 538954736
Total = 1249975000
DR angle histogram count = 194504329
Calculated = 194504329
>=256 = 2305495671
Total = 2500000000
Doubling RR angle histogram..., histogram count = 18528234
Calculated = 9239117
>=256 = 1240735883
Total = 1249975000
------------------------------------------------------------------------------------------------------------------------------------------------
Well, let me take a look...
Hi Yurtesen,
I took a quick look but I don't have OpenMP; I think the timing and hangup problems are probably different issues.
One question about the hangup: does it get better if you use a workgroup size of 64, or is it the same? The OpenCL memtest problem was bad on the 7970 because the 7970's wavefronts can be fairly independent; a workgroup size of 64 keeps each workgroup to one wavefront.
Drallan, you will also need MPI, but I can provide you a binary if you like. The program runs a master process which tells the slave process(es) which ranges to execute (this is needed when it is run on a multi-GPU, multi-node cluster).
It works fine if I enqueue the whole range in one go. Do you think that sort of problem might appear if I enqueue the kernel several times with non-overlapping, increasing ranges? The input and output data are totally independent of each other in the program.
Nevertheless, I set the workgroup size to 64 and still see the same problem.
Total time for GalaXYZ input data MPI_Bcast = 0.0 seconds
Real 4309320 Sim 4309320 Hist 257
Getting total number of worker threads
Total number of worker threads 1
Master node sending 0 25000 to node 0 thread 0
Slave node 0 thread 0 offset 0 length 25000 events 1 time 18.06 seconds
Master node sending 25000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 25000 length 25000 events 1 time 18.29 seconds
Master node sending 50000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 50000 length 25000 events 1 time 18.42 seconds
Master node sending 75000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 75000 length 25000 events 1 time 18.32 seconds
Master node sending 100000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 100000 length 25000 events 1 time 18.76 seconds
Master node sending 125000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 125000 length 25000 events 1 time 18.94 seconds
Master node sending 150000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 150000 length 25000 events 1 time 19.68 seconds
Master node sending 175000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 175000 length 25000 events 1 time 19.48 seconds
Master node sending 200000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 200000 length 25000 events 1 time 19.14 seconds
Master node sending 225000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 225000 length 25000 events 1 time 19.17 seconds
Master node sending 250000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 250000 length 25000 events 1 time 18.19 seconds
Master node sending 275000 25000 to node 0 thread 0
That's it... it gets stuck and the error is:
[ 8173.163437] [fglrx] ASIC hang happened
[ 8173.163446] Pid: 2511, comm: reference Tainted: P 3.0.0-23-generic #39-Ubuntu
[ 8173.163451] Call Trace:
[ 8173.163523] [<ffffffffa011a0ce>] KCL_DEBUG_OsDump+0xe/0x10 [fglrx]
[ 8173.163574] [<ffffffffa01275ac>] firegl_hardwareHangRecovery+0x1c/0x30 [fglrx]
[ 8173.163669] [<ffffffffa01a0a59>] ? _ZN4Asic9WaitUntil15ResetASICIfHungEv+0x9/0x10 [fglrx]
[ 8173.163761] [<ffffffffa01a09fc>] ? _ZN4Asic9WaitUntil15WaitForCompleteEv+0x9c/0xf0 [fglrx]
[ 8173.163865] [<ffffffffa01b1301>] ? _ZN4Asic19PM4ElapsedTimeStampEj14_LARGE_INTEGER12_QS_CP_RING_+0x141/0x160 [fglrx]
[ 8173.163923] [<ffffffffa01464a2>] ? firegl_trace+0x72/0x1e0 [fglrx]
[ 8173.163980] [<ffffffffa01464a2>] ? firegl_trace+0x72/0x1e0 [fglrx]
[ 8173.164082] [<ffffffffa01a82a3>] ? _ZN15QS_PRIVATE_CORE27multiVpuPM4ElapsedTimeStampEj14_LARGE_INTEGER12_QS_CP_RING_+0x33/0x50 [fglrx]
[ 8173.164226] [<ffffffffa019ffb9>] ? _Z15uQSPM4TimestampmP20_QS_PM4_TS_PACKET_IN+0x69/0x70 [fglrx]
[ 8173.164326] [<ffffffffa019b31d>] ? _Z8uCWDDEQCmjjPvjS_+0x5dd/0x10c0 [fglrx]
[ 8173.164340] [<ffffffff8108747e>] ? down+0x2e/0x50
[ 8173.164405] [<ffffffffa0149baf>] ? firegl_cmmqs_CWDDE_32+0x36f/0x480 [fglrx]
[ 8173.164469] [<ffffffffa014829e>] ? firegl_cmmqs_CWDDE32+0x6e/0x100 [fglrx]
[ 8173.164483] [<ffffffff8128559a>] ? security_capable+0x2a/0x30
[ 8173.164547] [<ffffffffa0148230>] ? firegl_cmmqs_createdriver+0x170/0x170 [fglrx]
[ 8173.164600] [<ffffffffa01232ad>] ? firegl_ioctl+0x1ed/0x250 [fglrx]
[ 8173.164645] [<ffffffffa01139be>] ? ip_firegl_unlocked_ioctl+0xe/0x20 [fglrx]
[ 8173.164658] [<ffffffff8117a96a>] ? do_vfs_ioctl+0x8a/0x340
[ 8173.164671] [<ffffffff810985da>] ? sys_futex+0x10a/0x1a0
[ 8173.164682] [<ffffffff8117acb1>] ? sys_ioctl+0x91/0xa0
[ 8173.164695] [<ffffffff815fd402>] ? system_call_fastpath+0x16/0x1b
[ 8173.164707] pubdev:0xffffffffa0335c80, num of device:1 , name:fglrx, major 8, minor 98.
[ 8173.164718] device 0 : 0xffff88042491c000 .
[ 8173.164727] Asic ID:0x6798, revision:0x5, MMIOReg:0xffffc90015300000.
[ 8173.164737] FB phys addr: 0xc0000000, MC :0xf400000000, Total FB size :0xc0000000.
[ 8173.164746] gart table MC:0xf40f8fd000, Physical:0xcf8fd000, size:0x402000.
[ 8173.164755] mc_node :FB, total 1 zones
[ 8173.164763] MC start:0xf400000000, Physical:0xc0000000, size:0xfd00000.
[ 8173.164773] Mapped heap -- Offset:0x0, size:0xf8fd000, reference count:19, mapping count:0,
[ 8173.164785] Mapped heap -- Offset:0x0, size:0x1000000, reference count:1, mapping count:0,
[ 8173.164795] Mapped heap -- Offset:0xf8fd000, size:0x403000, reference count:1, mapping count:0,
[ 8173.164805] mc_node :INV_FB, total 1 zones
[ 8173.164813] MC start:0xf40fd00000, Physical:0xcfd00000, size:0xb0300000.
[ 8173.164823] Mapped heap -- Offset:0x2f8000, size:0x8000, reference count:1, mapping count:0,
[ 8173.164834] Mapped heap -- Offset:0xb02f4000, size:0xc000, reference count:1, mapping count:0,
[ 8173.164845] mc_node :GART_USWC, total 3 zones
[ 8173.164852] MC start:0xffa0100000, Physical:0x0, size:0x50000000.
[ 8173.164862] Mapped heap -- Offset:0x0, size:0x2000000, reference count:16, mapping count:0,
[ 8173.164872] mc_node :GART_CACHEABLE, total 3 zones
[ 8173.164881] MC start:0xff70400000, Physical:0x0, size:0x2fd00000.
[ 8173.164890] Mapped heap -- Offset:0xc00000, size:0x100000, reference count:2, mapping count:0,
[ 8173.164901] Mapped heap -- Offset:0xb00000, size:0x100000, reference count:1, mapping count:0,
[ 8173.164912] Mapped heap -- Offset:0x200000, size:0x900000, reference count:3, mapping count:0,
[ 8173.164923] Mapped heap -- Offset:0x0, size:0x200000, reference count:5, mapping count:0,
[ 8173.164934] Mapped heap -- Offset:0xef000, size:0x11000, reference count:1, mapping count:0,
[ 8173.164945] GRBM : 0xa0407028, SRBM : 0x200000c0 .
[ 8173.164956] CP_RB_BASE : 0xffa01000, CP_RB_RPTR : 0x7330 , CP_RB_WPTR :0x7330.
[ 8173.164967] CP_IB1_BUFSZ:0x0, CP_IB1_BASE_HI:0xff, CP_IB1_BASE_LO:0xa0851000.
[ 8173.164976] last submit IB buffer -- MC :0xffa0851000,phys:0x4ece000.
[ 8173.164992] Dump the trace queue.
[ 8173.164999] End of dump
In this case, it took about 200-220 seconds (after 11 enqueues) for the problem to appear. If I increase the range sent to the slave to 50000, then it takes about 270-300 seconds (after 8 enqueues).
So, can you provide a link to the MPI binary?
No, I meant the whole program compiled. But I am not sure how that would help. You can get MPI freely on any Linux operating system. Use mpich2, for example (you can also use your favourite package manager, apt-get, yum etc., to install it easily on your system):
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
Also, OpenMP is part of GCC (I just remembered), so I am not sure how come drallan does not have it:
http://gcc.gnu.org/wiki/openmp
As of GCC 4.2, the compiler implements version 2.5 of the OpenMP standard, and as of 4.4 it implements version 3.0 of the OpenMP standard. OpenMP 3.1 is supported since GCC 4.7.
Oh, OK. I'd like to have a copy of the binary if you don't mind, and I am going to install mpich2 in order to repeat what you've seen.
Also, OpenMP is part of GCC (I just remembered), so I am not sure how come drallan does not have it:
http://gcc.gnu.org/wiki/openmp
Is it? Then I should have it. Guess I was too busy writing AMD assembly. I'll take a look tomorrow.
drallan wrote:
Is it? Then I should have it. Guess I was too busy writing AMD assembly.
I'll take a look tomorrow.
I would appreciate it a lot, but you will still need MPI to be able to compile the program...
This does sound familiar! We have no probs with a single Cayman (or 3 x GTX580 using mpich2). But link with OpenMPI and the proc on a single Tahiti box dies. Easy fix for us was to link with mpich2 on the Tahiti box.
nnunn@ausport.gov.au wrote:
This does sound familiar! We have no probs with a single Cayman (or 3 x GTX580 using mpich2). But link with OpenMPI and the proc on a single Tahiti box dies. Easy fix for us was to link with mpich2 on the Tahiti box.
I don't quite understand what you mean. What do you mean by "dies" exactly?
Also, we were talking about OpenMP, not OpenMPI, and I used OpenMP to be able to enqueue to multiple devices concurrently, using multiple contexts with threads. In any case, the problem also occurs on a single box with a single GPU.
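Roughly, the pattern looks like the sketch below (illustrative only; the identifiers are placeholders and the real code in opencl.cpp is more involved):

```c
/* Rough sketch of the OpenMP multi-device pattern described above
 * (illustrative only; not the actual code from the program). */
#include <CL/cl.h>
#include <omp.h>

void run_on_all_devices(cl_device_id *devices, int num_devices)
{
    #pragma omp parallel num_threads(num_devices)
    {
        int dev = omp_get_thread_num();

        /* each thread owns its own context and queue for one device */
        cl_context ctx = clCreateContext(NULL, 1, &devices[dev],
                                         NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, devices[dev], 0, NULL);

        /* ... build program, create kernel, enqueue the ranges handed
         *     out by the MPI master for this device ... */

        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
    }
}
```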
sh run.sh
rm -f *.o reference
mpicc -Wall -O3 -g -I./include/ -I/usr/include/mpich2-x86_64 -I. -fopenmp -c -o reference.o reference.cpp
mpicc -Wall -O3 -g -I./include/ -I/usr/include/mpich2-x86_64 -I. -fopenmp -c -o opencl.o opencl.cpp
mpicc -o reference reference.o opencl.o -I./include/ -I/usr/include/mpich2-x86_64 -I. -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich -lmpl -lgomp
1 platform found:
-------------------------------------------------------------------------------
platform 0*:
name: AMD Accelerated Parallel Processing
profile: FULL_PROFILE
version: OpenCL 1.2 AMD-APP (938.1)
vendor: Advanced Micro Devices, Inc.
extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
-------------------------------------------------------------------------------
* - Selected
Devices of type GPU :
-------------------------------------------------------------------------------
0* Cayman
1 Cayman
-------------------------------------------------------------------------------
* - Selected
Device 0 log:
"/tmp/OCLxkgZoq.cl", line 150: warning: null (zero) character in input line
ignored
^
Warning: galaxyz kernel has register spilling. Lower performance is expected.
../data/m.txt contains 430932 lines
first item: 52.660000 10.900000
last item: 86.260002 8.090000
../data/m_r.txt contains 430932 lines
first item: 76.089050 32.209370
last item: 27.345739 38.801189
Total time for GalaXYZ input data MPI_Bcast = 0.0 seconds
Real 430932 Sim 430932 Hist 257
Getting total number of worker threads
Total number of worker threads 1
Slave node 0 thread 0 sending 0
Master node 0 waiting
Master node 0 received id 0 thread 0
Master node sending 0 25000 to node 0 thread 0
Slave node 0 thread 0 waiting
Slave node 0 thread 0 received 0 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 0 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 25000, remaining lines 0
Master node 0 waiting
Slave node 0 thread 0 offset 0 length 25000 events 1 time 5.85 seconds
Slave node 0 thread 0 finished 0 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 25000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 25000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 25000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 50000, remaining lines 0
Slave node 0 thread 0 offset 25000 length 25000 events 1 time 5.58 seconds
Slave node 0 thread 0 finished 25000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 50000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 50000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 50000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 75000, remaining lines 0
Slave node 0 thread 0 offset 50000 length 25000 events 1 time 5.36 seconds
Slave node 0 thread 0 finished 50000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 75000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 75000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 75000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 100000, remaining lines 0
Slave node 0 thread 0 offset 75000 length 25000 events 1 time 5.15 seconds
Slave node 0 thread 0 finished 75000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 100000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 100000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 100000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 125000, remaining lines 0
Slave node 0 thread 0 offset 100000 length 25000 events 1 time 4.95 seconds
Slave node 0 thread 0 finished 100000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 125000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 125000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 125000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 150000, remaining lines 0
Slave node 0 thread 0 offset 125000 length 25000 events 1 time 4.70 seconds
Slave node 0 thread 0 finished 125000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 150000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 150000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 150000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 175000, remaining lines 0
Slave node 0 thread 0 offset 150000 length 25000 events 1 time 4.46 seconds
Slave node 0 thread 0 finished 150000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 175000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 175000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 175000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 200000, remaining lines 0
Slave node 0 thread 0 offset 175000 length 25000 events 1 time 4.23 seconds
Slave node 0 thread 0 finished 175000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 200000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 200000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 200000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 225000, remaining lines 0
Slave node 0 thread 0 offset 200000 length 25000 events 1 time 3.99 seconds
Slave node 0 thread 0 finished 200000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 225000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 225000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 225000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 250000, remaining lines 0
Slave node 0 thread 0 offset 225000 length 25000 events 1 time 3.73 seconds
Slave node 0 thread 0 finished 225000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 250000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 250000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 250000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 275000, remaining lines 0
Slave node 0 thread 0 offset 250000 length 25000 events 1 time 3.47 seconds
Slave node 0 thread 0 finished 250000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 275000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 275000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 275000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 300000, remaining lines 0
Slave node 0 thread 0 offset 275000 length 25000 events 1 time 3.25 seconds
Slave node 0 thread 0 finished 275000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 300000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 300000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 300000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 325000, remaining lines 0
Slave node 0 thread 0 offset 300000 length 25000 events 1 time 3.02 seconds
Slave node 0 thread 0 finished 300000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 325000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 325000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 325000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 350000, remaining lines 0
Slave node 0 thread 0 offset 325000 length 25000 events 1 time 2.79 seconds
Slave node 0 thread 0 finished 325000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 350000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 350000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 350000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 375000, remaining lines 0
Slave node 0 thread 0 offset 350000 length 25000 events 1 time 2.55 seconds
Slave node 0 thread 0 finished 350000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 375000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 375000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 375000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 400000, remaining lines 0
Slave node 0 thread 0 offset 375000 length 25000 events 1 time 2.31 seconds
Slave node 0 thread 0 finished 375000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 400000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 400000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 400000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 425000, remaining lines 0
Slave node 0 thread 0 offset 400000 length 25000 events 1 time 2.07 seconds
Slave node 0 thread 0 finished 400000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 425000 5932 to node 0 thread 0
Master finished. Starting exit procedure...
Slave node 0 thread 0 received 425000 5932
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 425000 global 6144 with vector size 1
Node/Dev 0/0: First kernel processes 5932 lines with localmax 430932, remaining lines 0
Slave node 0 thread 0 offset 425000 length 5932 events 1 time 0.49 seconds
Slave node 0 thread 0 finished 425000 5932
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Sending exit message to node 0 thread 0
Slave node 0 thread 0 received -1 -1
WALL time for GalaXYZ kernel = 68.1 seconds
MPI WALL time for GalaXYZ kernel = 68.1 seconds
CPU time for GalaXYZ kernel = 68.1 seconds
Doubling DD angle histogram..., histogram count = 114799465750
Calculated = 57399517409
>=256 = 35451461437
Total = 92850978846
DR angle histogram count = 98903437674
Calculated = 98903437674
>=256 = 86798950950
Total = 185702388624
Doubling RR angle histogram..., histogram count = 94429502254
Calculated = 47214535661
>=256 = 45636443185
Total = 92850978846
27.62user 41.80system 1:09.43elapsed 99%CPU (0avgtext+0avgdata 258208maxresident)k
24696inputs+944outputs (1major+21819minor)pagefaults 0swaps
So the lockup problem didn't occur on Cayman, as you can see above, right? I am using Ubuntu / SDK 2.7, btw.
binying wrote:
So the lockup problem didn't occur on Cayman, as you can see above, right? I am using Ubuntu / SDK 2.7, btw.
Thanks for testing it. It also doesn't occur with Cypress. It only locks up with Tahiti (I recommend testing it with Tahiti if you have one). The program seems to work fine on anything other than Tahiti as far as I can tell.
Also, you can have a go with the larger data set; it takes much longer to process each piece:
./reference ../data/m_huge.txt ../data/m_huge_r.txt out.txt
The question is, is this a driver/SDK bug on Tahiti? If yes, how can I get AMD to take action?
I'll ask a Tahiti person to jump into this discussion.
Update: an internal ticket has been filed to put this issue in the queue for additional research.
Thank you Kristen. Please let me know if AMD needs more information or anything else for fixing this issue.
I have been having a similar problem with OpenMP.
liwoog, OpenMP is not mentioned in the thread you linked?
True, I failed to mention it there. But I am using OpenMP.
liwoog wrote:
True, I failed to mention it there. But I am using OpenMP.
I think I had a version of the code which did not use OpenMP. I will check that out and report back, but I am not sure how that could cause the problem I am having... we shall see...
I probably disabled OpenMP at some point too, just to check why the GPU was hanging, which is why I did not mention it in my post. I read in another post that the hang was due to killing a process that had allocated more than 256MB on the card. I mostly gave up on the cards because of it.
One more thing to add: running through the AMD Profiler seems to prevent the program from hanging at runtime. It will still hang if interrupted.
I am not killing any processes. I will make some changes in the code to produce a non-OpenMP version, and I also think I have an idea about something else to test. I will let you guys know when I get around to testing them. Still, it is strange that only Tahiti crashes...
My first thought was a threading issue (this is why I mentioned our experience with OpenMPI). If both OpenMP and OpenMPI are causing issues for Tahiti (but not Cayman or Cypress, while mpich2 just works), maybe some 'feature' in the way both OpenMP and OpenMPI handle their threads is interfering with Tahiti optimizations?
I made a non-OpenMP/MPI version and it still crashes only on Tahiti, in exactly the same way. I think the issue is not related to OpenMP/MPI.
That will certainly make it easier!
Can you post this version of the code? I'll see what happens in the windows environment.
drallan
Yes, but I have to warn you, it is an ugly hack of a different version of the code, the output is not exactly the same, etc. Let me know if you have any problems, and thank you!
By the way, the execution steps probably overlap a little, because the steps are not an exact multiple of the workgroup size and I did not put an if statement inside the kernel to exit if the thread id is past the last item of the step. For example, for 0 to 25000, global id 25002 can be executed even though it is larger than 25000, and in the next step it is executed again. It just means some items are calculated twice (and the results are wrong, but for a test program I don't care). Anyway, the point is that this shouldn't be helping the GPU to crash and burn.
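For reference, the missing guard would look roughly like this (a sketch only; the parameter names are made up and do not match the real galaxyz kernel signature):

```c
/* Hypothetical guard against the overlap described above; the parameter
 * names are placeholders, not the real galaxyz kernel arguments. */
__kernel void galaxyz_piece(__global const float *data,
                            __global uint *histogram,
                            const uint last_line)   /* offset + length of this piece */
{
    const size_t gid = get_global_id(0);   /* already includes the enqueue offset */

    /* skip the padding work-items created by rounding the global size up */
    if (gid >= last_line)
        return;

    /* ... normal per-line work ... */
}
```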
Beautiful ugly hack, it only took a few minutes to get it to compile. I'll try running it tomorrow.
1. Are you compiling a 64-bit version?
I can compile either way, but mingw requires "long long" for a 64-bit variable in either 32- or 64-bit mode.
Something to do with Windows.
drallan
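On the "long long" point: Windows (and therefore mingw) uses the LLP64 data model, where long stays 32 bits even in 64-bit builds, so a fixed-width type is the portable way to hold counts of the size seen in the logs. A tiny illustration (not code from the program):

```c
/* long is 64-bit on 64-bit Linux (LP64) but only 32-bit on Windows (LLP64),
 * so counts that can exceed 2^32 should use a fixed-width type. */
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t histogram_count = 114799465750ULL;   /* value taken from the 4.3M log above */
    printf("sizeof(long)=%zu sizeof(long long)=%zu count=%" PRIu64 "\n",
           sizeof(long), sizeof(long long), histogram_count);
    return 0;
}
```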
I am using 64-bit and compiling on Ubuntu 11.10. I guess one other thing to look at is whether you are getting the same results at each run.
I thought you would get exactly 10 times more with the 4.3M case compared to the 430K case, but the version of the program you have is a quick hack and some elements are processed twice, so this is probably not the case for you. The first program I posted should perhaps give exactly 10 times larger results, though.
I have tried this on newer internal drivers on Ubuntu 10.04 64-bit and I was able to loop the 4.3 million version for over 72 hours (68 loops). So I hope a future driver will fix the issue for you.
Also, as a side question: if I compare the out.txt from run to run, should I see differences?
Well, it is difficult to say whether the results should be the same or not. It depends on whether the hardware or threads somehow re-order the floating-point operations at each run. Can that happen? I am not an expert on what OpenCL does internally... If yes, then due to differences in the order of operations, slight differences can occur. If not... let me know.
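As a toy illustration of the point about operation order (not code from the program), summing the same single-precision values in a different order can already change the result slightly:

```c
/* Toy demonstration that the order of floating-point additions matters:
 * summing the same values forwards and backwards gives slightly
 * different single-precision results. */
#include <stdio.h>

int main(void)
{
    float forward = 0.0f, backward = 0.0f;

    for (int i = 1; i <= 100000; i++)
        forward += 1.0f / (float)i;
    for (int i = 100000; i >= 1; i--)
        backward += 1.0f / (float)i;

    /* the two sums typically differ in the last few digits */
    printf("forward  = %.6f\nbackward = %.6f\n", forward, backward);
    return 0;
}
```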
Do you mean that you re-ran the 4.3 million version over and over for 72 hours? Thanks for that (I only ask because a single run shouldn't take that long).
I was just working on a version without MPI and OpenMP, just for testing this issue, and I might be able to finish it tomorrow. Do you know what the problem was exactly?
Yes, I was running the 4.3 million version over and over for a total of 72 hours.
I have no idea whether the operations will be re-ordered or not; I am only executing your code and testing whether it fails. This was more of a personal exercise, and also, from how you explain it, your code shouldn't have caused hangs in the first place.
I have been running my code on the latest available drivers, and while the code runs fine with one set of parameters, it hangs with another in which the kernels take longer to run. I believe that something happens after a kernel runs for over 5 minutes.
Thanks for testing, but how do I get the latest internal drivers for myself?
About the slightly different results: I am getting the same results on Cypress (I think I got different results earlier today because the card was quite a bit overclocked). I will re-test on Tahiti later tomorrow (hopefully), but the results should be the same, I think.
Sorry for the earlier confusing comment I made about different results. I am now getting the same results at factory clocks as well.
Hi Yurtesen,
When I run the 50k input file I get consistent results every time. Even when I run the m.txt and m_r.txt input files I get consistent output between runs, but when I run the 4.3 million file I get different outputs from run to run. *scratching head*
Is there an expected output file I can compare the output data to, to see what is going on? Can you provide one?
What about the 430k file? I was testing with the 430k file. I tried it on Tahiti and it gives consistent results on Tahiti as well.
But you seem to be correct: I am getting different results with the 4.3M file. I will run it on some Nvidia cards and an AMD Cypress and let you know if I get different results there too. It might take a few days.
Comparison | Runs |
---|---|
match | 2, 3, 4, 5, 6, 8, 9, 10, 12, 14, 16, 17 |
diff from above | 1, 7, 11, 13, 15 |
sub match | 1, 11, 13 |
hsaigol, I will run it on some other devices (Nvidia, CPU etc.) and report back to you. I believe the 4.3M file results should be exactly 10 times larger than the 430K results, but it appears that this is rarely the case.
I'll wait for your reply; if I get a chance I will try it on a 78xx GPU as well and see what happens.
Also, what version of the driver are you using?
Can you open CCC and check, under the information tab, the exact information for the driver, more specifically the driver packaging version? Thanks.
Meanwhile, just out of curiosity, if possible, can you have a look at the version I attached without OpenMP/MPI? That might be a better test case since it is less complicated.
Hi Yurtesen,
I have run the 430932-size case ~50 times and do not see any hang; I've run the huge file a couple of times and see no hang either.
What I do see is excellent Tahiti performance. Have you timed your new (non-MPI) program?
Tahiti is running about 4.5X faster than my Cayman: Tahiti 10.1 seconds vs Cayman 46.9 seconds.
Tahiti huge problem: about 1003 seconds (100X for the N*N problem).
The 'huge' output numbers are roughly 100 times larger, as expected, with slight differences after dividing by 100.
So, I see nothing unusual so far, though I am curious about your run time for the new code.
Also, are you still seeing register spilling?
BTW, I'm running the Tahitis at 1200 MHz.
TAHITI
-------------------------------------------------------------------------------
Real 430932 Sim 430932 Hist 257
Using workgroup size 256
Using global size 431104
Running OpenCL GalaXYZ
Queueing part 0 - 25000 of 431104... Kernel finished 1.531
Queueing part 25000 - 50000 of 431104... Kernel finished 0.870
Queueing part 50000 - 75000 of 431104... Kernel finished 0.799
[.....]
Completed OpenCL GalaXYZ
WALL time for GalaXYZ kernel = 10.1 seconds
CPU time for GalaXYZ kernel = 10.1 seconds
Doubling DD angle histogram..., histogram count = 169741846286
Calculated = 84870707677
>=256 = 0
Total = 84870707677
DR angle histogram count = 169146070638
Calculated = 169146070638
>=256 = 0
Total = 169146070638
Doubling RR angle histogram..., histogram count = 168527020850
Calculated = 84263294959
>=256 = 0
Total = 84263294959
CAYMAN
-------------------------------------------------------------------------------
Real 430932 Sim 430932 Hist 257
Using workgroup size 256
Using global size 431104
Running OpenCL GalaXYZ
Queueing part 0 - 25000 of 431104... Kernel finished 1.521
Queueing part 25000 - 50000 of 431104... Kernel finished 4.133
Queueing part 50000 - 75000 of 431104... Kernel finished 3.874
Queueing part 75000 - 100000 of 431104... Kernel finished 3.816
Queueing part 100000 - 125000 of 431104... Kernel finished 3.689
[.....]
Completed OpenCL GalaXYZ
WALL time for GalaXYZ kernel = 46.9 seconds
CPU time for GalaXYZ kernel = 46.9 seconds
Doubling DD angle histogram..., histogram count = 169741846286
Calculated = 84870707677
>=256 = 0
Total = 84870707677
DR angle histogram count = 169146070638
Calculated = 169146070638
>=256 = 0
Total = 169146070638
Doubling RR angle histogram..., histogram count = 168527020850
Calculated = 84263294959
>=256 = 0
Total = 84263294959
-------------------------------------------------------------------------------