I have a very simple program. The data is transferred into GPU memory at the beginning of the
program, and the main program just queues kernel runs with different starting offsets (waiting for
the previous run to finish before queuing another run). When the whole range has been executed, it reads the
results back from the card.
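In outline, the host side does something like the following (a simplified sketch only; the identifiers are placeholders rather than the actual names used in the program):

```c
/* Minimal sketch of the enqueue pattern described above (names are
 * placeholders, not the actual identifiers used in the program). */
#include <CL/cl.h>

void run_in_pieces(cl_command_queue queue, cl_kernel kernel,
                   size_t total_lines, size_t piece, size_t worksize)
{
    for (size_t offset = 0; offset < total_lines; offset += piece) {
        size_t length = (total_lines - offset < piece) ? total_lines - offset : piece;

        /* round the global size up to a multiple of the workgroup size,
         * e.g. 430932 -> 431104 and 50000 -> 50176 for WORKSIZE = 256 */
        size_t global = ((length + worksize - 1) / worksize) * worksize;
        size_t local  = worksize;

        /* (per-piece kernel arguments such as the upper limit would be
         *  updated here with clSetKernelArg before enqueuing) */
        clEnqueueNDRangeKernel(queue, kernel, 1,
                               &offset,          /* global work offset */
                               &global, &local,
                               0, NULL, NULL);
        clFinish(queue);   /* wait before enqueuing the next piece */
    }
    /* results are read back once, after the whole range has been processed */
}
```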
It is sad, because AMD's GCN and even older cards beat their Nvidia counterparts greatly in these calculations, yet something simply does not function properly, so we are forced to use Nvidia hardware.
Problem 1:
The program enqueues a kernel run, waits for it to finish, then enqueues another. There is significant overhead when starting kernel runs (on Tahiti), even though there should not be any data transfer at all.
A case with a global size of 430932 (rounded up to 431104) takes 36.7 seconds
to run when the kernel is enqueued once. If the kernel is enqueued with a global size of 50000 using offsets (rounded up to 50176) and run in 9 pieces, the total runtime is 44.2 seconds. That is 7.5 extra seconds spread over 8 extra enqueues, i.e. almost a second of overhead per kernel enqueue on Tahiti. On Cypress the difference is only 42.5 vs 44 seconds.
Note: in the program, the block size is set in the defs.h file via the WORKSIZE variable.
Problem 2:
This works fine if, for example, the whole range is executed all at once, let's say 0 to 4000000 in one go.
But if I try 0-50000, 50000-100000, 150000-200000 etc., then after 8-10 enqueues it gets stuck and the only
way to recover is rebooting the box (yes, the global size was rounded up to a multiple of the worksize).
This happens only with Tahiti and NOT with, for example, Cypress (5870) or Nvidia Tesla cards.
I am providing the source code if anybody wants to have a look:
Program:
http://users.abo.fi/eyurtese/amd/galaxyz.tgz
Data Files:
http://users.abo.fi/eyurtese/amd/galaxy_data.tgz
The program includes a small Makefile; it should be easy to run (might require editing). If you
have any problems, please let me know.
Variables for selecting the card etc. are stored in defs.h
The program is run using the following command line (use the correct paths to the data files):
430k test cases:
./reference ../data/m.txt ../data/m_r.txt out.txt
or
./reference ../data/m_s.txt ../data/m_r_s.txt out.txt
4300k test cases:
./reference ../data/m_huge.txt ../data/m_huge_r.txt out.txt
or
./reference ../data/m_huge_s.txt ../data/m_huge_r_s.txt out.txt
The difference between the normal files and the _s files is that the data in the _s files is shuffled, which improves performance slightly (not relevant to the problems). But
you can use any test you like. There are also 50K-sized test files, which I use for very quick runs only.
An example run output is below (the problems themselves are described above):
------------------------------------------------------------------------------------------------------------------------------------------------
1 platform found:
-------------------------------------------------------------------------------
platform 0*:
name: AMD Accelerated Parallel Processing
profile: FULL_PROFILE
version: OpenCL 1.2 AMD-APP (923.1)
vendor: Advanced Micro Devices, Inc.
extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
-------------------------------------------------------------------------------
* - Selected
Devices of type GPU :
-------------------------------------------------------------------------------
0* Cypress
1 Tahiti
-------------------------------------------------------------------------------
* - Selected
Device 0 log:
"/tmp/OCLyYW4br.cl", line 150: warning: null (zero) character in input line
ignored
^
Warning: galaxyz kernel has register spilling. Lower performance is expected.
../data/50k.txt contains 50000 lines
first item: 52.660000 10.900000
last item: 10.620000 40.070000
../data/50k_r.txt contains 50000 lines
first item: 76.089050 32.209370
last item: 80.482910 22.944120
Total time for GalaXYZ input data MPI_Bcast = 0.0 seconds
Real 50000 Sim 50000 Hist 257
Getting total number of worker threads
Total number of worker threads 1
Slave node 0 thread 0 sending 0
Master node 0 waiting
Master node 0 received id 0 thread 0
Master node sending 0 25000 to node 0 thread 0
Slave node 0 thread 0 waiting
Slave node 0 thread 0 received 0 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 0 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 25000, remaining lines 0
Master node 0 waiting
Slave node 0 thread 0 offset 0 length 25000 events 1 time 0.36 seconds
Slave node 0 thread 0 finished 0 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 25000 25000 to node 0 thread 0
Master finished. Starting exit procedure...
Slave node 0 thread 0 received 25000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 25000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 50000, remaining lines 0
Slave node 0 thread 0 offset 25000 length 25000 events 1 time 0.23 seconds
Slave node 0 thread 0 finished 25000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Sending exit message to node 0 thread 0
Slave node 0 thread 0 received -1 -1
WALL time for GalaXYZ kernel = 0.6 seconds
MPI WALL time for GalaXYZ kernel = 0.6 seconds
CPU time for GalaXYZ kernel = 0.6 seconds
Doubling DD angle histogram..., histogram count = 1422090528
Calculated = 711020264
>=256 = 538954736
Total = 1249975000
DR angle histogram count = 194504329
Calculated = 194504329
>=256 = 2305495671
Total = 2500000000
Doubling RR angle histogram..., histogram count = 18528234
Calculated = 9239117
>=256 = 1240735883
Total = 1249975000
------------------------------------------------------------------------------------------------------------------------------------------------
Well, let me take a look...
Hi Yurtesen,
I took a quick look but I don't have OpenMP; I think the timing and hangup problems are probably different issues.
One question about the hangup: does it get better if you use a workgroup size of 64, or is it the same? The OpenCL memtest problem was bad on the 7970 because the 7970's wavefronts can be fairly independent; a workgroup size of 64 keeps each workgroup to one wavefront.
Drallan, you will also need MPI, but I can provide you a binary if you like. The program runs a master process which tells the slave process(es) which ranges to execute (this is needed when it is run on a multi-GPU, multi-node cluster).
It works fine if I enqueue the whole range in one go. Do you think that sort of problem might appear if I enqueue the kernel several times with non-overlapping, increasing ranges? The input and output data are totally independent of each other in the program.
Nevertheless, I set the workgroup size to 64 and still see the same problem.
Total time for GalaXYZ input data MPI_Bcast = 0.0 seconds
Real 4309320 Sim 4309320 Hist 257
Getting total number of worker threads
Total number of worker threads 1
Master node sending 0 25000 to node 0 thread 0
Slave node 0 thread 0 offset 0 length 25000 events 1 time 18.06 seconds
Master node sending 25000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 25000 length 25000 events 1 time 18.29 seconds
Master node sending 50000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 50000 length 25000 events 1 time 18.42 seconds
Master node sending 75000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 75000 length 25000 events 1 time 18.32 seconds
Master node sending 100000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 100000 length 25000 events 1 time 18.76 seconds
Master node sending 125000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 125000 length 25000 events 1 time 18.94 seconds
Master node sending 150000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 150000 length 25000 events 1 time 19.68 seconds
Master node sending 175000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 175000 length 25000 events 1 time 19.48 seconds
Master node sending 200000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 200000 length 25000 events 1 time 19.14 seconds
Master node sending 225000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 225000 length 25000 events 1 time 19.17 seconds
Master node sending 250000 25000 to node 0 thread 0
Slave node 0 thread 0 offset 250000 length 25000 events 1 time 18.19 seconds
Master node sending 275000 25000 to node 0 thread 0
That's it... it gets stuck and the error is:
[ 8173.163437] [fglrx] ASIC hang happened
[ 8173.163446] Pid: 2511, comm: reference Tainted: P 3.0.0-23-generic #39-Ubuntu
[ 8173.163451] Call Trace:
[ 8173.163523] [<ffffffffa011a0ce>] KCL_DEBUG_OsDump+0xe/0x10 [fglrx]
[ 8173.163574] [<ffffffffa01275ac>] firegl_hardwareHangRecovery+0x1c/0x30 [fglrx]
[ 8173.163669] [<ffffffffa01a0a59>] ? _ZN4Asic9WaitUntil15ResetASICIfHungEv+0x9/0x10 [fglrx]
[ 8173.163761] [<ffffffffa01a09fc>] ? _ZN4Asic9WaitUntil15WaitForCompleteEv+0x9c/0xf0 [fglrx]
[ 8173.163865] [<ffffffffa01b1301>] ? _ZN4Asic19PM4ElapsedTimeStampEj14_LARGE_INTEGER12_QS_CP_RING_+0x141/0x160 [fglrx]
[ 8173.163923] [<ffffffffa01464a2>] ? firegl_trace+0x72/0x1e0 [fglrx]
[ 8173.163980] [<ffffffffa01464a2>] ? firegl_trace+0x72/0x1e0 [fglrx]
[ 8173.164082] [<ffffffffa01a82a3>] ? _ZN15QS_PRIVATE_CORE27multiVpuPM4ElapsedTimeStampEj14_LARGE_INTEGER12_QS_CP_RING_+0x33/0x50 [fglrx]
[ 8173.164226] [<ffffffffa019ffb9>] ? _Z15uQSPM4TimestampmP20_QS_PM4_TS_PACKET_IN+0x69/0x70 [fglrx]
[ 8173.164326] [<ffffffffa019b31d>] ? _Z8uCWDDEQCmjjPvjS_+0x5dd/0x10c0 [fglrx]
[ 8173.164340] [<ffffffff8108747e>] ? down+0x2e/0x50
[ 8173.164405] [<ffffffffa0149baf>] ? firegl_cmmqs_CWDDE_32+0x36f/0x480 [fglrx]
[ 8173.164469] [<ffffffffa014829e>] ? firegl_cmmqs_CWDDE32+0x6e/0x100 [fglrx]
[ 8173.164483] [<ffffffff8128559a>] ? security_capable+0x2a/0x30
[ 8173.164547] [<ffffffffa0148230>] ? firegl_cmmqs_createdriver+0x170/0x170 [fglrx]
[ 8173.164600] [<ffffffffa01232ad>] ? firegl_ioctl+0x1ed/0x250 [fglrx]
[ 8173.164645] [<ffffffffa01139be>] ? ip_firegl_unlocked_ioctl+0xe/0x20 [fglrx]
[ 8173.164658] [<ffffffff8117a96a>] ? do_vfs_ioctl+0x8a/0x340
[ 8173.164671] [<ffffffff810985da>] ? sys_futex+0x10a/0x1a0
[ 8173.164682] [<ffffffff8117acb1>] ? sys_ioctl+0x91/0xa0
[ 8173.164695] [<ffffffff815fd402>] ? system_call_fastpath+0x16/0x1b
[ 8173.164707] pubdev:0xffffffffa0335c80, num of device:1 , name:fglrx, major 8, minor 98.
[ 8173.164718] device 0 : 0xffff88042491c000 .
[ 8173.164727] Asic ID:0x6798, revision:0x5, MMIOReg:0xffffc90015300000.
[ 8173.164737] FB phys addr: 0xc0000000, MC :0xf400000000, Total FB size :0xc0000000.
[ 8173.164746] gart table MC:0xf40f8fd000, Physical:0xcf8fd000, size:0x402000.
[ 8173.164755] mc_node :FB, total 1 zones
[ 8173.164763] MC start:0xf400000000, Physical:0xc0000000, size:0xfd00000.
[ 8173.164773] Mapped heap -- Offset:0x0, size:0xf8fd000, reference count:19, mapping count:0,
[ 8173.164785] Mapped heap -- Offset:0x0, size:0x1000000, reference count:1, mapping count:0,
[ 8173.164795] Mapped heap -- Offset:0xf8fd000, size:0x403000, reference count:1, mapping count:0,
[ 8173.164805] mc_node :INV_FB, total 1 zones
[ 8173.164813] MC start:0xf40fd00000, Physical:0xcfd00000, size:0xb0300000.
[ 8173.164823] Mapped heap -- Offset:0x2f8000, size:0x8000, reference count:1, mapping count:0,
[ 8173.164834] Mapped heap -- Offset:0xb02f4000, size:0xc000, reference count:1, mapping count:0,
[ 8173.164845] mc_node :GART_USWC, total 3 zones
[ 8173.164852] MC start:0xffa0100000, Physical:0x0, size:0x50000000.
[ 8173.164862] Mapped heap -- Offset:0x0, size:0x2000000, reference count:16, mapping count:0,
[ 8173.164872] mc_node :GART_CACHEABLE, total 3 zones
[ 8173.164881] MC start:0xff70400000, Physical:0x0, size:0x2fd00000.
[ 8173.164890] Mapped heap -- Offset:0xc00000, size:0x100000, reference count:2, mapping count:0,
[ 8173.164901] Mapped heap -- Offset:0xb00000, size:0x100000, reference count:1, mapping count:0,
[ 8173.164912] Mapped heap -- Offset:0x200000, size:0x900000, reference count:3, mapping count:0,
[ 8173.164923] Mapped heap -- Offset:0x0, size:0x200000, reference count:5, mapping count:0,
[ 8173.164934] Mapped heap -- Offset:0xef000, size:0x11000, reference count:1, mapping count:0,
[ 8173.164945] GRBM : 0xa0407028, SRBM : 0x200000c0 .
[ 8173.164956] CP_RB_BASE : 0xffa01000, CP_RB_RPTR : 0x7330 , CP_RB_WPTR :0x7330.
[ 8173.164967] CP_IB1_BUFSZ:0x0, CP_IB1_BASE_HI:0xff, CP_IB1_BASE_LO:0xa0851000.
[ 8173.164976] last submit IB buffer -- MC :0xffa0851000,phys:0x4ece000.
[ 8173.164992] Dump the trace queue.
[ 8173.164999] End of dump
In this case, it took about 200-220 seconds (after 11 enqueues) for the problem to appear. If I increase the range sent to the slave to 50000, then it takes about 270-300 seconds (after 8 enqueues).
So, can you provide a link to the MPI binary?
No, I meant the whole program compiled. But I am not sure how that would help. You can get MPI freely on any Linux operating system. Use mpich2, for example (you can also use your favourite package manager, apt-get, yum etc., to install it easily on your system):
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
Also, OpenMP is part of GCC (I just remembered), so I am not sure how come drallan does not have it:
http://gcc.gnu.org/wiki/openmp
As of GCC 4.2, the compiler implements version 2.5 of the OpenMP standard, and as of 4.4 it implements version 3.0 of the OpenMP standard. OpenMP 3.1 is supported since GCC 4.7.
Oh, OK. I'd like to have a copy of the binary if you don't mind, and I am going to install mpich2 in order to repeat what you've seen.
Also, OpenMP is part of GCC (I just remembered), so I am not sure how come drallan does not have it:
http://gcc.gnu.org/wiki/openmp
Is it? Then I should have it. Guess I was too busy writing AMD assembly. I'll take a look tomorrow.
drallan wrote:
Is it? Then I should have it. Guess I was too busy writing AMD assembly.
I'll take a look tomorrow.
I would appreciate it a lot, but you will still need MPI to be able to compile the program...
This does sound familiar! We have no probs with a single Cayman (or 3 x GTX580 using mpich2). But link with OpenMPI and the proc on a single Tahiti box dies. Easy fix for us was to link with mpich2 on the Tahiti box.
nnunn@ausport.gov.au wrote:
This does sound familiar! We have no probs with a single Cayman (or 3 x GTX580 using mpich2). But link with OpenMPI and the proc on a single Tahiti box dies. Easy fix for us was to link with mpich2 on the Tahiti box.
I don't quite understand what you mean. What do you mean by "dies" exactly?
Also, we were talking about OpenMP, not OpenMPI, and I used OpenMP to be able to enqueue to multiple devices concurrently, using multiple contexts with threads. In any case, the problem also occurs on a single box with a single GPU.
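Roughly, the pattern looks like the sketch below (illustrative only; the identifiers are placeholders and the real code in opencl.cpp is more involved):

```c
/* Rough sketch of the OpenMP multi-device pattern described above
 * (illustrative only; not the actual code from the program). */
#include <CL/cl.h>
#include <omp.h>

void run_on_all_devices(cl_device_id *devices, int num_devices)
{
    #pragma omp parallel num_threads(num_devices)
    {
        int dev = omp_get_thread_num();

        /* each thread owns its own context and queue for one device */
        cl_context ctx = clCreateContext(NULL, 1, &devices[dev],
                                         NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, devices[dev], 0, NULL);

        /* ... build program, create kernel, enqueue the ranges handed
         *     out by the MPI master for this device ... */

        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
    }
}
```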
sh run.sh
rm -f *.o reference
mpicc -Wall -O3 -g -I./include/ -I/usr/include/mpich2-x86_64 -I. -fopenmp -c -o reference.o reference.cpp
mpicc -Wall -O3 -g -I./include/ -I/usr/include/mpich2-x86_64 -I. -fopenmp -c -o opencl.o opencl.cpp
mpicc -o reference reference.o opencl.o -I./include/ -I/usr/include/mpich2-x86_64 -I. -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich -lmpl -lgomp
1 platform found:
-------------------------------------------------------------------------------
platform 0*:
name: AMD Accelerated Parallel Processing
profile: FULL_PROFILE
version: OpenCL 1.2 AMD-APP (938.1)
vendor: Advanced Micro Devices, Inc.
extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
-------------------------------------------------------------------------------
* - Selected
Devices of type GPU :
-------------------------------------------------------------------------------
0* Cayman
1 Cayman
-------------------------------------------------------------------------------
* - Selected
Device 0 log:
"/tmp/OCLxkgZoq.cl", line 150: warning: null (zero) character in input line
ignored
^
Warning: galaxyz kernel has register spilling. Lower performance is expected.
../data/m.txt contains 430932 lines
first item: 52.660000 10.900000
last item: 86.260002 8.090000
../data/m_r.txt contains 430932 lines
first item: 76.089050 32.209370
last item: 27.345739 38.801189
Total time for GalaXYZ input data MPI_Bcast = 0.0 seconds
Real 430932 Sim 430932 Hist 257
Getting total number of worker threads
Total number of worker threads 1
Slave node 0 thread 0 sending 0
Master node 0 waiting
Master node 0 received id 0 thread 0
Master node sending 0 25000 to node 0 thread 0
Slave node 0 thread 0 waiting
Slave node 0 thread 0 received 0 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 0 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 25000, remaining lines 0
Master node 0 waiting
Slave node 0 thread 0 offset 0 length 25000 events 1 time 5.85 seconds
Slave node 0 thread 0 finished 0 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 25000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 25000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 25000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 50000, remaining lines 0
Slave node 0 thread 0 offset 25000 length 25000 events 1 time 5.58 seconds
Slave node 0 thread 0 finished 25000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 50000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 50000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 50000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 75000, remaining lines 0
Slave node 0 thread 0 offset 50000 length 25000 events 1 time 5.36 seconds
Slave node 0 thread 0 finished 50000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 75000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 75000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 75000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 100000, remaining lines 0
Slave node 0 thread 0 offset 75000 length 25000 events 1 time 5.15 seconds
Slave node 0 thread 0 finished 75000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 100000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 100000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 100000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 125000, remaining lines 0
Slave node 0 thread 0 offset 100000 length 25000 events 1 time 4.95 seconds
Slave node 0 thread 0 finished 100000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 125000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 125000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 125000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 150000, remaining lines 0
Slave node 0 thread 0 offset 125000 length 25000 events 1 time 4.70 seconds
Slave node 0 thread 0 finished 125000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 150000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 150000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 150000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 175000, remaining lines 0
Slave node 0 thread 0 offset 150000 length 25000 events 1 time 4.46 seconds
Slave node 0 thread 0 finished 150000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 175000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 175000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 175000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 200000, remaining lines 0
Slave node 0 thread 0 offset 175000 length 25000 events 1 time 4.23 seconds
Slave node 0 thread 0 finished 175000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 200000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 200000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 200000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 225000, remaining lines 0
Slave node 0 thread 0 offset 200000 length 25000 events 1 time 3.99 seconds
Slave node 0 thread 0 finished 200000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 225000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 225000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 225000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 250000, remaining lines 0
Slave node 0 thread 0 offset 225000 length 25000 events 1 time 3.73 seconds
Slave node 0 thread 0 finished 225000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 250000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 250000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 250000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 275000, remaining lines 0
Slave node 0 thread 0 offset 250000 length 25000 events 1 time 3.47 seconds
Slave node 0 thread 0 finished 250000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 275000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 275000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 275000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 300000, remaining lines 0
Slave node 0 thread 0 offset 275000 length 25000 events 1 time 3.25 seconds
Slave node 0 thread 0 finished 275000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 300000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 300000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 300000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 325000, remaining lines 0
Slave node 0 thread 0 offset 300000 length 25000 events 1 time 3.02 seconds
Slave node 0 thread 0 finished 300000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 325000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 325000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 325000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 350000, remaining lines 0
Slave node 0 thread 0 offset 325000 length 25000 events 1 time 2.79 seconds
Slave node 0 thread 0 finished 325000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 350000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 350000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 350000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 375000, remaining lines 0
Slave node 0 thread 0 offset 350000 length 25000 events 1 time 2.55 seconds
Slave node 0 thread 0 finished 350000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 375000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 375000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 375000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 400000, remaining lines 0
Slave node 0 thread 0 offset 375000 length 25000 events 1 time 2.31 seconds
Slave node 0 thread 0 finished 375000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 400000 25000 to node 0 thread 0
Master node 0 waiting
Slave node 0 thread 0 received 400000 25000
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 400000 global 25088 with vector size 1
Node/Dev 0/0: First kernel processes 25000 lines with localmax 425000, remaining lines 0
Slave node 0 thread 0 offset 400000 length 25000 events 1 time 2.07 seconds
Slave node 0 thread 0 finished 400000 25000
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Master node 0 received id 0 thread 0
Master node sending 425000 5932 to node 0 thread 0
Master finished. Starting exit procedure...
Slave node 0 thread 0 received 425000 5932
Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0
Node/Dev 0/0: Maximum workgroup size 256
Node/Dev 0/0: Using workgroup size 256
Node/Dev 0/0: Using offset 425000 global 6144 with vector size 1
Node/Dev 0/0: First kernel processes 5932 lines with localmax 430932, remaining lines 0
Slave node 0 thread 0 offset 425000 length 5932 events 1 time 0.49 seconds
Slave node 0 thread 0 finished 425000 5932
Slave node 0 thread 0 sending 0
Slave node 0 thread 0 waiting
Sending exit message to node 0 thread 0
Slave node 0 thread 0 received -1 -1
WALL time for GalaXYZ kernel = 68.1 seconds
MPI WALL time for GalaXYZ kernel = 68.1 seconds
CPU time for GalaXYZ kernel = 68.1 seconds
Doubling DD angle histogram..., histogram count = 114799465750
Calculated = 57399517409
>=256 = 35451461437
Total = 92850978846
DR angle histogram count = 98903437674
Calculated = 98903437674
>=256 = 86798950950
Total = 185702388624
Doubling RR angle histogram..., histogram count = 94429502254
Calculated = 47214535661
>=256 = 45636443185
Total = 92850978846
27.62user 41.80system 1:09.43elapsed 99%CPU (0avgtext+0avgdata 258208maxresident)k
24696inputs+944outputs (1major+21819minor)pagefaults 0swaps
So the lockup problem didn't occur on Cayman, as you can see above, right? I am using Ubuntu / SDK 2.7, btw.
binying wrote:
So the lockup problem didn't occur on Cayman, as you can see above, right? I am using Ubuntu / SDK 2.7, btw.
Thanks for testing it. It also doesn't occur with Cypress. It only locks up with Tahiti (I recommend testing it with Tahiti if you have one). The program seems to work fine on anything other than Tahiti as far as I can tell.
Also, you can have a go with the larger data set; it takes much longer to process each piece:
./reference ../data/m_huge.txt ../data/m_huge_r.txt out.txt
The question is, is this a driver/SDK bug on Tahiti? If yes, how can I get AMD to take action?
I'll ask a Tahiti person to jump into this discussion.
Update: an internal ticket has been filed to put this issue in the queue for additional research.
Thank you Kristen. Please let me know if AMD needs more information or anything else for fixing this issue.
I have been having a similar problem with OpenMP.
liwoog, OpenMP is not mentioned in the thread you linked?
True, I failed to mention it there. But I am using OpenMP.
liwoog wrote:
True, I failed to mention it there. But I am using OpenMP.
I think I had a version of the code which did not use OpenMP. I will check that out and report back, but I am not sure how that could cause the problem I am having... we shall see...
I probably disabled OpenMP at some point too, just to check why the GPU was hanging, which is why I did not mention it in my post. I read in another post that the hang was due to killing a process that had allocated more than 256MB on the card. I mostly gave up on the cards because of it.
One more thing to add: running through the AMD Profiler seems to prevent the program from hanging at runtime. It will still hang if interrupted.
I am not killing any processes. I will make some changes in the code to produce a non-OpenMP version, and I also think I have an idea about something else to test. I will let you guys know when I get around to testing them. Still, it is strange that only Tahiti crashes...
My first thought was a threading issue (this is why I mentioned our experience with OpenMPI). If both OpenMP and OpenMPI are causing issues for Tahiti (but not Cayman or Cypress, while mpich2 just works), maybe some 'feature' in the way both OpenMP and OpenMPI handle their threads is interfering with Tahiti optimizations?
I made a non-OpenMP/MPI version and it still crashes only on Tahiti, in exactly the same way. I think the issue is not related to OpenMP/MPI.
That will certainly make it easier!
Can you post this version of the code? I'll see what happens in the windows environment.
drallan
Yes, but I have to warn you, it is an ugly hack of a different version of the code, the output is not exactly the same, etc. Let me know if you have any problems, and thank you!
By the way, the execution steps probably overlap a little, because the steps are not an exact multiple of the workgroup size and I did not put an if statement inside the kernel to exit if the thread id is past the last item of the step. For example, for 0 to 25000, global id 25002 can be executed even though it is larger than 25000, and in the next step it is executed again. It just means some items are calculated twice (and the results are wrong, but for a test program I don't care). Anyway, the point is that this shouldn't be helping the GPU to crash and burn.
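For reference, the missing guard would look roughly like this (a sketch only; the parameter names are made up and do not match the real galaxyz kernel signature):

```c
/* Hypothetical guard against the overlap described above; the parameter
 * names are placeholders, not the real galaxyz kernel arguments. */
__kernel void galaxyz_piece(__global const float *data,
                            __global uint *histogram,
                            const uint last_line)   /* offset + length of this piece */
{
    const size_t gid = get_global_id(0);   /* already includes the enqueue offset */

    /* skip the padding work-items created by rounding the global size up */
    if (gid >= last_line)
        return;

    /* ... normal per-line work ... */
}
```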
Beautiful ugly hack, it only took a few minutes to get it to compile. I'll try running it tomorrow.
1. Are you compiling a 64-bit version?
I can compile either way, but mingw requires "long long" for a 64-bit variable in either 32- or 64-bit mode.
Something to do with Windows.
drallan
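On the "long long" point: Windows (and therefore mingw) uses the LLP64 data model, where long stays 32 bits even in 64-bit builds, so a fixed-width type is the portable way to hold counts of the size seen in the logs. A tiny illustration (not code from the program):

```c
/* long is 64-bit on 64-bit Linux (LP64) but only 32-bit on Windows (LLP64),
 * so counts that can exceed 2^32 should use a fixed-width type. */
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t histogram_count = 114799465750ULL;   /* value taken from the 4.3M log above */
    printf("sizeof(long)=%zu sizeof(long long)=%zu count=%" PRIu64 "\n",
           sizeof(long), sizeof(long long), histogram_count);
    return 0;
}
```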
I am using 64-bit and compiling on Ubuntu 11.10. I guess one other thing to look at is whether you are getting the same results at each run.
I thought you would get exactly 10 times more with the 4.3M case compared to the 430K case, but the version of the program you have is a quick hack and some elements are processed twice, so this is probably not the case for you. The first program I posted should perhaps give exactly 10 times larger results, though.
I have tried this on newer internal drivers on Ubuntu 10.04 64-bit and I was able to loop the 4.3 million version for over 72 hours (68 loops). So I hope a future driver will fix the issue for you.
Also, as a side question: if I compare the out.txt from run to run, should I see differences?
Well, it is difficult to say whether the results should be the same or not. It depends on whether the hardware or threads somehow re-order the floating-point operations at each run. Can that happen? I am not an expert on what OpenCL does internally... If yes, then due to differences in the order of operations, slight differences can occur. If not... let me know.
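As a toy illustration of the point about operation order (not code from the program), summing the same single-precision values in a different order can already change the result slightly:

```c
/* Toy demonstration that the order of floating-point additions matters:
 * summing the same values forwards and backwards gives slightly
 * different single-precision results. */
#include <stdio.h>

int main(void)
{
    float forward = 0.0f, backward = 0.0f;

    for (int i = 1; i <= 100000; i++)
        forward += 1.0f / (float)i;
    for (int i = 100000; i >= 1; i--)
        backward += 1.0f / (float)i;

    /* the two sums typically differ in the last few digits */
    printf("forward  = %.6f\nbackward = %.6f\n", forward, backward);
    return 0;
}
```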
Do you mean that you re-ran the 4.3 million version over and over for 72 hours? Thanks for that (I only ask because a single run shouldn't take that long).
I was just working on a version without MPI and OpenMP, just for testing this issue, and I might be able to finish it tomorrow. Do you know what the problem was exactly?
Yes, I was running the 4.3 million version over and over for a total of 72 hours.
I have no idea whether the operations will be re-ordered or not; I am only executing your code and testing whether it fails. This was more of a personal exercise, and also, from how you explain it, your code shouldn't have caused hangs in the first place.
I have been running my code on the latest available drivers, and while the code runs fine with one set of parameters, it hangs with another in which the kernels take longer to run. I believe that something happens after a kernel runs for over 5 minutes.
Thanks for testing, but how do I get the latest internal drivers for myself?
About the slightly different results: I am getting the same results on Cypress (I think I got different results earlier today because the card was quite a bit overclocked). I will re-test on Tahiti later tomorrow (hopefully), but the results should be the same, I think.
Sorry for the earlier confusing comment I made about different results. I am now getting the same results at factory clocks as well.
Hi Yurtesen,
When I run the 50k input file I get consistent results every time. Even when I run the m.txt and m_r.txt input files I get consistent output between runs, but when I run the 4.3 million file I get different outputs from run to run. *scratching head*
Is there an expected output file I can compare the output data to, to see what is going on? Can you provide one?
What about the 430k file? I was testing with the 430k file. I tried it on Tahiti and it gives consistent results on Tahiti as well.
But you seem to be correct: I am getting different results with the 4.3M file. I will run it on some Nvidia cards and an AMD Cypress and let you know if I get different results there too. It might take a few days.
Comparison | Runs |
---|---|
match | 2, 3, 4, 5, 6, 8, 9, 10, 12, 14, 16, 17 |
diff from above | 1, 7, 11, 13, 15 |
sub match | 1, 11, 13 |
hsaigol, I will run it on some other devices (Nvidia, CPU etc.) and report back to you. I believe the 4.3M file results should be exactly 10 times larger than the 430K results, but it appears that this is rarely the case.
I'll wait for your reply; if I get a chance I will try it on a 78xx GPU as well and see what happens.
Also, what version of the driver are you using?
Can you open CCC and check, under the information tab, the exact information for the driver, more specifically the driver packaging version? Thanks.
Meanwhile, just out of curiosity, if possible, can you have a look at the version I attached without OpenMP/MPI? That might be a better test case since it is less complicated.
Hi Yurtesen,
I have run the 430932-size case ~50 times and do not see any hang; I've run the huge file a couple of times and see no hang either.
What I do see is excellent Tahiti performance. Have you timed your new (non-MPI) program?
Tahiti is running about 4.5X faster than my Cayman: Tahiti 10.1 seconds vs Cayman 46.9 seconds.
Tahiti huge problem: about 1003 seconds (100X for the N*N problem).
The 'huge' output numbers are roughly 100 times larger, as expected, with slight differences after dividing by 100.
So, I see nothing unusual so far, though I am curious about your run time for the new code.
Also, are you still seeing register spilling?
BTW, I'm running the Tahitis at 1200 MHz.
TAHITI
-------------------------------------------------------------------------------
Real 430932 Sim 430932 Hist 257
Using workgroup size 256
Using global size 431104
Running OpenCL GalaXYZ
Queueing part 0 - 25000 of 431104... Kernel finished 1.531
Queueing part 25000 - 50000 of 431104... Kernel finished 0.870
Queueing part 50000 - 75000 of 431104... Kernel finished 0.799
[.....]
Completed OpenCL GalaXYZ
WALL time for GalaXYZ kernel = 10.1 seconds
CPU time for GalaXYZ kernel = 10.1 seconds
Doubling DD angle histogram..., histogram count = 169741846286
Calculated = 84870707677
>=256 = 0
Total = 84870707677
DR angle histogram count = 169146070638
Calculated = 169146070638
>=256 = 0
Total = 169146070638
Doubling RR angle histogram..., histogram count = 168527020850
Calculated = 84263294959
>=256 = 0
Total = 84263294959
CAYMAN
-------------------------------------------------------------------------------
Real 430932 Sim 430932 Hist 257
Using workgroup size 256
Using global size 431104
Running OpenCL GalaXYZ
Queueing part 0 - 25000 of 431104... Kernel finished 1.521
Queueing part 25000 - 50000 of 431104... Kernel finished 4.133
Queueing part 50000 - 75000 of 431104... Kernel finished 3.874
Queueing part 75000 - 100000 of 431104... Kernel finished 3.816
Queueing part 100000 - 125000 of 431104... Kernel finished 3.689
[.....]
Completed OpenCL GalaXYZ
WALL time for GalaXYZ kernel = 46.9 seconds
CPU time for GalaXYZ kernel = 46.9 seconds
Doubling DD angle histogram..., histogram count = 169741846286
Calculated = 84870707677
>=256 = 0
Total = 84870707677
DR angle histogram count = 169146070638
Calculated = 169146070638
>=256 = 0
Total = 169146070638
Doubling RR angle histogram..., histogram count = 168527020850
Calculated = 84263294959
>=256 = 0
Total = 84263294959
-------------------------------------------------------------------------------