83 Replies Latest reply on Mar 20, 2013 4:54 AM by himanshu.gautam

    Tahiti 7970 lockup no problem in 5870 or Nvidia devices...

    yurtesen

      I have a very simple program. The data is transferred to GPU memory at the start of the program, and the main program just queues kernel runs with different starting offsets (it waits for the previous run to finish before queuing the next one). When the whole range has been executed, it reads the results back from the card.

       

      It is sad because AMD's GCN and even older cards greatly outperform their Nvidia counterparts on these calculations, yet something simply does not work properly, so we are forced to use Nvidia hardware.

       

      Problem1:

      The program enqueues a kernel run, waits for it to finish, and enqueues another. Starting each kernel run takes significant time on Tahiti, even though there shouldn't be any data transfer at all.

       

      A case with a global size of 430932 (rounded up to 431104) takes 36.7 seconds to run when the kernel is enqueued once. If the kernel is enqueued with a global size of 50000 (rounded up to 50176) using offsets and run in 9 pieces, the total runtime is 44.2 seconds. The overhead is almost a second per kernel enqueue on Tahiti. On Cypress the difference is only 42.5 vs 44 seconds.

       

      Note: In the program, the block size is set in the defs.h file through the WORKSIZE variable.
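The numbers above can be checked with a little arithmetic. The following is only a sketch of that check, not code from the program: the helper name `round_up` is made up for illustration, and the work-group size of 256 is taken from the example run log further down (it is the value that reproduces the rounded sizes quoted in the post).

```python
def round_up(n, multiple):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


# Assumption: work-group size 256, as reported in the example run log.
WORKSIZE = 256

full_global = round_up(430932, WORKSIZE)   # whole range in one enqueue
chunk_global = round_up(50000, WORKSIZE)   # one 50000-item piece

# Number of pieces needed to cover 430932 items in steps of 50000
# (ceiling division).
pieces = -(-430932 // 50000)

# Extra wall time of the chunked run, spread over the enqueues (Tahiti).
overhead_per_enqueue = (44.2 - 36.7) / pieces

print(full_global, chunk_global, pieces, round(overhead_per_enqueue, 2))
```

With the timings quoted in the post this comes to roughly 0.8 seconds of extra time per enqueue, which is where "almost a second" comes from.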

       

       

      Problem2:

      This works fine if the whole range is executed at once, say 0 to 4000000. But if I try 0-50000, 50000-100000, 100000-150000 etc., then after 8-10 enqueues it gets stuck and the only way to recover is to reboot the box. (Yes, the global size was rounded up to a multiple of the work size.)
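To make the enqueue pattern concrete, here is a rough sketch of how a range can be cut into consecutive, non-overlapping pieces, each with its own offset and padded global size. It is not the program's actual code: `split_range` and its parameters are hypothetical names, and the work-group size of 256 is assumed from the run log.

```python
def split_range(total, chunk, worksize):
    """Yield (offset, length, global_size) for consecutive chunks.

    `length` is the number of real items in the chunk; `global_size`
    is `length` rounded up to a multiple of `worksize`. The padding
    work-items at the end of a chunk have no data of their own, so a
    kernel launched this way has to skip them with a bounds check.
    """
    offset = 0
    while offset < total:
        length = min(chunk, total - offset)
        global_size = -(-length // worksize) * worksize  # round up
        yield offset, length, global_size
        offset += length


# Assumed values: the 430k test case, 50000-item pieces, WORKSIZE 256.
chunks = list(split_range(430932, 50000, 256))
for offset, length, global_size in chunks:
    print(f"offset {offset} length {length} global {global_size}")
```

Each tuple corresponds to one kernel enqueue: the offset would go in as the starting offset and the padded size as the global size, and the host waits for that run to finish before issuing the next one.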

       

      This happens only with Tahiti and NOT with for example Cypress (5870) or Nvidia Tesla cards.

       

      I am providing the source code if anybody wants to have a look:

       

      Program:

      http://users.abo.fi/eyurtese/amd/galaxyz.tgz

      Data Files:

      http://users.abo.fi/eyurtese/amd/galaxy_data.tgz

       

       

      The program includes a small Makefile, so it should be easy to run (it might require editing). If you have any problems, please let me know.

       

      Variables for selecting the card etc. are stored in defs.h

       

      The program is run using the following command lines (use the correct paths to the data files):

       

      430k test cases:

      ./reference ../data/m.txt ../data/m_r.txt out.txt

      or

      ./reference ../data/m_s.txt ../data/m_r_s.txt out.txt

       

      4300k test cases:

      ./reference ../data/m_huge.txt ../data/m_huge_r.txt out.txt

      or

      ./reference ../data/m_huge_s.txt ../data/m_huge_r_s.txt out.txt

       

      The difference between the normal files and the _s files is that the data in the _s files is shuffled, which improves performance slightly (not relevant to the problems). But you can use any test you like. There are also 50K-sized test files, which I use for very quick runs only.

       

      An example run output is below:

       

      ------------------------------------------------------------------------------------------------------------------------------------------------

      1 platform found:

      -------------------------------------------------------------------------------

      platform 0*:

              name: AMD Accelerated Parallel Processing

              profile: FULL_PROFILE

              version: OpenCL 1.2 AMD-APP (923.1)

              vendor: Advanced Micro Devices, Inc.

              extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

      -------------------------------------------------------------------------------

      * - Selected

       

      Devices of type GPU :

      -------------------------------------------------------------------------------

      0*      Cypress

      1       Tahiti

      -------------------------------------------------------------------------------

      * - Selected

      Device 0 log:

      "/tmp/OCLyYW4br.cl", line 150: warning: null (zero) character in input line

                ignored

       

        ^

       

      Warning: galaxyz kernel has register spilling. Lower performance is expected.

       

       

      ../data/50k.txt contains 50000 lines

           first item: 52.660000 10.900000

            last item: 10.620000 40.070000

      ../data/50k_r.txt contains 50000 lines

           first item: 76.089050 32.209370

            last item: 80.482910 22.944120

       

      Total time for GalaXYZ input data MPI_Bcast =    0.0 seconds

      Real 50000 Sim 50000 Hist 257

      Getting total number of worker threads

      Total number of worker threads 1

      Slave node 0 thread 0 sending 0

      Master node 0 waiting

      Master node 0 received id 0 thread 0

      Master node sending 0 25000 to node 0 thread 0

      Slave node 0 thread 0 waiting

      Slave node 0 thread 0 received 0 25000

       

      Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

      Node/Dev 0/0: Maximum workgroup size 256

      Node/Dev 0/0: Using workgroup size 256

      Node/Dev 0/0: Using offset 0 global 25088 with vector size 1

      Node/Dev 0/0: First kernel processes 25000 lines with localmax 25000, remaining lines 0

      Master node 0 waiting

      Slave node 0 thread 0 offset 0 length 25000 events 1 time 0.36 seconds

      Slave node 0 thread 0 finished 0 25000

      Slave node 0 thread 0 sending 0

      Slave node 0 thread 0 waiting

      Master node 0 received id 0 thread 0

      Master node sending 25000 25000 to node 0 thread 0

      Master finished. Starting exit procedure...

      Slave node 0 thread 0 received 25000 25000

       

      Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

      Node/Dev 0/0: Maximum workgroup size 256

      Node/Dev 0/0: Using workgroup size 256

      Node/Dev 0/0: Using offset 25000 global 25088 with vector size 1

      Node/Dev 0/0: First kernel processes 25000 lines with localmax 50000, remaining lines 0

      Slave node 0 thread 0 offset 25000 length 25000 events 1 time 0.23 seconds

      Slave node 0 thread 0 finished 25000 25000

      Slave node 0 thread 0 sending 0

      Slave node 0 thread 0 waiting

      Sending exit message to node 0 thread 0

      Slave node 0 thread 0 received -1 -1

       

      WALL time for GalaXYZ kernel =    0.6 seconds

      MPI WALL time for GalaXYZ kernel =    0.6 seconds

      CPU time for GalaXYZ kernel  =    0.6 seconds

       

      Doubling DD angle histogram...,  histogram count = 1422090528

                                       Calculated      = 711020264

                                       >=256           = 538954736

                                       Total           = 1249975000

       

      DR angle                         histogram count = 194504329

                                       Calculated      = 194504329

                                       >=256           = 2305495671

                                       Total           = 2500000000

       

      Doubling RR angle histogram...,  histogram count = 18528234

                                       Calculated      = 9239117

                                       >=256           = 1240735883

                                       Total           = 1249975000

       

      ------------------------------------------------------------------------------------------------------------------------------------------------

        • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
          binying

          well, let me take a look...

          • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
            drallan

            Hi Yurtesen,

             

            I took a quick look but don't have OpenMP. Still, I think the timing and hangup problems are probably different.

            One question about the hangup: does it get better if you use a workgroup size of 64, or is it the same? The open clmemtest problem was bad on the 7970 because the 7970's wavefronts can be fairly independent; a workgroup size of 64 keeps each workgroup at one wavefront.

              • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                yurtesen

                Drallan, you will also need MPI, but I can provide you with a binary if you like. The program runs a master process which tells the slave process(es) which ranges to execute (this is needed when it runs on a multi-GPU, multi-node cluster).

                 

                It works fine if I enqueue the whole range in one go. Do you think that sort of problem might appear if I enqueue the kernel several times with non-overlapping, increasing ranges? The input and output data are totally independent of each other in the program.

                 

                Nevertheless, I set the workgroup size to 64 and the problem is still the same.

                 

                Total time for GalaXYZ input data MPI_Bcast =0.0 seconds

                Real 4309320 Sim 4309320 Hist 257

                Getting total number of worker threads

                Total number of worker threads 1

                Master node sending 0 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 0 length 25000 events 1 time 18.06 seconds

                Master node sending 25000 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 25000 length 25000 events 1 time 18.29 seconds

                Master node sending 50000 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 50000 length 25000 events 1 time 18.42 seconds

                Master node sending 75000 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 75000 length 25000 events 1 time 18.32 seconds

                Master node sending 100000 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 100000 length 25000 events 1 time 18.76 seconds

                Master node sending 125000 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 125000 length 25000 events 1 time 18.94 seconds

                Master node sending 150000 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 150000 length 25000 events 1 time 19.68 seconds

                Master node sending 175000 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 175000 length 25000 events 1 time 19.48 seconds

                Master node sending 200000 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 200000 length 25000 events 1 time 19.14 seconds

                Master node sending 225000 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 225000 length 25000 events 1 time 19.17 seconds

                Master node sending 250000 25000 to node 0 thread 0

                Slave node 0 thread 0 offset 250000 length 25000 events 1 time 18.19 seconds

                Master node sending 275000 25000 to node 0 thread 0

                 

                That's it... it gets stuck and the error is:

                 

                [ 8173.163437] [fglrx] ASIC hang happened

                [ 8173.163446] Pid: 2511, comm: reference Tainted: P            3.0.0-23-generic #39-Ubuntu

                [ 8173.163451] Call Trace:

                [ 8173.163523]  [<ffffffffa011a0ce>] KCL_DEBUG_OsDump+0xe/0x10 [fglrx]

                [ 8173.163574]  [<ffffffffa01275ac>] firegl_hardwareHangRecovery+0x1c/0x30 [fglrx]

                [ 8173.163669]  [<ffffffffa01a0a59>] ? _ZN4Asic9WaitUntil15ResetASICIfHungEv+0x9/0x10 [fglrx]

                [ 8173.163761]  [<ffffffffa01a09fc>] ? _ZN4Asic9WaitUntil15WaitForCompleteEv+0x9c/0xf0 [fglrx]

                [ 8173.163865]  [<ffffffffa01b1301>] ? _ZN4Asic19PM4ElapsedTimeStampEj14_LARGE_INTEGER12_QS_CP_RING_+0x141/0x160 [fglrx]

                [ 8173.163923]  [<ffffffffa01464a2>] ? firegl_trace+0x72/0x1e0 [fglrx]

                [ 8173.163980]  [<ffffffffa01464a2>] ? firegl_trace+0x72/0x1e0 [fglrx]

                [ 8173.164082]  [<ffffffffa01a82a3>] ? _ZN15QS_PRIVATE_CORE27multiVpuPM4ElapsedTimeStampEj14_LARGE_INTEGER12_QS_CP_RING_+0x33/0x50 [fglrx]

                [ 8173.164226]  [<ffffffffa019ffb9>] ? _Z15uQSPM4TimestampmP20_QS_PM4_TS_PACKET_IN+0x69/0x70 [fglrx]

                [ 8173.164326]  [<ffffffffa019b31d>] ? _Z8uCWDDEQCmjjPvjS_+0x5dd/0x10c0 [fglrx]

                [ 8173.164340]  [<ffffffff8108747e>] ? down+0x2e/0x50

                [ 8173.164405]  [<ffffffffa0149baf>] ? firegl_cmmqs_CWDDE_32+0x36f/0x480 [fglrx]

                [ 8173.164469]  [<ffffffffa014829e>] ? firegl_cmmqs_CWDDE32+0x6e/0x100 [fglrx]

                [ 8173.164483]  [<ffffffff8128559a>] ? security_capable+0x2a/0x30

                [ 8173.164547]  [<ffffffffa0148230>] ? firegl_cmmqs_createdriver+0x170/0x170 [fglrx]

                [ 8173.164600]  [<ffffffffa01232ad>] ? firegl_ioctl+0x1ed/0x250 [fglrx]

                [ 8173.164645]  [<ffffffffa01139be>] ? ip_firegl_unlocked_ioctl+0xe/0x20 [fglrx]

                [ 8173.164658]  [<ffffffff8117a96a>] ? do_vfs_ioctl+0x8a/0x340

                [ 8173.164671]  [<ffffffff810985da>] ? sys_futex+0x10a/0x1a0

                [ 8173.164682]  [<ffffffff8117acb1>] ? sys_ioctl+0x91/0xa0

                [ 8173.164695]  [<ffffffff815fd402>] ? system_call_fastpath+0x16/0x1b

                [ 8173.164707] pubdev:0xffffffffa0335c80, num of device:1 , name:fglrx, major 8, minor 98.

                [ 8173.164718] device 0 : 0xffff88042491c000 .

                [ 8173.164727] Asic ID:0x6798, revision:0x5, MMIOReg:0xffffc90015300000.

                [ 8173.164737] FB phys addr: 0xc0000000, MC :0xf400000000, Total FB size :0xc0000000.

                [ 8173.164746] gart table MC:0xf40f8fd000, Physical:0xcf8fd000, size:0x402000.

                [ 8173.164755] mc_node :FB, total 1 zones

                [ 8173.164763]     MC start:0xf400000000, Physical:0xc0000000, size:0xfd00000.

                [ 8173.164773]     Mapped heap -- Offset:0x0, size:0xf8fd000, reference count:19, mapping count:0,

                [ 8173.164785]     Mapped heap -- Offset:0x0, size:0x1000000, reference count:1, mapping count:0,

                [ 8173.164795]     Mapped heap -- Offset:0xf8fd000, size:0x403000, reference count:1, mapping count:0,

                [ 8173.164805] mc_node :INV_FB, total 1 zones

                [ 8173.164813]     MC start:0xf40fd00000, Physical:0xcfd00000, size:0xb0300000.

                [ 8173.164823]     Mapped heap -- Offset:0x2f8000, size:0x8000, reference count:1, mapping count:0,

                [ 8173.164834]     Mapped heap -- Offset:0xb02f4000, size:0xc000, reference count:1, mapping count:0,

                [ 8173.164845] mc_node :GART_USWC, total 3 zones

                [ 8173.164852]     MC start:0xffa0100000, Physical:0x0, size:0x50000000.

                [ 8173.164862]     Mapped heap -- Offset:0x0, size:0x2000000, reference count:16, mapping count:0,

                [ 8173.164872] mc_node :GART_CACHEABLE, total 3 zones

                [ 8173.164881]     MC start:0xff70400000, Physical:0x0, size:0x2fd00000.

                [ 8173.164890]     Mapped heap -- Offset:0xc00000, size:0x100000, reference count:2, mapping count:0,

                [ 8173.164901]     Mapped heap -- Offset:0xb00000, size:0x100000, reference count:1, mapping count:0,

                [ 8173.164912]     Mapped heap -- Offset:0x200000, size:0x900000, reference count:3, mapping count:0,

                [ 8173.164923]     Mapped heap -- Offset:0x0, size:0x200000, reference count:5, mapping count:0,

                [ 8173.164934]     Mapped heap -- Offset:0xef000, size:0x11000, reference count:1, mapping count:0,

                [ 8173.164945] GRBM : 0xa0407028, SRBM : 0x200000c0 .

                [ 8173.164956] CP_RB_BASE : 0xffa01000, CP_RB_RPTR : 0x7330 , CP_RB_WPTR :0x7330.

                [ 8173.164967] CP_IB1_BUFSZ:0x0, CP_IB1_BASE_HI:0xff, CP_IB1_BASE_LO:0xa0851000.

                [ 8173.164976] last submit IB buffer -- MC :0xffa0851000,phys:0x4ece000.

                [ 8173.164992] Dump the trace queue.

                [ 8173.164999] End of dump

                 

                In this case, it took about 200-220 seconds (after 11 enqueues) for the problem to appear. If I increase the range sent to the slave to 50000, then it takes about 270-300 seconds (after 8 enqueues).
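The time-to-hang figure can be cross-checked against the per-enqueue timings in the log above. This is just a sketch of that check: the regular expression is written against the "Slave node ..." line format shown in this thread, and the log text is pasted in as a string rather than read from a file.

```python
import re

# The eleven completed runs reported before the hang (copied from the log).
LOG = """\
Slave node 0 thread 0 offset 0 length 25000 events 1 time 18.06 seconds
Slave node 0 thread 0 offset 25000 length 25000 events 1 time 18.29 seconds
Slave node 0 thread 0 offset 50000 length 25000 events 1 time 18.42 seconds
Slave node 0 thread 0 offset 75000 length 25000 events 1 time 18.32 seconds
Slave node 0 thread 0 offset 100000 length 25000 events 1 time 18.76 seconds
Slave node 0 thread 0 offset 125000 length 25000 events 1 time 18.94 seconds
Slave node 0 thread 0 offset 150000 length 25000 events 1 time 19.68 seconds
Slave node 0 thread 0 offset 175000 length 25000 events 1 time 19.48 seconds
Slave node 0 thread 0 offset 200000 length 25000 events 1 time 19.14 seconds
Slave node 0 thread 0 offset 225000 length 25000 events 1 time 19.17 seconds
Slave node 0 thread 0 offset 250000 length 25000 events 1 time 18.19 seconds
"""

# Matches the offset, length and elapsed time of one completed run.
PATTERN = re.compile(r"offset (\d+) length (\d+) events \d+ time ([\d.]+) seconds")

runs = [(int(o), int(n), float(t)) for o, n, t in PATTERN.findall(LOG)]
completed = len(runs)
total_time = sum(t for _, _, t in runs)
last_offset = runs[-1][0]

print(f"{completed} runs completed, {total_time:.2f} s of kernel time, "
      f"last finished offset {last_offset}")
```

Eleven runs finish in a bit over 206 seconds of kernel time, and the twelfth (offset 275000) is the one that never returns, which matches the 200-220 second estimate.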

                  • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                    binying

                    so can you provide a link to the MPI binary?

                      • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                        yurtesen

                        No, I meant the whole program compiled. But I am not sure how that would help. You can get MPI freely and easily on any Linux operating system. Use mpich2, for example (you can also install it with your favourite package manager: apt-get, yum, etc.):

                        http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads

                         

                        Also, OpenMP is part of GCC (I just remembered) so I am not sure how come drallan does not have it

                        http://gcc.gnu.org/wiki/openmp

                        As of GCC 4.2, the compiler implements version 2.5 of the OpenMP standard and as of 4.4 it implements version 3.0 of the OpenMP standard. The OpenMP 3.1 is supported since GCC 4.7.

                          • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                            binying

                            Oh, OK. I'd like to have a copy of the binary if you don't mind. And I am going to install mpich2 in order to reproduce what you've seen.

                            • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                              drallan

                              Also, OpenMP is part of GCC (I just remembered) so I am not sure how come drallan does not have it

                              http://gcc.gnu.org/wiki/openmp

                               

                              Is it? then I should have it. Guess I was too busy writing AMD assembly. Take a look tomorrow.

                                • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                  yurtesen

                                  drallan wrote:

                                  Is it? then I should have it. Guess I was too busy writing AMD assembly. Take a look tomorrow.

                                  I would appreciate it a lot, but you will still need MPI to be able to compile the program...

                                    • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                      nnunn@ausport.gov.au

                                      This does sound familiar!  We have no probs with a single Cayman (or 3 x GTX580 using mpich2).  But link with OpenMPI and the proc on a single Tahiti box dies.  Easy fix for us was to link with mpich2 on the Tahiti box.

                                        • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                          yurtesen

                                          nnunn@ausport.gov.au wrote:

                                           

                                          This does sound familiar!  We have no probs with a single Cayman (or 3 x GTX580 using mpich2).  But link with OpenMPI and the proc on a single Tahiti box dies.  Easy fix for us was to link with mpich2 on the Tahiti box.

                                          I don't quite understand what you mean. What do you mean by "dies" exactly?

                                           

                                          Also, we were talking about OpenMP, not OpenMPI. I used OpenMP to be able to enqueue to multiple devices concurrently, using multiple contexts from multiple threads. However, the problem also occurs on a single box with a single GPU.

                                            • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                              binying

                                              sh run.sh

                                              rm -f *.o reference

                                              mpicc  -Wall -O3 -g -I./include/ -I/usr/include/mpich2-x86_64 -I. -fopenmp  -c -o reference.o reference.cpp

                                              mpicc  -Wall -O3 -g -I./include/ -I/usr/include/mpich2-x86_64 -I. -fopenmp  -c -o opencl.o opencl.cpp

                                              mpicc  -o reference reference.o opencl.o -I./include/ -I/usr/include/mpich2-x86_64 -I. -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich -lmpl -lgomp

                                              1 platform found:

                                              -------------------------------------------------------------------------------

                                              platform 0*:

                                                  name: AMD Accelerated Parallel Processing

                                                  profile: FULL_PROFILE

                                                  version: OpenCL 1.2 AMD-APP (938.1)

                                                  vendor: Advanced Micro Devices, Inc.

                                                  extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

                                              -------------------------------------------------------------------------------

                                              * - Selected

                                               

                                              Devices of type GPU :

                                              -------------------------------------------------------------------------------

                                              0*    Cayman

                                              1     Cayman

                                              -------------------------------------------------------------------------------

                                              * - Selected

                                              Device 0 log:

                                              "/tmp/OCLxkgZoq.cl", line 150: warning: null (zero) character in input line

                                                        ignored

                                                

                                                ^

                                               

                                               

                                              Warning: galaxyz kernel has register spilling. Lower performance is expected.

                                               

                                               

                                              ../data/m.txt contains 430932 lines

                                                   first item: 52.660000 10.900000

                                                    last item: 86.260002 8.090000

                                              ../data/m_r.txt contains 430932 lines

                                                   first item: 76.089050 32.209370

                                                    last item: 27.345739 38.801189

                                               

                                              Total time for GalaXYZ input data MPI_Bcast =    0.0 seconds

                                              Real 430932 Sim 430932 Hist 257

                                              Getting total number of worker threads

                                              Total number of worker threads 1

                                              Slave node 0 thread 0 sending 0

                                              Master node 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 0 25000 to node 0 thread 0

                                              Slave node 0 thread 0 waiting

                                              Slave node 0 thread 0 received 0 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 0 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 25000, remaining lines 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 offset 0 length 25000 events 1 time 5.85 seconds

                                              Slave node 0 thread 0 finished 0 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 25000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 25000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 25000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 50000, remaining lines 0

                                              Slave node 0 thread 0 offset 25000 length 25000 events 1 time 5.58 seconds

                                              Slave node 0 thread 0 finished 25000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 50000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 50000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 50000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 75000, remaining lines 0

                                              Slave node 0 thread 0 offset 50000 length 25000 events 1 time 5.36 seconds

                                              Slave node 0 thread 0 finished 50000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 75000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 75000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 75000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 100000, remaining lines 0

                                              Slave node 0 thread 0 offset 75000 length 25000 events 1 time 5.15 seconds

                                              Slave node 0 thread 0 finished 75000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 100000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 100000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 100000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 125000, remaining lines 0

                                              Slave node 0 thread 0 offset 100000 length 25000 events 1 time 4.95 seconds

                                              Slave node 0 thread 0 finished 100000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 125000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 125000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 125000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 150000, remaining lines 0

                                              Slave node 0 thread 0 offset 125000 length 25000 events 1 time 4.70 seconds

                                              Slave node 0 thread 0 finished 125000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 150000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 150000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 150000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 175000, remaining lines 0

                                              Slave node 0 thread 0 offset 150000 length 25000 events 1 time 4.46 seconds

                                              Slave node 0 thread 0 finished 150000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 175000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 175000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 175000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 200000, remaining lines 0

                                              Slave node 0 thread 0 offset 175000 length 25000 events 1 time 4.23 seconds

                                              Slave node 0 thread 0 finished 175000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 200000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 200000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 200000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 225000, remaining lines 0

                                              Slave node 0 thread 0 offset 200000 length 25000 events 1 time 3.99 seconds

                                              Slave node 0 thread 0 finished 200000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 225000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 225000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 225000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 250000, remaining lines 0

                                              Slave node 0 thread 0 offset 225000 length 25000 events 1 time 3.73 seconds

                                              Slave node 0 thread 0 finished 225000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 250000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 250000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 250000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 275000, remaining lines 0

                                              Slave node 0 thread 0 offset 250000 length 25000 events 1 time 3.47 seconds

                                              Slave node 0 thread 0 finished 250000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 275000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 275000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 275000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 300000, remaining lines 0

                                              Slave node 0 thread 0 offset 275000 length 25000 events 1 time 3.25 seconds

                                              Slave node 0 thread 0 finished 275000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 300000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 300000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 300000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 325000, remaining lines 0

                                              Slave node 0 thread 0 offset 300000 length 25000 events 1 time 3.02 seconds

                                              Slave node 0 thread 0 finished 300000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 325000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 325000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 325000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 350000, remaining lines 0

                                              Slave node 0 thread 0 offset 325000 length 25000 events 1 time 2.79 seconds

                                              Slave node 0 thread 0 finished 325000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 350000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 350000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 350000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 375000, remaining lines 0

                                              Slave node 0 thread 0 offset 350000 length 25000 events 1 time 2.55 seconds

                                              Slave node 0 thread 0 finished 350000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 375000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 375000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 375000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 400000, remaining lines 0

                                              Slave node 0 thread 0 offset 375000 length 25000 events 1 time 2.31 seconds

                                              Slave node 0 thread 0 finished 375000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 400000 25000 to node 0 thread 0

                                              Master node 0 waiting

                                              Slave node 0 thread 0 received 400000 25000

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 400000 global 25088 with vector size 1

                                              Node/Dev 0/0: First kernel processes 25000 lines with localmax 425000, remaining lines 0

                                              Slave node 0 thread 0 offset 400000 length 25000 events 1 time 2.07 seconds

                                              Slave node 0 thread 0 finished 400000 25000

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Master node 0 received id 0 thread 0

                                              Master node sending 425000 5932 to node 0 thread 0

                                              Master finished. Starting exit procedure...

                                              Slave node 0 thread 0 received 425000 5932

                                               

                                              Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

                                              Node/Dev 0/0: Maximum workgroup size 256

                                              Node/Dev 0/0: Using workgroup size 256

                                              Node/Dev 0/0: Using offset 425000 global 6144 with vector size 1

                                              Node/Dev 0/0: First kernel processes 5932 lines with localmax 430932, remaining lines 0

                                              Slave node 0 thread 0 offset 425000 length 5932 events 1 time 0.49 seconds

                                              Slave node 0 thread 0 finished 425000 5932

                                              Slave node 0 thread 0 sending 0

                                              Slave node 0 thread 0 waiting

                                              Sending exit message to node 0 thread 0

                                              Slave node 0 thread 0 received -1 -1

                                               

                                              WALL time for GalaXYZ kernel =   68.1 seconds

                                              MPI WALL time for GalaXYZ kernel =   68.1 seconds

                                              CPU time for GalaXYZ kernel  =   68.1 seconds

                                               

                                              Doubling DD angle histogram...,  histogram count = 114799465750

                                                                               Calculated      = 57399517409

                                                                               >=256           = 35451461437

                                                                               Total           = 92850978846

                                               

                                              DR angle                         histogram count = 98903437674

                                                                               Calculated      = 98903437674

                                                                               >=256           = 86798950950

                                                                               Total           = 185702388624

                                               

                                              Doubling RR angle histogram...,  histogram count = 94429502254

                                                                               Calculated      = 47214535661

                                                                               >=256           = 45636443185

                                                                               Total           = 92850978846

                                               

                                               

                                              27.62user 41.80system 1:09.43elapsed 99%CPU (0avgtext+0avgdata 258208maxresident)k

                                              24696inputs+944outputs (1major+21819minor)pagefaults 0swaps

                              • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                hsaigol

                                I have tried this on newer internal drivers on Ubuntu 10.04 64-bit, and I was able to loop the 4.3 million version for 72 hours (68 loops). So I hope a future driver will fix the issue for you.
                                Also, as a side question: if I compare the out.txt from run to run, should I see differences?

                                  • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                    yurtesen

                                    Well, it is difficult to say whether the results should be the same or not. It depends on whether the hardware or threads somehow reorder operations on FP numbers from run to run. Can that happen? I am not an expert on what OpenCL does internally... If yes, slight differences can occur due to differences in the order of operations. If not... let me know.

                                     

                                    Do you mean that you re-ran the 4.3 million version over and over for 72 hours? Thanks for that (I ask only because a single run shouldn't take that long).

                                     

                                    I was just working on a version without MPI and OpenMP for testing this issue, and I might be able to finish it tomorrow. Do you know what exactly the problem was?

                                  • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                    hsaigol

                                    Yes, I was running the 4.3 million version over and over for a total of 72 hours.
                                    I have no idea whether the operations will be reordered or not; I am only executing your code and testing whether it fails. This was more of a personal exercise, and from how you explain it, your code shouldn't have caused hangs in the first place.

                                    • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                      hsaigol

                                      Hi Yurtesen,

                                      When I run the 50k input file I get consistent results every time. Even when I run the m.txt and m_r.txt input files, I get consistent output between runs, but when I run the 4.3 million file I get different outputs from run to run. *scratching head*

                                      Is there an expected output file I can compare the output data against to see what is going on? Can you provide one?

                                      • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                        hsaigol
                                        m.txt and m_r.txt are the 430k files; they produce consistent results.
                                        So I ran the 4.3 million line version again last night on a Tahiti GHz Edition.
                                        The numbers below are loop numbers. The outputs matched each other for, e.g., loops 17, 14, 2, 3, 4...
                                        The outputs from loops 1, 11, 13 matched each other but were different from those produced by loops 17, 14, 2, 3...
                                        The outputs from loops 7 and 15 did not match anything.
                                        Group              Loop numbers
                                        match              2, 3, 4, 5, 6, 8, 9, 10, 12, 14, 16, 17
                                        diff from above    1, 7, 11, 13, 15
                                        sub match          1, 11, 13
                                          • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                            yurtesen

                                            hsaigol, I will run it on some other devices (Nvidia, CPU, etc.) and get back to you. I believe the 4.3m file results should be exactly 10 times the 430k results, but it appears that is rarely the case.

                                              • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                hsaigol

                                                I'll wait for your reply. If I get a chance, I will try on a 78xx GPU as well and see what happens.

                                                Also, what version of the driver are you using?

                                                Can you open CCC and check, under the Information tab, the exact driver information, more specifically the driver packaging version? Thanks.

                                                  • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                    yurtesen

                                                    Meanwhile, just out of curiosity, if possible, can you have a look at the version I attached without OpenMP / MPI? That might be a better test case since it is less complicated.

                                                      • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                        drallan

                                                        Hi Yurtesen,

                                                         

                                                        I have run the 430932 size case ~50 times and do not see any hang. I've run the huge file a couple of times and see no hang either.

                                                         

                                                        What I do see is excellent Tahiti performance. Have you timed your new (non-MPI) program?

                                                        Tahiti is running about 4.5X faster than my Cayman: Tahiti 10.1 seconds vs Cayman 46.9 seconds.

                                                        The huge problem takes about 1003 seconds on Tahiti (100X, as expected for an N*N problem).

                                                        The 'huge' output numbers are roughly 100 times larger, as expected, with slight differences after dividing by 100.

                                                         

                                                        So, I see nothing unusual so far, though I am curious about your run time for the new code.

                                                        Also, are you still seeing register spilling?

                                                        BTW, I'm running the Tahitis at 1200 MHz.

                                                         

                                                        TAHITI

                                                        -------------------------------------------------------------------------------

                                                        Real 430932 Sim 430932 Hist 257

                                                        Using workgroup size 256

                                                        Using global size 431104

                                                        Running OpenCL GalaXYZ

                                                        Queueing part       0 -   25000 of  431104... Kernel finished 1.531

                                                        Queueing part   25000 -   50000 of  431104... Kernel finished 0.870

                                                        Queueing part   50000 -   75000 of  431104... Kernel finished 0.799

                                                        [.....]

                                                        Completed OpenCL GalaXYZ

                                                        WALL time for GalaXYZ kernel =   10.1 seconds

                                                        CPU time for GalaXYZ kernel  =   10.1 seconds

                                                         

                                                        Doubling DD angle histogram...,  histogram count = 169741846286

                                                                                         Calculated      = 84870707677

                                                                                         >=256           = 0

                                                                                         Total           = 84870707677

                                                        DR angle                         histogram count = 169146070638

                                                                                         Calculated      = 169146070638

                                                                                         >=256           = 0

                                                                                         Total           = 169146070638

                                                        Doubling RR angle histogram...,  histogram count = 168527020850

                                                                                         Calculated      = 84263294959

                                                                                         >=256           = 0

                                                                                         Total           = 84263294959

                                                        CAYMAN

                                                        -------------------------------------------------------------------------------

                                                        Real 430932 Sim 430932 Hist 257

                                                        Using workgroup size 256

                                                        Using global size 431104

                                                        Running OpenCL GalaXYZ

                                                        Queueing part       0 -   25000 of  431104... Kernel finished 1.521

                                                        Queueing part   25000 -   50000 of  431104... Kernel finished 4.133

                                                        Queueing part   50000 -   75000 of  431104... Kernel finished 3.874

                                                        Queueing part   75000 -  100000 of  431104... Kernel finished 3.816

                                                        Queueing part  100000 -  125000 of  431104... Kernel finished 3.689

                                                        [.....]

                                                        Completed OpenCL GalaXYZ

                                                        WALL time for GalaXYZ kernel =   46.9 seconds

                                                        CPU time for GalaXYZ kernel  =   46.9 seconds

                                                        Doubling DD angle histogram...,  histogram count = 169741846286

                                                                                         Calculated      = 84870707677

                                                                                         >=256           = 0

                                                                                         Total           = 84870707677

                                                        DR angle                         histogram count = 169146070638

                                                                                         Calculated      = 169146070638

                                                                                         >=256           = 0

                                                                                         Total           = 169146070638

                                                        Doubling RR angle histogram...,  histogram count = 168527020850

                                                                                         Calculated      = 84263294959

                                                                                         >=256           = 0

                                                                                         Total           = 84263294959

                                                        -------------------------------------------------------------------------------

                                                          • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                            yurtesen

                                                            Can you attach the results you are getting from the 4.3m case? Are you getting similar/same results on consecutive runs? (I guess the results must be the same.)

                                                            I attached the output from an Nvidia Tesla card using the same version of the program. I am not able to run it on AMD cards at all anymore. Coincidentally, with the latest drivers I have, it also doesn't work on Cypress; it crashes.

                                                             

                                                            Also, this version is not the fastest version. I have a version with vector elements, which can do the 4.3m case in ~510 seconds on Tahiti. On a Tesla M2050, the same computation takes 4500 seconds (these are the best timings from different optimized versions for Tahiti and Tesla).

                                                             

                                                            I didn't attach the vector versions of the code because these versions are simpler and probably better for debugging the locking issue.

                                                             

                                                            Do you have access to a Linux workstation with Tahiti? Or do you see any visible problems with memory allocation etc. in the code? Am I the only person who can't run my own code?

                                                             

                                                            The register spilling message seems to come up on Cypress but not on Tahiti (as far as I can remember)... I wasn't worried about that yet, since it was crashing.

                                                              • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                drallan

                                                                Am I the only person who can't run my own code?

                                                                Yes, this is the way the world works.

                                                                 

                                                                Do you have access to Linux workstation with Tahiti? or do you see any visible problems in memory allocation etc in the code?

                                                                I want to put Linux on the system but that will take a little time. No, the code looks very straightforward and is easy to work with; I can't see it causing the hang.

                                                                 

                                                                I will post some huge data shortly, after a few more runs. I got your data file, thanks. BTW, I do see one way that the data output can vary, but probably not from run to run; I think you must know this and it's from troubleshooting. When the data is broken into parts that are not divisible by 256, the kernel's length is rounded up to a multiple of 256. Then on each run (but not the last) the last few work-items will add a little more to the histograms (e.g. chunk size = 25000, kernel work length = 25088, so 88 WIs will add to the histograms). I noticed that 8*1024, 16*1024, and 32*1024 chunks always give the same (and minimum) answers, which are different from the 25000 chunk size results. I don't see how that would change from run to run though.
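
A minimal sketch of the rounding effect described above, in plain Python rather than OpenCL. It assumes a work-group size of 256 and models the kernel as "one histogram contribution per global index"; the names `round_up`, `plan_chunks` and `work_items_run` are made up for illustration and are not taken from the attached program. The `guarded` flag stands for a hypothetical bounds check inside the kernel.

```python
WORKSIZE = 256  # assumed work-group size; global sizes are rounded up to a multiple of it


def round_up(n, multiple=WORKSIZE):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def plan_chunks(total, chunk):
    """Yield (offset, real_count, padded_global_size) for each enqueue."""
    for offset in range(0, total, chunk):
        count = min(chunk, total - offset)
        yield offset, count, round_up(count)


def work_items_run(total, chunk, guarded):
    """Count how often each index in [0, total) is processed.

    guarded=False: every padded work-item that still falls inside the
    data does real work (the situation described in the post above).
    guarded=True: work-items past the chunk's real count return early.
    """
    hits = [0] * total
    for offset, count, padded in plan_chunks(total, chunk):
        limit = count if guarded else padded
        for gid in range(offset, min(offset + limit, total)):
            hits[gid] += 1
    return hits


if __name__ == "__main__":
    total = 430932
    for chunk in (25000, 32768):
        extra = sum(work_items_run(total, chunk, guarded=False)) - total
        print(chunk, "-> indices processed twice:", extra)
```

Under this model, 25000-item chunks are padded to 25088, so each of the 17 non-final enqueues processes 88 indices that the next enqueue processes again (1496 in total), while 32768-item chunks need no padding and add nothing. That matches the observation that power-of-two chunk sizes give the minimum counts.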

                                                                 

                                                                510 seconds vs 4500 is real impressive, which is why this problem must be solved!!

                                                                My guess is the drivers, multiple Tahiti's have seemed somewhat troublesome.

                                                                  • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                    yurtesen

                                                                    Yes, I know about the extra histogram counts added in that test program (due to the quick hack). It originally did all the iterations in one go, so there was no need to check for those extra threads hanging off the end of each run. I have that covered in the MPI versions.

                                                                     

                                                                    I had to divide the run into smaller pieces (and run with MPI) because I am running the program on a cluster with many nodes, and I needed to send jobs piece by piece to the nodes. Furthermore, the problem gets smaller near the end (2 of the inner loops start from i+1), so I decided to send small pieces to the nodes to balance the load. If I had divided the problem into as many equal pieces as there are nodes, the last nodes would have finished very quickly and sat idle.
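
To put rough numbers on the load-balancing argument, here is a small sketch. It assumes the work for outer index i is exactly the number of inner iterations j = i+1 .. n-1 (a purely triangular loop, which is a simplification of the real kernel), and it models "send the next small piece to whichever node is free" as a greedy assignment to the least-loaded node. All function names are hypothetical.

```python
import heapq


def pair_work(start, stop, n):
    """Number of (i, j) pairs with start <= i < stop and i < j < n."""
    count = stop - start
    # sum over i of (n - 1 - i); the subtracted term is the sum of i itself
    return count * (n - 1) - (start + stop - 1) * count // 2


def equal_split(n, nodes):
    """Work per node when the outer range is cut into `nodes` equal slices."""
    step = -(-n // nodes)  # ceiling division
    return [pair_work(s, min(s + step, n), n) for s in range(0, n, step)]


def dynamic_split(n, nodes, chunk):
    """Work per node when small chunks go to the least-loaded node in turn."""
    loads = [0] * nodes
    heapq.heapify(loads)
    for s in range(0, n, chunk):
        least = heapq.heappop(loads)
        heapq.heappush(loads, least + pair_work(s, min(s + chunk, n), n))
    return sorted(loads)


if __name__ == "__main__":
    n, nodes = 430932, 8
    eq = equal_split(n, nodes)
    dyn = dynamic_split(n, nodes, 5000)
    print("equal slices, first/last work ratio:", eq[0] / eq[-1])
    print("small chunks, max/min work ratio:  ", dyn[-1] / dyn[0])
```

With 8 nodes and 430932 items, the first equal slice carries roughly 15 times the work of the last one under this model, whereas handing out 5000-item chunks keeps the busiest and idlest node within about 25% of each other.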

                                                                     

                                                                    drallan wrote:

                                                                     

                                                                    510 seconds vs 4500 is real impressive, which is why this problem must be solved!!

                                                                    My guess is the drivers, multiple Tahiti's have seemed somewhat troublesome.

                                                                    You are right, and we are going to write a paper about this, which will probably benefit AMD also. But if I can't get these programs to run properly, then, well, Nvidia will win.

                                                                     

                                                                    Since you seem to be interested and very helpful, here are some fun facts for you. The program calculates the two-point angular correlation function. My code is different (I follow a slightly different method), but there is an explanation of it here (that paper is old, but those guys even made code for FPGAs!):

                                                                    http://www-vm00.ncsa.illinois.edu/~kindr/projects/hpca/files/gpgpu09_presentation.pdf

                                                                    http://www.ncsa.illinois.edu/~kindr/projects/hpca/files/ECE498AL_problem_statement.pdf

                                                                     

                                                                    I have now attached float4 and float8 versions of the code; you should be able to compile them also. These do not have any problems. They work perfectly on everything from AMD, Nvidia and Intel (I just couldn't get it working on PlayStation 3). The Nvidia GPUs do not seem to like vectors, so these are not the best codes for them. The float8 cells for the FX-8150 are empty because the AMD SDK crashed when generating AVX code with float8. I don't know if or when it will be fixed, but AMD confirmed the problem (it was easy to reproduce, since the kernel even caused KernelAnalyzer to crash).

                                                                     

                                                                    430937 Lines – Normal            AMD FX-8150  AMD FX-8150  Cypress  Tahiti  Tesla M2050  GTX580(oc)  GTX680  I7 980     X5650
                                                                                                     AMD SDK      Intel SDK                                                      Intel SDK  AMD SDK
                                                                    ocl6_float4_v3_ulong_amd         323.3        325.77       17.43    12.91   17.26        70.74       88.48   244.56     475.48
                                                                    ocl6_float4_v3_ulong_amd_jancos  839.83       304.37       12.75    6.81    12.2         65.08       89      274.26     308.77
                                                                    ocl6_float8_v3_ulong_amd_jancos  ---          ---          11.62    6.75    137.3        81.22       112.23  273.28     377.4

                                                                     

                                                                     

                                                                     

                                                                    At this point, I also think there are extra problems due to OpenMP. Strangely, nothing I have with OpenMP is working at all; I get wrong results... They used to work earlier. I will have to debug everything again... I will post updates if I find more.

                                                                      • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                        drallan

                                                                        Ah, the Sloan Digital Sky Survey, yes, it does all become clear. 230 million galactic entities, so you need to convolve the entire universe; only Tahiti can do that. The Sloan survey is impressive.

                                                                         

                                                                        I attached my 'huge' data output. It differs slightly from yours, on the order of ~1/10000. I can produce the same differences with almost any alteration of execution order (as you mentioned earlier).

                                                                        The next three files are:

                                                                          1. Tahiti output from the 430K problem.

                                                                          2. Same, but using the ocl fma instruction in place of the sum x*x + y*y + z*z.

                                                                          3. Same, no fma, but reversing the sum order, i.e., z*z + y*y + x*x (only for the DD histogram). Again, similar differences.

                                                                         

                                                                        So it seems these differences are from "binning noise". Still not sure how that would happen between runs.
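
A small illustration of how summation order alone can move a value across a bin edge. This is not the attached kernel: it emulates single precision in Python by rounding to float32 after every operation, and the inputs and the bin boundary `edge` are hand-picked (hypothetical) so that the effect is guaranteed to show.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest IEEE-754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_sq_xyz(x, y, z):
    """x*x + y*y + z*z, rounded to single precision after every step."""
    return f32(f32(f32(x * x) + f32(y * y)) + f32(z * z))


def sum_sq_zyx(x, y, z):
    """The same three terms added in reverse order: z*z + y*y + x*x."""
    return f32(f32(f32(z * z) + f32(y * y)) + f32(x * x))


# Hand-picked inputs: each small square is half an ulp of 1.0 in float32.
x, y, z = 1.0, 2.0 ** -12, 2.0 ** -12

forward = sum_sq_xyz(x, y, z)   # small terms are rounded away one at a time
reverse = sum_sq_zyx(x, y, z)   # small terms combine first and survive
edge = 1.0 + 2.0 ** -24         # hypothetical histogram bin boundary

print(forward, reverse, forward < edge < reverse)
```

The two orders differ by one unit in the last place here, which is enough to put the same pair on opposite sides of the assumed boundary. With many pairs sitting close to bin edges, differences of this kind would be one plausible source of the small count changes seen with fma or a reordered sum.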

                                                                        I will look at the faster vector programs.

                                                                        At 6.7 seconds, you are probably memory bound by Tahiti's 390 GB/s global bandwidth.

                                                                          • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                            yurtesen

                                                                            Your outputs all look good. Some strangeness was expected because of the extra threads doing extra calculations at the end of each step (the quick hack).

                                                                             

                                                                            From MPI versions and float4/8 versions I am getting exactly 100 times larger values which is perfect.

                                                                             

                                                                            It is just that Tahiti, and now also Cypress, are crashing on me after several kernel enqueues. Hsaigol said he gets it working with the 'latest internal drivers'. I would like to get my hands on those latest internal drivers!

                                                                            • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                              yurtesen

                                                                              drallan, one more thing... is it possible for you to check the 'problem1' in my first post in this thread ?

                                                                                • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                  drallan

                                                                                  drallan, one more thing... is it possible for you to check the 'problem1' in my first post in this thread ?

                                                                                   

                                                                                  Hi Yurtesen,

                                                                                  Here's some data that makes me think the time difference may not be due to ocl buffers.

                                                                                   

                                                                                  1. I defined 6 ordinary device buffers (flag=0) and manually wrote them to the card before running the kernel, and did not see any difference.

                                                                                   

                                                                                  2. Here are three runs of the 430K problem where the only difference is the size of the kernel run. (All times are slightly faster because I rearranged the order of memory reads in the kernel; this is not relevant to this data.) I see the same kind of slowdown for 25000 chunks, but the 32K chunks actually run faster! This makes me think the time difference is due to something like memory access patterns, cache, etc.

                                                                                   

                                                                                  Kernel size         Run time

                                                                                    431104               8.0 sec.   baseline, one large single block

                                                                                    25000                 8.7 sec.   multiple pieces, shows same slowdown as in problem 1

                                                                                    32768                 7.6 sec.   binary power of 2 happy size

                                                                                    • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                      yurtesen

                                                                                      Sorry for the delayed answer. I had to deal with a pile of unnecessary stuff recently.

                                                                                      On Cypress, I get 34 seconds with a single run of 430932, 38.5 seconds when I run with 32768 steps, and 36.8 seconds with 25000 steps.

                                                                                       

                                                                                      From the kernel's point of view, memory access shouldn't be much different from running in one piece, no?

                                                                                      For example, if we ran i=1,2,3,4,5 and then i=6,7,8,9,10 (1 to 5, then 6 to 10), exactly the same operations would be done as for i=1,2,3,4,5,6,7,8,9,10. Or are the threads starting in random order? We are even queuing in order, not doing things like 6,7,8,9,10 and then 1,2,3,4,5.

                                                                                       

                                                                                      The kernel takes at least 2-3 seconds, it is large enough to even things out. How do you think I can debug this issue? Any pointers? I need to know why the performance is so variable.

                                                                                       

                                                                                      Did you also try to queue all the kernels and then flush the queue, without waiting for each one to end? (I wonder if the events are somehow adding some delays?)

                                                                                        • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                          drallan

                                                                                          I also ran Cayman, and 32768  threads is still a little faster than 25000, which is different from your Cypress data. I didn't run the 430K  on Cayman because it times out.

                                                                                                     430K  32768   25000            

                                                                                          Cayman     ---    37.1    40.7

                                                                                          Tahiti     8.0     7.6     8.6

                                                                                          I would assume you're right that these algorithms should run about the same and that memory access patterns should be about the same. Although, 32768 is a real sweet spot for Tahiti's ALU as long as latency is not a problem.

                                                                                           

                                                                                          A lot of things can make small changes, some of which can vary from one machine, OS, or driver to the next. I even saw that dragging the dos prompt to a different monitor is worth about 0.4 seconds.

                                                                                           

                                                                                          Did you also try to queue all the kernels and then flush the queue, without waiting for each one to end? (I wonder if the events are somehow adding some delays?)

                                                                                           

                                                                                          Yes, and maybe that gives a good clue: without waiting for each kernel to finish, it slows down from 7.6 to 7.9 seconds, so perhaps larger numbers of threads can be slightly less efficient. Of course, they are wonderful for memory latency.

                                                                                           

                                                                                          The big question is how can you get your Tahiti's running???

                                                                                           

                                                                                          drallan

                                                                                            • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                              yurtesen

                                                                                              Have you considered whether you might get better speed with, for example, 40000, simply because it is a longer run?

                                                                                               

                                                                                              Well, I guess it might be a relief that Nvidia does even worse:

                                                                                              25k steps 94.0s

                                                                                              32768 steps 85.8s

                                                                                              430932 step 66.5s

                                                                                              (queuing all at once or waiting for kernel runs to finish does not seem to be making any difference on Nvidia)

                                                                                              Tomorrow I might try to run it through profiler on Nvidia I guess...

                                                                                               

                                                                                              drallan wrote:

                                                                                              The big question is how can you get your Tahiti's running??? 

                                                                                               

                                                                                              Good question, but the large data is nowadays failing on Cypress also. It used to work perfectly fine when I created this thread! Also, Hsaigol says he is able to run the program on Ubuntu 10. Maybe the best thing I can do is to install Ubuntu 10 on a USB stick and test on that.

                                                                                               

                                                                                              By the way, I was wondering, how difficult is it to set up Cygwin to compile and run this program? There is so much I can try and I have so little patience left.

                                                                                                • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                  nnunn@ausport.gov.au

                                                                                                  yurtesen, this looks like a great exercise for optimizing OpenCL on GCN.  The layout of this thread makes it a bit tricky to work out which is the current suggested experimental code.  For running under MPI on two 7970's, should I start with the code and data in your original post,

                                                                                                   

                                                                                                  ../eyurtese/amd/galaxyz.tgz,  ../eyurtese/amd/galaxy_data.tgz ?

                                                                                                   

                                                                                                  In our own codes, we've been having some >fun< with events, queues and timing.  Sorting out your issue may help us all learn a thing or two about GCN.

                                                                                                    • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                      yurtesen

                                                                                                      For debugging, it would be best to look at this code (I posted it to drallan earlier):

                                                                                                      http://devgurus.amd.com/servlet/JiveServlet/download/1283848-1936/ocl1_orig_jancos_steps.tgz

                                                                                                       

                                                                                                      It is a quick hack WITHOUT MPI or OpenMP: in the opencl.cpp file I simply made a loop which enqueues the kernel in pieces with offsets, instead of submitting the whole range at once. It crashes on a single card also (at least on my machines). I would be happy to hear whether it works for you or not. The file ./eyurtese/amd/galaxy_data.tgz includes the input data you will need to run the program.

                                                                                                    • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                      drallan

                                                                                                      --Have you considered if you might get better speed for example if you used 40000? simply because it is a longer run?

                                                                                                       

                                                                                                      It runs just a tad bit slower, not much though.

                                                                                                       

                                                                                                      By the way, I was wondering, how difficult is it to setup cygwin to compile and run this program? There is so much I can try and I have so little patience left

                                                                                                       

                                                                                                      Cygwin should be fairly easy to install, I believe there is a setup program that downloads and installs everything for you.

                                                                                                      I mostly use a bare mingw installation unless make files require a shell, then use cygwin or msys.

                                                                                                       

                                                                                                      FWIW, my multi Tahiti system had problems with  driver upgrades for a very long time. I usually run the original, old drivers that came with the cards, or the more recent drivers since 8.98.2 seem better. I assume that certain Tahiti configurations were problematic for a while.  Is it easy to install drivers on Linux? Perhaps you could try both old and new and maybe one in the middle. On the other hand, if they all fail the same way, then maybe the problem is elsewhere? Just a thought.

                                                                                                        • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                          yurtesen

                                                                                                          drallan wrote:

                                                                                                           

                                                                                                            Is it easy to install drivers on Linux? Perhaps you could try both old and new and maybe one in the middle. On the other hand, if they all fail the same way, then maybe the problem is elsewhere? Just a thought.

                                                                                                          I just had a blast from the past: I installed Ubuntu 10.04 on a USB stick and I am trying to install the AMD drivers now. I think the driver version is not very relevant, since hsaigol said that he can run it both with the latest drivers and with internal drivers. That makes me think the problem might have something to do with the kernel version or the X version. I will update the thread after I run tests.

                                                                                                            • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                              hsaigol

                                                                                                              i am currently testing again on 10.04

                                                                                                              i'll get your timing information for the following with the 4.3 million file

                                                                                                               

                                                                                                              all 4.3million  together

                                                                                                              in steps of 25000: 1750s

                                                                                                              in steps of 32768

                                                                                                              in steps of 430932

                                                                                                              in steps of 50000

                                                                                                               

                                                                                                              Here is data for 430k since it's much faster and i need to head home now. will complete 4.3million later

                                                                                                              in steps of 25000: 17.4s

                                                                                                              in steps of 32768: 15.8s

                                                                                                              in steps of 50000: 14s

                                                                                                              all 430932 together: 10.8s

                                                                                                               

                                                                                                              i don't know how exactly you are timing your code; maybe the execution time is the same but the total program time is different due to some overheads
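
As a quick sanity check on the "some overheads" idea, the 430k timings just above can be turned into a cost per enqueue. This is only arithmetic on the posted numbers and assumes a simple model: total time = single-run time + (number of enqueues) x (fixed extra cost). The function name is made up for illustration.

```python
import math

TOTAL = 430932                                      # work-items in the 430k case
BASELINE = 10.8                                     # seconds, "all 430932 together"
STEPPED = {25000: 17.4, 32768: 15.8, 50000: 14.0}   # chunk size -> seconds


def per_enqueue_overhead(chunk, seconds, total=TOTAL, baseline=BASELINE):
    """Return (number of enqueues, extra seconds per enqueue vs. the single run)."""
    enqueues = math.ceil(total / chunk)
    return enqueues, (seconds - baseline) / enqueues


for chunk, seconds in STEPPED.items():
    n, cost = per_enqueue_overhead(chunk, seconds)
    print(f"{chunk:>6}: {n:2d} enqueues, {cost:.2f} s extra per enqueue")
```

All three chunk sizes come out at roughly 0.36 to 0.37 s of extra time per enqueue (18, 14 and 9 enqueues respectively), which is at least consistent with a fixed per-launch cost rather than something that depends on the chunk size.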

                                                                                                               

                                                                                                              the results for this exercise will be on diff sku of tahiti which has different clocks so the data will not be apples to apples compared to yours

                                                                                                              but all my runs will be on the same card/setup so you can compare them

                                                                                                               

                                                                                                              also i have noticed that the uninstall of the drivers is horrible, i just end up reimaging the hard drive every time i have to switch the driver.

                                                                                                               

                                                                                                              so yurtesen i would recommend you make a clone of the usb after installing linux on it so that you can try different drivers without having to go through reinstall, just reclone with clonezilla or something similar

                                                                                                               

                                                                                                              also note i install the following items on my 10.04

                                                                                                              apt-get install mpich2 openmpi-bin openmpi-doc libopenmpi-dev g++
                                                                                                              AMD-APP-SDK-v2.7-lnx64.tar --> official website

                                                                                                              amd-driver-installer-12-8-x86.x86_64.zip --> official website

                                                                                                               

                                                                                                              following change in Makefile: remove -lmpl from LD_FLAGS, i.e. change
                                                                                                              LD_FLAGS= -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich -lmpl -lgomp
                                                                                                              to
                                                                                                              LD_FLAGS= -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich -lgomp

                                                                                                                • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                  hsaigol

                                                                                                                  yurtesen read the conclusion in this post, does that align with what you're seeing

                                                                                                                  http://devgurus.amd.com/message/1282801#1282801

                                                                                                                   

                                                                                                                  also drallan how are you getting such amazing speedups when you break the worksizes

                                                                                                                             430K  32768   25000           

                                                                                                                  Cayman     ---    37.1    40.7

                                                                                                                  Tahiti     8.0     7.6     8.6

                                                                                                                  ???


                                                                                                                    • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                      yurtesen

                                                                                                                      hsaigol wrote:

                                                                                                                       

                                                                                                                      yurtesen read the conclusion in this post, does that align with what you're seeing

                                                                                                                      http://devgurus.amd.com/message/1282801#1282801

                                                                                                                       

                                                                                                                      also drallan how are you getting such amazing speedups when you break the worksizes

                                                                                                                                 430K  32768   25000           

                                                                                                                      Cayman     ---    37.1    40.7

                                                                                                                      Tahiti     8.0     7.6     8.6

                                                                                                                      ???


                                                                                                                      I am not sure how to compare. That post seems to refer to a multi-GPU implementation of dgemm, which requires some transfers from the GPUs etc. I do not need any transfers between kernel runs.

                                                                                                                       

                                                                                                                      About drallan's code, I think he shouldn't have gotten 8 seconds with the code he had. It is truly amazing (I only realized yesterday how small his numbers were!). However, he is running a different version than the one you have (no MPI or OpenMP). See the ocl1_orig_jancos_steps.tgz file I posted for him.

                                                                                                                       

                                                                                                                      I will also have to run the exact same code that I gave him on Tahiti. I will try to test it all today with Ubuntu 10.04 etc. and post an update.

                                                                                                                      • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                        drallan

                                                                                                                        hsaigol wrote:

                                                                                                                        also drallan how are you getting such amazing speedups when you break the worksizes

                                                                                                                                   430K  32768   25000           

                                                                                                                        Cayman     ---    37.1    40.7

                                                                                                                        Tahiti     8.0     7.6     8.6

                                                                                                                        ???


                                                                                                                         

                                                                                                                        Ah, yes. The original thread was about why the algorithm crashed only on Tahiti, and why it only crashed on the author's machine. In the beginning, I added a fairly simple optimization that improved the execution of smaller chunks as I understood the ultimate target was a distributed network using chunks. Those are the numbers I have been reporting. I mentioned this but not very clearly. Now that the thread is turning towards optimization, it's good you asked the question. My full set of numbers is:

                                                                                                                                   430K  32768   25000           

                                                                                                                        Tahiti     7.2     8.4    10.1     Author's original algorithm without MPI (NoMPI)

                                                                                                                        Tahiti     8.0     7.6     8.6     NoMPI with chunk optimization

                                                                                                                        Tahiti     8.0     7.3     8.4     NoMPI, same chunk optimization but cleaner

                                                                                                                        Cayman     ---    37.1    40.7     NoMPI with chunk optimization on Cayman

                                                                                                                        (Run times in seconds; Tahiti at 1200 MHz, Cayman at 950 MHz)

                                                                                                                         

                                                                                                                        The optimization combines the loops that calculate the angles, where possible, to avoid re-referencing the same area of global memory; this should be cache friendly for small chunks. Other than that, 32768 is exactly 8 waves, which fully utilizes the CUs without a large number of waiting threads. I think, though, that yurtesen probably has some better versions of the algorithm.

                                                                                                              • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                yurtesen

                                                                                                                drallan wrote:

                                                                                                                 

                                                                                                                I also ran Cayman, and 32768  threads is still a little faster than 25000, which is different from your Cypress data. I didn't run the 430K  on Cayman because it times out.

                                                                                                                           430K  32768   25000            

                                                                                                                Cayman     ---    37.1    40.7

                                                                                                                Tahiti     8.0     7.6     8.6

                                                                                                                 

                                                                                                                The big question is how can you get your Tahiti's running???

                                                                                                                 

                                                                                                                I thought your Tahitis were overclocked to 1.2 GHz? I am getting better results with MSI's 1010 MHz Tahiti card when the full 430932 step is used (these are from Ubuntu 10.04). Yet my results get slower and slower as the step size shrinks...

                                                                                                                 

                                                                                                                GPU                   430K    50000   32768   25000   5000
                                                                                                                Tahiti 1010mhz MSI    7.2s    10.6s   11.2s   13.1s   38.4s

                                                                                                                Here are the step sizes, the number of kernel runs, and the extra time per kernel run compared with the single 430K run:

                                                                                                                50k case runs 9 times      (10.6 - 7.2 ) / 9  = 0.38s

                                                                                                                32k case runs 13 times    (11.2 - 7.2 ) / 13 = 0.31s

                                                                                                                25k case runs 18 times    (13.1 - 7.2 ) / 18  = 0.33s

                                                                                                                5k case runs 87 times      (38.4 - 7.2 ) / 87 = 0.36s

                                                                                                                From these figures, I can say that each kernel run has a ~0.35s delay. This can't be a coincidence, right? (Although I don't know how drallan is getting those crazy results yet.)
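The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not part of the original program: the helper name `per_enqueue_overhead` is made up for illustration, and the timings and run counts are copied from the post as reported. It assumes the single-enqueue time (7.2s) is the pure compute cost and that anything above it is spread evenly over the enqueues.

```python
# Sketch of the per-enqueue overhead estimate (hypothetical helper name).
# Assumption: chunked_total = single_total + n_enqueues * overhead.


def per_enqueue_overhead(chunked_total_s, single_total_s, n_enqueues):
    """Extra seconds attributed to each enqueue under the assumption above."""
    return (chunked_total_s - single_total_s) / n_enqueues


SINGLE_RUN_S = 7.2  # whole range in one enqueue (Tahiti 1010 MHz, from the post)

# chunk size -> (total run time in seconds, number of enqueues), as reported
RUNS = {
    50000: (10.6, 9),
    32768: (11.2, 13),
    25000: (13.1, 18),
    5000: (38.4, 87),
}

overheads = {
    chunk: per_enqueue_overhead(total, SINGLE_RUN_S, n)
    for chunk, (total, n) in RUNS.items()
}
mean_overhead = sum(overheads.values()) / len(overheads)

for chunk, value in overheads.items():
    print(f"chunk {chunk:>6}: {value:.2f} s per enqueue")
print(f"mean: {mean_overhead:.2f} s per enqueue")
```

The four estimates land between roughly 0.31s and 0.38s, which is what makes a fixed per-enqueue cost look plausible. One caveat: 430932 / 32768 rounds up to 14 enqueues rather than the 13 listed, which would lower that row to about 0.29s; the counts are kept here as posted.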

                                                                                                                  • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                    drallan

                                                                                                                    From these figures, I can say that each kernel run has a ~0.35s delay. This can't be a coincidence, right? (Although I don't know how drallan is getting those crazy results yet.)

                                                                                                                     

                                                                                                                    yurtesen, congratulations on isolating the problem.

                                                                                                                     

                                                                                                                    As for the crazy numbers, please see my previous post; all my numbers have been for that code.

                                                                                                                    I also see that your 7.2 at 1010 MHz is the same as my 7.2 at 1200 MHz, both using the original code! Sigh.

                                                                                                                    • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                      drallan

                                                                                                                      Here is the kernel I have been using, it assumes that 'real' and 'sim' sizes are equal. If not, a slightly more complex structure is needed.

                                                                                                                      There is a #define to switch back to the original version.

                                                                                                                       

                                                                                                                      I wonder if this might explain your constant 0.35 second time.

                                                                                                                      It might relate to extra work for each chunk in the original version that is not done in the optimized version.

                                                                                                                       

                                                                                                                      Kernel file attached. For reference, Tahiti at 1200 MHz with 32768 threads takes 7.3 seconds.

                                                                                                                        • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                          yurtesen

                                                                                                                          drallan wrote:

                                                                                                                           

                                                                                                                          Here is the kernel I have been using, it assumes that 'real' and 'sim' sizes are equal. If not, a slightly more complex structure is needed.

                                                                                                                          There is a #define to switch back to the original version.

                                                                                                                           

                                                                                                                          I wonder if this might explain your constant 0.35 second time.

                                                                                                                          It might relate to extra work for each chunk in the original version that is not done in the optimized version.

                                                                                                                           

                                                                                                                          Kernel file attached. For reference, Tahiti at 1200 MHz with 32768 threads takes 7.3 seconds.

                                                                                                                          In reality they are always equal in all my tests, but the sample code I got had 3 different loops, so I thought I should take care of that.

                                                                                                                          I guess even when I run everything in one go, there is probably a 0.35s delay when the kernel starts; I am just not able to measure it. I am not sure a simple if statement can cause a 0.35s delay...

                                                                                                                          I will test your code later (it might take a few days; the machine is unavailable now and I have to go and boot it into Linux).

                                                                                                                          • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                            yurtesen

                                                                                                                            With your code, I get 9.9s for 18 enqueues vs 8.8s with a single enqueue. The difference is much smaller, but I don't understand how an if statement can cause this. Is there an explanation? You said "extra work for each chunk", but isn't the kernel run by each thread independently of whether it was queued in a single go or not? Is there any documentation which explains this?

                                                                                                                              • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                                drallan

                                                                                                                                yurtesen wrote:

                                                                                                                                 

                                                                                                                                With your code, I get 9.9s for 18 enqueues vs 8.8s with a single enqueue. The difference is much smaller, but I don't understand how an if statement can cause this. Is there an explanation? You said "extra work for each chunk", but isn't the kernel run by each thread independently of whether it was queued in a single go or not? Is there any documentation which explains this?

                                                                                                                                When adjusted for 1200/1010 MHz, your numbers are the same as mine, and I re-checked to make sure I didn't scramble anything. So the data looks real. You should then get about 8.1 sec for the 32768 size run (even more confusing!).

                                                                                                                                 

                                                                                                                                So your question is valid. Whether run in chunks or whole, it seems each workitem would read the same data. Where does the difference come from? I think there are  2 parts to the answer.

                                                                                                                                 

                                                                                                                                In both programs, each WI reads a wide range of memory, which is broken into 2 sets of X, Y, and Z blocks for both real and sim data. In your case, two of the loops read the same region of memory (in sim data): one loop reads the entire range, then the next loop goes back and reads the same range again. This is what I meant by extra work. If the loops are combined, this can reduce to a single read per WI. Thus the single-loop version should be more efficient in general.

                                                                                                                                 

                                                                                                                                The second part, why this seems to help smaller chunks, probably depends on cache behavior, which can be complex. One point is that the chunk sizes are similar in size to the L1 and L2 caches (roughly 1/2 MB), so one might expect to see differences. My guess was that a perfect 8-wave 32K sample would be favored. Then of course there is always some question about the OpenCL drivers and interface adding something on top.
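To make the "extra work" point concrete, here is a toy Python sketch of loop fusion. It is not the actual kernel: the per-element operations (`a * x`, `b * x`) and the `CountingArray` wrapper are invented for illustration. It only shows that two passes over the same array touch every element twice, while a combined loop touches each element once and still produces the same sums.

```python
class CountingArray:
    """List wrapper that counts element reads (a stand-in for global memory)."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        self.reads += 1
        return self._values[index]


def two_loops(sim, a, b):
    # Separate loops: one full pass, then a second pass over the same range.
    first = 0.0
    for i in range(len(sim)):
        first += a * sim[i]
    second = 0.0
    for i in range(len(sim)):
        second += b * sim[i]
    return first, second


def fused_loop(sim, a, b):
    # Combined loop: each element is read once and used for both sums.
    first = 0.0
    second = 0.0
    for i in range(len(sim)):
        x = sim[i]
        first += a * x
        second += b * x
    return first, second


data = [0.5 * i for i in range(1000)]

sim_two = CountingArray(data)
result_two = two_loops(sim_two, 2.0, 3.0)

sim_fused = CountingArray(data)
result_fused = fused_loop(sim_fused, 2.0, 3.0)

print("two loops :", sim_two.reads, "reads")
print("fused loop:", sim_fused.reads, "reads")
```

On a GPU the second pass may or may not be served from cache, which is where the chunk-size dependence discussed above would come in; the sketch only counts reads and says nothing about cache behavior.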

                                                                                                                                • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                                  drallan

                                                                                                                                  I just downloaded AMD's new CodeXL tool and ran profiles that show cache and memory activity for the programs.

                                                                                                                                  Cache seems to be the biggest factor: small 3-loop runs have low cache hit rates, while small 1-loop runs run best from the cache. Do you think these low cache hit rates can explain the constant delays?

                                                                                                                                   

                                                                                                                                  Name       KernelSize  CacheHits(%)  AvgFetches/WI  Notes
                                                                                                                                  ----------------------------------------------------------
                                                                                                                                  rg.exe         430000            87        2585077
                                                                                                                                  r25.exe         25000            62        2550000
                                                                                                                                  r32.exe         32768            55        2550000
                                                                                                                                  rog.exe        430000            49        1939032  one loop
                                                                                                                                  ro25.exe        25000            99        1930000  one loop
                                                                                                                                  ro32.exe        32768            96        1940000  one loop
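One rough way to read this table is to turn hit rates into an estimated number of fetches per work-item that miss the cache. The Python sketch below does that under a simple assumption that is not stated in the post: misses = fetches * (1 - hit rate), with the hit percentage applying uniformly to the average fetch count. The function name `estimated_misses` is hypothetical.

```python
# (name, kernel size, cache hit %, average fetches per work-item) from the table
PROFILES = [
    ("rg.exe", 430000, 87.0, 2585077),
    ("r25.exe", 25000, 62.0, 2550000),
    ("r32.exe", 32768, 55.0, 2550000),
    ("rog.exe", 430000, 49.0, 1939032),   # one loop
    ("ro25.exe", 25000, 99.0, 1930000),   # one loop
    ("ro32.exe", 32768, 96.0, 1940000),   # one loop
]


def estimated_misses(hit_percent, fetches_per_wi):
    """Fetches per work-item that miss the cache.

    Assumption (not from the post): misses = fetches * (1 - hit rate).
    """
    return fetches_per_wi * (1.0 - hit_percent / 100.0)


misses = {
    name: estimated_misses(hit, fetches)
    for name, _size, hit, fetches in PROFILES
}

for name, value in misses.items():
    print(f"{name:<9} ~{value:,.0f} missed fetches per work-item")
```

Under that assumption, the small one-loop runs miss far less often than the small 3-loop runs, while for the 430000 run the ordering is reversed. Whether that is enough to account for a constant delay per enqueue is a separate question that this arithmetic cannot answer.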

                                                                                                                                    • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                                      yurtesen

                                                                                                                                      Thanks drallan, I am trying to run CodeXL myself now on Linux...

                                                                                                                                      This could maybe explain the issue somewhat. Perhaps your kernel is affected less since it requires fewer fetches.

                                                                                                                                      I will have to run some tests and get back to you. It is still strange, because I believe the data shouldn't fit into a 2-3 MB cache. I will also test splitting the kernel and making all loops start from i=0 etc.

                                                                                                                                      Do you know if there is any mechanism in the GPU which can prefetch data? (I know very little about how caching works on the GPU.) Anyway, I have a lot of things to test now. I will be back when I have enough information to figure this out without any doubts.

                                                                                                                                      • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                                        yurtesen

                                                                                                                                        Hello drallan, I guess you did it again. I am getting the same results as you (within a few percent). Therefore I do agree that cache behavior must be related to this difference in speed.

                                                                                                                                        Still, it is very strange to think that it would affect the run time this much. After all, an average of 0.3 seconds extra per kernel run is quite long; it would take less time to read the GPU memory from start to end... But I am running out of time and this explanation will do fine.

                                                                                                                                        Thanks for your help; it is amazing that you have dedicated so much time to helping...

                                                                                                            • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                              yurtesen

                                                                                                              Hi hsaigol, after some consideration, I think the differences in results are caused by OpenMP somehow. It doesn't even work properly on Nvidia cards now. I recommend trying the quick-hack non-OpenMP/MPI version I have posted to the forum. It crashed as well, but the results should be the same after each run with the same workgroup size.

                                                                                                               

                                                                                                              I have installed amd-driver-installer-8.982-x86.x86_64.run and in my xorg log I see

                                                                                                              compiled for 1.4.99.906, module version = 8.98.2

                                                                                                               

                                                                                                              I will try to figure out the issue and fix it in the following days. I will let you know if I can fix it or not.

                                                                                                        • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                          liwoog

                                                                                                          I am finally hopeful that my code is running properly. What I learned to make it work:

                                                                                                           

                                                                                                          1) Make sure to use clFlush after all queuing, as the AMD implementation does not seem to allow a kernel's parameters to be changed and the kernel requeued before it is flushed.

                                                                                                           

                                                                                                          2) What was killing me: waiting for events on different queues does not seem to work. I had two queues waiting for events on one another and clEnqueueBuffer events did not properly wait.

                                                                                                           

                                                                                                          3) Finally, because of mixed GPU environments, I was allocating a context per GPU instead of a context per platform. The NVIDIA implementation did not care, but the AMD one did.

                                                                                                            • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                              hsaigol

                                                                                                              Sorry, I didn't get any testing done today; I am totally swamped with work. Maybe tomorrow I'll get some time.

                                                                                                                • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                  yurtesen

                                                                                                                  Hi hsaigol, thanks for all your help. Can you tell me which version of the driver you have?

                                                                                                                   

                                                                                                                  I think I know why you were getting different results. There was a bug in the opencl.cpp file near line 356. I am very sorry for that!

                                                                                                                  The last 2 .enqueueWriteBuffer lines were in the wrong order. I fixed it and updated the link in the first post:

                                                                                                                  http://users.abo.fi/eyurtese/amd/galaxyz.tgz

                                                                                                                   

                                                                                                                  I attached a zip file to this post which has the corrected opencl.cpp and the expected output files for 430k and 4300k cases as well. (the only difference in output is that 4300k case has 100 times larger output)

                                                                                                                   

                                                                                                                  Thanks,

                                                                                                                  Evren

                                                                                                                    • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                      hsaigol

                                                                                                                      Hi Yurtesen,
                                                                                                                      The version posted above works and i get consistent outputs that match your posted results.

                                                                                                                      To get the program to compile, I had to make the following change in the makefile (this was needed for the previous version too):

                                                                                                                      #LD_FLAGS= -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich -lmpl -lgomp

                                                                                                                      LD_FLAGS= -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich  -lgomp

                                                                                                                       

                                                                                                                      I'm using driver: 9.01-120904a

                                                                                                                       

                                                                                                                      Time taken to complete the 4.3 million-line file on a Tahiti GHz Edition: 1450 seconds

                                                                                                                      how good or bad is that compared to the top nvidia card you are using?

                                                                                                                        • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                          yurtesen

                                                                                                                          hsaigol wrote:

                                                                                                                          The version posted above works and i get consistent outputs that match your posted results.

                                                                                                                          I'm using driver: 9.01-120904a

                                                                                                                          Thanks, and sorry for the earlier bug in the program (2 lines caused so much trouble!). So there is probably a bug in the current Linux drivers that I have (by the way, Cypress was also crashing).

                                                                                                                           

                                                                                                                          How can I get hold of 9.01?

                                                                                                                           

                                                                                                                          Also, one more thing: I used to see a significant time difference between running the program in 50k steps and running it all at once. Please see 'Problem1' in my first post of the thread. Is this fixed as well?
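
For reference, the overhead in question can be estimated from the timings in the first post. The sketch below is only an illustration: the function name is made up, and it assumes that the whole time difference comes from the extra enqueues (the split run has pieces - 1 more enqueues than the single run) and not from the idle padding threads.

```cpp
#include <cassert>
#include <cmath>

// Extra wall-clock seconds per additional kernel enqueue, estimated from
// one run with a single enqueue and one run split into `pieces` enqueues.
// Assumption: the split run differs only by its (pieces - 1) extra enqueues.
double overhead_per_extra_enqueue(double split_seconds, double single_seconds,
                                  int pieces) {
    assert(pieces > 1);
    return (split_seconds - single_seconds) / (pieces - 1);
}
```

With the numbers from the first post, overhead_per_extra_enqueue(44.2, 36.7, 9) gives roughly 0.94 seconds on Tahiti, while overhead_per_extra_enqueue(44.0, 42.5, 9) gives roughly 0.19 seconds on Cypress.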

                                                                                                                            • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                              hsaigol

                                                                                                                              Hi,

                                                                                                                              Don't worry about the trouble; at least it helped you fix your code, so that's great.

                                                                                                                               

                                                                                                                              i understand what your question is but how do i test it, what do i need to modify in the code (better if you provide it) so that i have a version which runs through all the execution without subdivisions

                                                                                                                              if you want me to do the comparison you have to provide the other version so i can run it and see if the problem is fixed.

                                                                                                                              lastly i can test this on one of the dual GPU boards so if you want to provide a version of the code which splits the subset work onto multiple gpu's i can try that too.

                                                                                                                               

                                                                                                                               

                                                                                                                              i'm in the office for another 10-15mins and i'm hoping you can provide the code so i can test it before the weekend

                                                                                                                               

                                                                                                                              As for when the driver branch is being released, I'll ask the SW team and get back to you on that. I do want to double check and see if the drivers on the AMD webpage hang for me as well in my setup.
                                                                                                                              Sorry, but I can't provide you with the internal drivers. There are some really neat features coming, though I don't know if I'm allowed to write about them.

                                                                                                                                • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                                  yurtesen

                                                                                                                                  hsaigol wrote:

                                                                                                                                  i understand what your question is but how do i test it, what do i need to modify in the code (better if you provide it) so that i have a version which runs through all the execution without subdivisions

                                                                                                                                  if you want me to do the comparison you have to provide the other version so i can run it and see if the problem is fixed.

                                                                                                                                  lastly i can test this on one of the dual GPU boards so if you want to provide a version of the code which splits the subset work onto multiple gpu's i can try that too.

                                                                                                                                  Actually, there is a WORKSIZE define in defs.h. You simply set it to the number of lines in the input and then recompile, for example: 430932

                                                                                                                                  When running, the program will send a single piece of work and then realize that there is nothing more. So you will have a version which enqueues the kernel only once, with a minimal change to the code (therefore it is perhaps easier to see what is going on). Under normal operation there are some idle threads in each kernel run (due to rounding the size up to a multiple of the work-group size), but that obviously shouldn't cause a large performance difference.
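
To make the chunking concrete, here is a minimal sketch of how the enqueue offsets and padded global sizes could be computed. This is not the code from galaxyz.tgz: the names Chunk, round_up and plan_chunks are hypothetical, and the local size of 256 used in the note below is an assumption that merely happens to match the rounded sizes quoted in the first post (431104 and 50176). It also assumes the last chunk is padded from its own remaining size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One kernel enqueue: where it starts and how many work-items it launches.
struct Chunk {
    std::size_t offset;       // global work offset of this enqueue
    std::size_t items;        // real work-items (input lines) in this chunk
    std::size_t global_size;  // items rounded up to a multiple of local_size
};

// Round n up to the next multiple of m (m must be non-zero).
std::size_t round_up(std::size_t n, std::size_t m) {
    assert(m != 0);
    return ((n + m - 1) / m) * m;
}

// Split `total` work-items into chunks of at most `worksize` items each.
// `local_size` is an assumed work-group size, not taken from the program.
std::vector<Chunk> plan_chunks(std::size_t total, std::size_t worksize,
                               std::size_t local_size) {
    assert(worksize != 0);
    std::vector<Chunk> chunks;
    for (std::size_t offset = 0; offset < total; offset += worksize) {
        std::size_t remaining = total - offset;
        std::size_t items = remaining < worksize ? remaining : worksize;
        chunks.push_back({offset, items, round_up(items, local_size)});
    }
    return chunks;
}
```

With these assumptions, plan_chunks(430932, 50000, 256) yields 9 chunks, while plan_chunks(430932, 430932, 256) yields a single chunk with a global size of 431104. The idle threads in each run are global_size - items.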

                                                                                                                                   

                                                                                                                                  I don't understand how you are able to run the program without crashing. I think it is more or less certain now that there is no bug in the program itself... But can you test it with Ubuntu 12.04? Ubuntu 10.04 reaches EOL next year (the desktop version; the server version ends in 2015, but still, nobody will install 10.04 today when building new systems). Isn't it logical to test on 12.04 as well?

                                                                                                                                  https://wiki.ubuntu.com/Releases

                                                                                                                      • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                        hsaigol

                                                                                                                        Hi Yurtesen,

                                                                                                                        I have some good news (bad for you i'm guessing)

                                                                                                                        The news is that i was able to run your new program "http://users.abo.fi/eyurtese/amd/galaxyz.tgz" without any hangs with the 4.3 million line input using graphics drivers from the official AMD website

                                                                                                                         

                                                                                                                        driver: 8.982-120727a-144949C-ATI

                                                                                                                        OS: Ubuntu 10.04

                                                                                                                        Kernel: 2.6.32-33-generic x86_64

                                                                                                                         

                                                                                                                        On a clean install of Ubuntu, the following additional libraries/programs were installed:

                                                                                                                        apt-get install mpich2

                                                                                                                        apt-get install openmpi-bin openmpi-doc libopenmpi-dev

                                                                                                                        apt-get install g++       (v4.4)

                                                                                                                         

                                                                                                                        after this i installed the AMD APP SDK

                                                                                                                         

                                                                                                                        setup the paths

                                                                                                                        and ran your program and it completed without any issue

                                                                                                                         

                                                                                                                        I even scoped the voltage rails on the board while the app was running to try and capture the failing moment but it never failed

                                                                                                                        It took 1492 seconds to complete the program, during which time the display, console and mouse were almost completely non-responsive. So I just waited and let the system run, since I knew the completion time from before, and voila, after 1500 seconds the program was complete and everything returned to normal.

                                                                                                                          • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                            yurtesen

                                                                                                                            First things first... I ran the program from USB with Ubuntu 10.04 and it does indeed run without crashing. Therefore I feel that there is a bug in AMD's drivers which causes problems on Ubuntu 12.04. It might be due either to the different compiler (4.4 vs 4.7) or to the kernel (2.6 vs 3.2). I used the 12.8 drivers on both systems. The question is, will AMD try to fix this? And if yes, how can I help?

                                                                                                                             

                                                                                                                            I will get back to you about the performance results. One thing at a time!

                                                                                                                          • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                            liwoog

                                                                                                                            Just to say that all my codes are now running fine on the HD 7970 and they run 2x faster than on the GTX 680. We have now installed 40 cards in 10 machines.

                                                                                                                              • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                                yurtesen

                                                                                                                                My program is also working now, I think it was a driver issue at some point. But I got a performance hit on some of my programs with 13.1 drivers and 12.10 work much better.

                                                                                                                                 

                                                                                                                                Also, I realized that Catalyst does not seem to update the runtime version properly. I am not sure if it affects anything, but by removing Catalyst, installing the APP SDK, and then re-installing Catalyst I get OpenCL 1.2 AMD-APP (1113.2); otherwise Catalyst does not seem to update the versions.

                                                                                                                                  • Re: Tahiti 7970 lockup no problem in 5870 or Nvidia devices...
                                                                                                                                    himanshu.gautam

                                                                                                                                    Hi yurtesen,

                                                                                                                                    I have seen many people complaining about a performance drop with the 13.1 driver. I already have this issue with a particular test case, but it would certainly help if I could attach more test cases here so that a more appropriate solution can be found.

                                                                                                                                    Can you please attach a test case which shows the performance drop (the hsaigol.zip file is probably appropriate; please confirm)? Also let me know the system details (OS, 32- or 64-bit, GPUs present, performance observed with the 12.10 and 13.1 drivers).

                                                                                                                                    Thanks for your support.