
yurtesen
Miniboss

Tahiti 7970 lockup; no problem on 5870 or Nvidia devices...

I have a very simple program. The data is transferred into GPU memory at the beginning of the program, and the main program just queues kernel runs with different starting offsets (waiting for the previous run to finish before queuing another). When the whole range has been executed, it reads the results from the card.

It is sad because AMD's GCN and even older cards beat their Nvidia counterparts greatly in these calculations, yet something simply does not function properly, so we are forced to use Nvidia hardware.

Problem1:

The program enqueues a kernel run, waits for it to finish, and enqueues another. There is significant overhead when starting kernel runs (on Tahiti), even though there shouldn't be any data transfer at all.

A case with a global size of 430932 (rounded up to 431104) takes 36.7 seconds to run when the kernel is enqueued once. If the kernel is enqueued with a global size of 50000 (rounded to 50176), using offsets and run in 9 pieces, the total runtime is 44.2 seconds. The overhead is almost a second per kernel enqueue on Tahiti. On Cypress the difference is only 42.5 vs 44 seconds.

Note: in the program, the block size is set from the defs.h file using the WORKSIZE variable.

Problem2:

This works fine if the whole range is executed all at once, say 0 to 4000000, but if I try 0-50000, 50000-100000, 100000-150000, etc., then after 8-10 enqueues it gets stuck and the only way to recover is rebooting the box. (Yes, the global size was rounded up to a multiple of the worksize.)

This happens only with Tahiti and NOT with, for example, Cypress (5870) or Nvidia Tesla cards.
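For reference, the enqueue pattern looks roughly like this (a minimal sketch using the plain C API; the actual program uses the C++ wrapper, and kernel/queue/buffer setup, argument passing and error handling are omitted, so all names here are illustrative):

    #include <CL/cl.h>

    /* Enqueue the kernel chunk by chunk with a global offset, waiting for each
     * run to finish before queuing the next one. */
    static void run_in_chunks(cl_command_queue queue, cl_kernel kernel)
    {
        size_t local = 256;      /* workgroup size reported in the run log below */
        size_t chunk = 50176;    /* e.g. 50000 rounded up to a multiple of 256   */
        size_t total = 431104;   /* 430932 rounded up to a multiple of 256       */

        for (size_t offset = 0; offset < total; offset += chunk) {
            size_t global = (offset + chunk <= total) ? chunk : total - offset;
            cl_event ev;
            clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &global, &local,
                                   0, NULL, &ev);
            clWaitForEvents(1, &ev);   /* wait before queuing the next run */
            clReleaseEvent(ev);
        }
        /* read the result buffers back once the whole range has been processed */
    }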

I am providing the source code if anybody wants to have a look:

Program:

http://users.abo.fi/eyurtese/amd/galaxyz.tgz

Data Files:

http://users.abo.fi/eyurtese/amd/galaxy_data.tgz

The program includes a small Makefile; it should be easy to run (might require editing). If you have any problems, please let me know.

Variables for selecting the card etc. are stored in defs.h

The program is run using the following command lines (use the correct paths to the data files):

430k test cases:

./reference ../data/m.txt ../data/m_r.txt out.txt

or

./reference ../data/m_s.txt ../data/m_r_s.txt out.txt

4300k test cases:

./reference ../data/m_huge.txt ../data/m_huge_r.txt out.txt

or

./reference ../data/m_huge_s.txt ../data/m_huge_r_s.txt out.txt

The difference between the normal files and the _s files is that the data in the _s files is shuffled, which improves performance slightly (not relevant to the problems). But you can use any test you like. There are also 50K-sized test files which I use only for very quick runs.

An example run output is below (the problems themselves are described above):

------------------------------------------------------------------------------------------------------------------------------------------------

1 platform found:

-------------------------------------------------------------------------------

platform 0*:

        name: AMD Accelerated Parallel Processing

        profile: FULL_PROFILE

        version: OpenCL 1.2 AMD-APP (923.1)

        vendor: Advanced Micro Devices, Inc.

        extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

-------------------------------------------------------------------------------

* - Selected

Devices of type GPU :

-------------------------------------------------------------------------------

0*      Cypress

1       Tahiti

-------------------------------------------------------------------------------

* - Selected

Device 0 log:

"/tmp/OCLyYW4br.cl", line 150: warning: null (zero) character in input line

          ignored

  ^

Warning: galaxyz kernel has register spilling. Lower performance is expected.

../data/50k.txt contains 50000 lines

     first item: 52.660000 10.900000

      last item: 10.620000 40.070000

../data/50k_r.txt contains 50000 lines

     first item: 76.089050 32.209370

      last item: 80.482910 22.944120

Total time for GalaXYZ input data MPI_Bcast =    0.0 seconds

Real 50000 Sim 50000 Hist 257

Getting total number of worker threads

Total number of worker threads 1

Slave node 0 thread 0 sending 0

Master node 0 waiting

Master node 0 received id 0 thread 0

Master node sending 0 25000 to node 0 thread 0

Slave node 0 thread 0 waiting

Slave node 0 thread 0 received 0 25000

Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

Node/Dev 0/0: Maximum workgroup size 256

Node/Dev 0/0: Using workgroup size 256

Node/Dev 0/0: Using offset 0 global 25088 with vector size 1

Node/Dev 0/0: First kernel processes 25000 lines with localmax 25000, remaining lines 0

Master node 0 waiting

Slave node 0 thread 0 offset 0 length 25000 events 1 time 0.36 seconds

Slave node 0 thread 0 finished 0 25000

Slave node 0 thread 0 sending 0

Slave node 0 thread 0 waiting

Master node 0 received id 0 thread 0

Master node sending 25000 25000 to node 0 thread 0

Master finished. Starting exit procedure...

Slave node 0 thread 0 received 25000 25000

Node/Dev 0/0: Running OpenCL GalaXYZ on Device 0

Node/Dev 0/0: Maximum workgroup size 256

Node/Dev 0/0: Using workgroup size 256

Node/Dev 0/0: Using offset 25000 global 25088 with vector size 1

Node/Dev 0/0: First kernel processes 25000 lines with localmax 50000, remaining lines 0

Slave node 0 thread 0 offset 25000 length 25000 events 1 time 0.23 seconds

Slave node 0 thread 0 finished 25000 25000

Slave node 0 thread 0 sending 0

Slave node 0 thread 0 waiting

Sending exit message to node 0 thread 0

Slave node 0 thread 0 received -1 -1

WALL time for GalaXYZ kernel =    0.6 seconds

MPI WALL time for GalaXYZ kernel =    0.6 seconds

CPU time for GalaXYZ kernel  =    0.6 seconds

Doubling DD angle histogram...,  histogram count = 1422090528

                                 Calculated      = 711020264

                                 >=256           = 538954736

                                 Total           = 1249975000

DR angle                         histogram count = 194504329

                                 Calculated      = 194504329

                                 >=256           = 2305495671

                                 Total           = 2500000000

Doubling RR angle histogram...,  histogram count = 18528234

                                 Calculated      = 9239117

                                 >=256           = 1240735883

                                 Total           = 1249975000

------------------------------------------------------------------------------------------------------------------------------------------------

0 Likes
83 Replies

Can you attach the results you are getting from the 4.3m case? Are you getting similar/same results on consecutive runs? (I guess the results must be the same.)

I attached the output from an Nvidia Tesla card using the same version of the program. I am not able to run it on AMD cards at all anymore. Coincidentally, with the latest drivers I have, it also doesn't work on Cypress and it crashes.

Also, this version is not the fastest version. I have a version with vector elements, which can do the 4.3m case in ~510 seconds on Tahiti. On a Tesla M2050, the same computation takes 4500 seconds (timings are the best timings from the versions optimized for Tahiti and Tesla, respectively).

I didn't attach the vector versions of the code because this version is simpler and probably better for debugging the lockup issue.

Do you have access to a Linux workstation with Tahiti? Or do you see any visible problems in memory allocation etc. in the code? Am I the only person who can't run my own code?

The register spilling message seems to come up on Cypress but not on Tahiti (as far as I can remember)... I wasn't worried about that yet, since it was crashing.

0 Likes

Am I the only person who cant run my own code?

Yes, this is the way the world works.

Do you have access to a Linux workstation with Tahiti? or do you see any visible problems in memory allocation etc. in the code?

I want to put Linux on the system, but that will take a little time. No, the code looks very straightforward and is easy to work with; I can't see it causing the hang.

I will post some huge-data results shortly, after a few more runs. I got your data file, thanks. BTW, I do see one way that the data output can vary, though probably not from run to run; I think you must know this, and it's from troubleshooting. When the data is broken into parts that are not divisible by 256, the kernel's length is rounded up to a multiple of 256. Then on each run (but not the last) the last few work-items will add a little more to the histograms (e.g. chunk size = 25000, kernel work length = 25088, so 88 WIs will add to the histograms). I noticed that 8*1024, 16*1024, and 32*1024 chunks always give the same (and minimum) answers, which are different from the 25000-chunk results. I don't see how that would change from run to run, though.
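For illustration only (this is not the author's kernel code): a guard like the one below at the top of the kernel would keep the padded work-items from touching the histograms. Here 'localmax' stands for offset + chunk length, matching the "localmax" value printed in the run log above; the argument list is hypothetical.

    __kernel void galaxyz_guarded(const uint localmax,
                                  __global uint *histogram /* , ...data arguments... */)
    {
        const uint i = (uint)get_global_id(0);  /* includes the enqueue's global offset     */
        if (i >= localmax)                      /* e.g. the 88 padded WIs of a 25088 launch */
            return;

        /* ... the per-item histogram accumulation for index i goes here ... */
    }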

510 seconds vs 4500 is really impressive, which is why this problem must be solved!!

My guess is the drivers; multiple Tahitis have seemed somewhat troublesome.

0 Likes

Yes, I know about the extra histogram contributions in that test program (due to the quick hack). It originally did all the iterations in one go, so there was no need to check those extra threads hanging around at the end of each run. I have that covered in the MPI versions.

I had to divide the run into smaller pieces (and run with MPI) because I am running the program on a cluster with many nodes; I needed to send jobs piece by piece to the nodes. Furthermore, the problem gets smaller near the end (two of the inner loops start from i+1). I decided to send small pieces to the nodes so the load would be balanced; if I divided the problem evenly by the number of nodes, the last nodes would finish very quickly and sit idle.

drallan wrote:

510 seconds vs 4500 is real impressive, which is why this problem must be solved!!

My guess is the drivers, multiple Tahiti's have seemed somewhat troublesome.

You are right, and we are going to write a paper about this, which will probably benefit AMD also. But if I can't get these programs to run properly, then, well, Nvidia will win.

Since you seem to be interested and very helpful, here are some fun facts for you. The program calculates the two-point angular correlation function. My code is different (I follow a slightly different method), but there is an explanation of it here (the paper is old, but those guys even made code for FPGAs!):

http://www-vm00.ncsa.illinois.edu/~kindr/projects/hpca/files/gpgpu09_presentation.pdf

http://www.ncsa.illinois.edu/~kindr/projects/hpca/files/ECE498AL_problem_statement.pdf

I have now attached float4 and float8 versions of the code; you should be able to compile them also. These do not have any problems and work perfectly on everything from AMD/Nvidia and Intel (I just couldn't get them working on a PlayStation 3). The Nvidia GPUs do not seem to like vectors, so these are not the best codes for them. The float8 & FX-8150 cells are empty because the AMD SDK crashed when generating AVX code with float8. I don't know if or when it will be fixed, but AMD confirmed the problem (it was easy to reproduce, since the kernel even caused KernelAnalyzer to crash).

430937 Lines – Normal (runtimes in seconds)

Kernel                            FX-8150    FX-8150      Cypress  Tahiti  Tesla M2050  GTX580(oc)  GTX680  i7 980       X5650
                                  (AMD SDK)  (Intel SDK)                                                    (Intel SDK)  (AMD SDK)
ocl6_float4_v3_ulong_amd          323.3      325.77       17.43    12.91   17.26        70.74       88.48   244.56       475.48
ocl6_float4_v3_ulong_amd_jancos   839.83     304.37       12.75    6.81    12.2         65.08       89      274.26       308.77
ocl6_float8_v3_ulong_amd_jancos   -          -            11.62    6.75    137.3        81.22       112.23  273.28       377.4

At this point, I also think there are extra problems due to OpenMP. Strangely, nothing I have with OpenMP is working at all; I get wrong results... They used to work earlier. I will have to debug everything again... I will post updates if I find more.

0 Likes

Ah, the Sloan Digital Sky Survey; yes, it does all become clear. 230 million galactic entities, so you need to convolve the entire universe; only Tahiti can do that. The Sloan survey is impressive.

I attached my 'huge' data output. It is slightly different from yours, on the order of about ~1/10000.

I can produce the same differences with almost any alteration of execution order (as you mentioned earlier).

The next three files are:

  1. Tahiti output from the 430K problem.

  2. Same, but using the OpenCL fma instruction in place of the sum x*x + y*y + z*z.

  3. Same, no fma, but reversing the sum order, i.e., z*z + y*y + x*x (only for the DD histogram); again, similar differences.

So it seems these differences are from "binning noise". Still not sure how that would happen between runs.

I will look at the faster vector programs.

At 6.7 seconds, you are probably memory bound by Tahiti's 390 GB/s global bandwidth.

0 Likes

Your outputs all look good. Some strangeness was expected due to the extra threads doing extra calculations at the end of each step because of the quick hack.

From MPI versions and float4/8 versions I am getting exactly 100 times larger values which is perfect.

It is just that Tahiti, and now also Cypress, is crashing on me after several kernel enqueues. hsaigol said he got it working with the 'latest internal drivers'. I would like to get my hands on those latest internal drivers!

0 Likes

drallan, one more thing... is it possible for you to check 'Problem1' from my first post in this thread?

0 Likes

drallan, one more thing... is it possible for you to check 'Problem1' from my first post in this thread?

Hi Yurtesen,

Here's some data that makes me think the time difference may not be due to the OpenCL buffers.

1. I defined 6 ordinary device buffers (flag=0) and manually wrote them to the card before running the kernel, and did not see any difference.

2. Here are three runs of the 430K problem where the only difference is the size of the kernel run (all times are slightly faster because I rearranged the order of memory reads in the kernel; not relevant to this data). I see the same kind of slowdown for 25000 chunks, but the 32K chunks actually run faster! This makes me think the time difference is due to something like memory access patterns, cache, etc.

Kernel size         Run time

  431104               8.0 sec.   baseline, one large single block

  25000                 8.7 sec.   multiple pieces, shows same slowdown as in problem 1

  32768                 7.6 sec.   binary power of 2 happy size

0 Likes

Sorry for the delayed answer; I had to deal with a pile of unnecessary stuff recently.

I get 34 seconds with a single run of 430932, 38.5 seconds when I run with 32768-sized steps, and 36.8 seconds with 25000-sized steps... on Cypress.

From the kernel's point of view, memory access shouldn't be much different from running in one piece, no?

Because, for example, if we run i=1,2,3,4,5 and then i=6,7,8,9,10 (1 to 5, then 6 to 10), compared to i=1,2,3,4,5,6,7,8,9,10, exactly the same operations are done, right? Are the threads started randomly? We are even queuing in order and not doing things like 6,7,8,9,10 and then 1,2,3,4,5.

Each kernel run takes at least 2-3 seconds, which is long enough to even things out. How do you think I can debug this issue? Any pointers? I need to know why the performance is so variable.

Did you also try queuing all the kernels and then flushing the queue, without waiting for each one to finish? (I wonder if the events are somehow adding some delays?)

0 Likes

I also ran Cayman, and 32768 threads is still a little faster than 25000, which is different from your Cypress data. I didn't run the 430K case on Cayman because it times out.

           430K  32768   25000            

Cayman     ---    37.1    40.7

Tahiti     8.0     7.6     8.6

I would assume you're right that these algorithms should run about the same and the memory access patterns should be about the same. Although, 32768 is a real sweet spot for Tahiti's ALUs as long as latency is not a problem.

A lot of things can make small changes, some of which can vary from one machine, OS, or driver to the next. I even saw that dragging the dos prompt to a different monitor is worth about 0.4 seconds.

Did you also try queuing all the kernels and then flushing the queue, without waiting for each one to finish? (I wonder if the events are somehow adding some delays?)

Yes, maybe that gives a good clue: it slows down (when not waiting for each kernel to finish) from 7.6 to 7.9 seconds, so perhaps a larger number of queued threads can be slightly less efficient. Of course, they are wonderful for memory latency.
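(For reference, the two queuing styles being compared are roughly the following; this is only a sketch in the plain C API, with setup and error handling omitted and all names illustrative. The total is assumed to be a multiple of the chunk size.)

    #include <CL/cl.h>

    /* Style A: wait for each chunk to finish before queuing the next one. */
    static void run_waiting_each(cl_command_queue q, cl_kernel k,
                                 size_t total, size_t chunk, size_t local)
    {
        for (size_t off = 0; off < total; off += chunk) {
            clEnqueueNDRangeKernel(q, k, 1, &off, &chunk, &local, 0, NULL, NULL);
            clFinish(q);                      /* block until this chunk is done */
        }
    }

    /* Style B: queue every chunk up front, flush, and wait once at the end. */
    static void run_queue_all_then_flush(cl_command_queue q, cl_kernel k,
                                         size_t total, size_t chunk, size_t local)
    {
        for (size_t off = 0; off < total; off += chunk)
            clEnqueueNDRangeKernel(q, k, 1, &off, &chunk, &local, 0, NULL, NULL);
        clFlush(q);                           /* submit everything to the device */
        clFinish(q);                          /* then wait once at the end       */
    }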

The big question is how can you get your Tahitis running???

drallan

0 Likes

Have you considered whether you might get better speed with, for example, 40000, simply because each run is longer?

Well, I guess it might be a relief that Nvidia does even worse:

25k steps 94.0s

32768 steps 85.8s

430932 step 66.5s

(Queuing everything at once or waiting for each kernel run to finish does not seem to make any difference on Nvidia.)

Tomorrow I might try to run it through the profiler on Nvidia, I guess...

drallan wrote:

The big question is how can you get your Tahitis running???

Good question, but the large data case is now failing on Cypress also. It used to work perfectly fine when I created this thread! Also, hsaigol says he is able to run the program on Ubuntu 10. Maybe the best thing I can do is install Ubuntu 10 on a USB stick and test on that.

By the way, I was wondering how difficult it is to set up Cygwin to compile and run this program? There is so much I can try, and I have so little patience left.

0 Likes

yurtesen, this looks like a great exercise for optimizing OpenCL on GCN. The layout of this thread makes it a bit tricky to work out which is the current suggested experimental code. For running under MPI on two 7970s, should I start with the code and data in your original post, ../eyurtese/amd/galaxyz.tgz and ../eyurtese/amd/galaxy_data.tgz?

In our own codes, we've been having some >fun< with events, queues and timing.  Sorting out your issue may help us all learn a thing or two about GCN.

0 Likes

For debugging, it would be best to look at this code (I posted it to drallan earlier):

http://devgurus.amd.com/servlet/JiveServlet/download/1283848-1936/ocl1_orig_jancos_steps.tgz

It is a quick hack WITHOUT MPI or OpenMP; in the opencl.cpp file I have simply made a loop which enqueues the kernel in pieces with offsets instead of submitting the whole range at once. It crashes on a single card also (at least on my machines). I would be happy to hear whether it works for you or not. The file ./eyurtese/amd/galaxy_data.tgz includes the input data you will need to run the program.

0 Likes

--Have you considered whether you might get better speed with, for example, 40000, simply because each run is longer?

It runs just a tad slower, not much though.

By the way, I was wondering how difficult it is to set up Cygwin to compile and run this program? There is so much I can try, and I have so little patience left.

Cygwin should be fairly easy to install, I believe there is a setup program that downloads and installs everything for you.

I mostly use a bare MinGW installation unless the makefiles require a shell; then I use Cygwin or MSYS.

FWIW, my multi-Tahiti system had problems with driver upgrades for a very long time. I usually run the original, old drivers that came with the cards, or the more recent drivers, which since 8.98.2 seem better. I assume that certain Tahiti configurations were problematic for a while. Is it easy to install drivers on Linux? Perhaps you could try both old and new, and maybe one in the middle. On the other hand, if they all fail the same way, then maybe the problem is elsewhere? Just a thought.

0 Likes

drallan wrote:

  Is it easy to install drivers on Linux? Perhaps you could try both old and new and maybe one in the middle. On the other hand, if they all fail the same way, then maybe the problem is elsewhere? Just a thought.

I just had a blast from the past and installed Ubuntu 10.04 on a USB stick, and I am trying to install the AMD drivers now. I think the driver version is not very relevant, since hsaigol said that he can run it both with the latest drivers and with the internal drivers, which makes me think the problem might have something to do with the kernel version or the X version. I will update the thread after I run tests.

0 Likes

I am currently testing again on 10.04.

I'll get you timing information for the following with the 4.3 million file:

all 4.3 million together

in steps of 25000: 1750s

in steps of 32768

in steps of 430932

in steps of 50000

Here is the data for 430k, since it is much faster and I need to head home now. I will complete the 4.3 million runs later.

in steps of 25000: 17.4s

in steps of 32768: 15.8s

in steps of 50000: 14s

all 430932 together: 10.8s

I don't know exactly how you are timing your code; maybe the execution time is the same but the total program time is different due to some overhead.

The results for this exercise will be on a different SKU of Tahiti with different clocks, so the data will not be an apples-to-apples comparison with yours,

but all my runs will be on the same card/setup, so you can compare them with each other.

Also, I have noticed that uninstalling the drivers is horrible; I just end up reimaging the hard drive every time I have to switch drivers.

So, yurtesen, I would recommend you make a clone of the USB stick after installing Linux on it, so that you can try different drivers without having to go through a reinstall; just reclone with Clonezilla or something similar.

Also note that I installed the following items on my 10.04:

apt-get install mpich2 openmpi-bin openmpi-doc libopenmpi-dev g++
AMD-APP-SDK-v2.7-lnx64.tar --> official website

amd-driver-installer-12-8-x86.x86_64.zip --> offical website

The following change is needed in the Makefile (remove -lmpl):

LD_FLAGS= -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich -lmpl -lgomp

becomes

LD_FLAGS= -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich  -lgomp

0 Likes

yurtesen, read the conclusion in this post; does it align with what you're seeing?

http://devgurus.amd.com/message/1282801#1282801

Also, drallan, how are you getting such amazing speedups when you break up the worksizes?

           430K  32768   25000           

Cayman     ---    37.1    40.7

Tahiti     8.0     7.6     8.6

???


0 Likes

hsaigol wrote:

yurtesen, read the conclusion in this post; does it align with what you're seeing?

http://devgurus.amd.com/message/1282801#1282801

Also, drallan, how are you getting such amazing speedups when you break up the worksizes?

           430K  32768   25000           

Cayman     ---    37.1    40.7

Tahiti     8.0     7.6     8.6

???


I am not sure how to compare; that post seems to refer to a multi-GPU implementation of DGEMM, which requires some transfers from the GPUs, etc. I do not require transfers anywhere between kernel runs.

About drallan's code: I didn't think he should have been able to get 8 seconds with the code he had. It is truly amazing (I only realized yesterday how small his numbers were!). However, he is running a different version from what you have (no MPI or OpenMP). See the ocl1_orig_jancos_steps.tgz file I posted to him.

I will have to run the exact same code that I gave him on Tahiti as well. I will try to test it all today with Ubuntu 10.04 etc., and I will update.

0 Likes

hsaigol wrote:

Also, drallan, how are you getting such amazing speedups when you break up the worksizes?

           430K  32768   25000           

Cayman     ---    37.1    40.7

Tahiti     8.0     7.6     8.6

???


Ah, yes. The original thread was about why the algorithm crashed only on Tahiti, and why it only crashed on the author's machine. Early on, I added a fairly simple optimization that improved the execution of smaller chunks, as I understood the ultimate target was a distributed network using chunks. Those are the numbers I have been reporting. I mentioned this, but not very clearly. Now that the thread is turning towards optimization, it's good you asked the question. My full set of numbers is:

           430K  32768   25000           

Tahiti     7.2     8.4    10.1     Author's original algorithm without MPI (NoMPI)

Tahiti     8.0     7.6     8.6     NoMPI with chunk optimization

Tahiti     8.0     7.3     8.4     NoMPI, same chunk optimization but cleaner

Cayman     ---    37.1    40.7     NoMPI with chunk optimization on Cayman

(Tahiti 1200MHz, Cayman 950MHz)

The optimization combines the loops that calculate the angles, where possible, to prevent re-referencing the same area of global memory; this should be cache friendly for small chunks. Other than that, 32768 is exactly 8 waves, which fully utilizes the CUs without a large number of waiting threads. I think, though, that yurtesen probably has some better versions of the algorithm.
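A structural sketch of that loop-combining idea is below (this is not the actual galaxyz kernel: the angle and histogram math is replaced by a dummy accumulation, and all names are placeholders). The point is purely the memory-access pattern: the original style streams the same sim arrays from global memory twice, while the combined loop reads each element once and feeds both results.

    __kernel void galaxyz_sketch(__global const float *sim_x,
                                 __global const float *sim_y,
                                 __global const float *sim_z,
                                 const uint n_sim,
                                 __global float *out)
    {
        const uint gid = (uint)get_global_id(0);
        float acc_dd = 0.0f, acc_dr = 0.0f;

    #ifdef ORIGINAL_STYLE
        /* two passes: the second loop re-reads exactly the same global memory */
        for (uint j = 0; j < n_sim; j++)
            acc_dd += sim_x[j] + sim_y[j] + sim_z[j];
        for (uint j = 0; j < n_sim; j++)
            acc_dr += sim_x[j] * sim_y[j] * sim_z[j];
    #else
        /* combined: a single read of each element feeds both results */
        for (uint j = 0; j < n_sim; j++) {
            const float sx = sim_x[j], sy = sim_y[j], sz = sim_z[j];
            acc_dd += sx + sy + sz;
            acc_dr += sx * sy * sz;
        }
    #endif

        out[2 * gid]     = acc_dd;
        out[2 * gid + 1] = acc_dr;
    }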

0 Likes

drallan wrote:

I also ran Cayman, and 32768  threads is still a little faster than 25000, which is different from your Cypress data. I didn't run the 430K  on Cayman because it times out.

           430K  32768   25000            

Cayman     ---    37.1    40.7

Tahiti     8.0     7.6     8.6

The big question is how can you get your Tahitis running???

I thought your Tahitis were overclocked to 1.2GHz? I am getting better results with MSI's 1010MHz Tahiti card when the single 430932 step is used (these are from Ubuntu 10.04). Yet my results get slower and slower...

GPU                    430K    50000    32768    25000    5000
Tahiti 1010MHz MSI     7.2s    10.6s    11.2s    13.1s    38.4s

Here are the step sizes and per-kernel run times:

50k case runs 9 times      (10.6 - 7.2 ) / 9  = 0.38s

32k case runs 13 times    (11.2 - 7.2 ) / 13 = 0.31s

25k case runs 18 times    (13.1 - 7.2 ) / 18  = 0.33s

5k case runs 87 times      (38.4 - 7.2 ) / 87 = 0.36s

From these figures, I can say that each kernel run has a ~0.35s delay. This can't be a coincidence, right? (Although I don't know how drallan is getting those crazy results yet.)

0 Likes

From these figures, I can say that each kernel run has a ~0.35s delay. This can't be a coincidence, right? (Although I don't know how drallan is getting those crazy results yet.)

yurtesen, congratulations on isolating the problem.

Crazy numbers: please see my previous post; all my numbers have been for that code.

I also see that your 7.2 at 1010MHz is the same as my 7.2 at 1200MHz, both using the original code! Sigh.

0 Likes

Here is the kernel I have been using, it assumes that 'real' and 'sim' sizes are equal. If not, a slightly more complex structure is needed.

There is a #define to switch back to the original version.

I wonder if this might explain your constant 0.35 second time.

It might relate to extra work for each chunk in the original version that is not done in the optimized version.

Attached is the kernel file; for reference, Tahiti at 1200MHz with 32768 threads takes 7.3 seconds.

0 Likes

drallan wrote:

Here is the kernel I have been using, it assumes that 'real' and 'sim' sizes are equal. If not, a slightly more complex structure is needed.

There is a #define to switch back to the original version.

I wonder if this might explain your constant 0.35 second time.

It might relate to extra work for each chunk in the original version that is not done in the optimized version.

Attached kernel file, for reference, Tahiti, 1200MHz., 32768 thread, is 7.3 seconds.


In reality they are always equal in all my tests, but the sample code I started from had 3 different loops, so I thought I should take care of that.

I guess that even when I run everything in one go there is probably a 0.35s delay when the kernel starts; I am just not able to measure it. I am not sure a simple if statement can cause a 0.35s delay...

I will test your code later (it might take a few days; the machine is unavailable now, and I have to go and boot it into Linux).

0 Likes

With your code, I get 9.9s for 18 enqueues vs 8.8s with a single enqueue. The difference is much smaller, but I don't understand how an if statement can cause this. Is there an explanation? You said "extra work for each chunk", but aren't these kernels run by each thread independently of whether the kernel was queued in a single go or not? Is there any documentation which explains this?

0 Likes

yurtesen wrote:

With your code, I get 9.9s for 18 enqueues vs 8.8s with a single enqueue. The difference is much smaller, but I don't understand how an if statement can cause this. Is there an explanation? You said "extra work for each chunk", but aren't these kernels run by each thread independently of whether the kernel was queued in a single go or not? Is there any documentation which explains this?

When adjusted for 1200/1010 MHz, your numbers are the same as mine, and I re-checked to make sure I didn't scramble anything. So the data looks real. Then you should get about 8.1 sec for the 32768-size run (even more confusing!).

So your question is valid. Whether run in chunks or whole, it seems each work-item would read the same data. Where does the difference come from? I think there are 2 parts to the answer.

In both programs, each WI reads a wide range of memory, which is broken into 2 sets of X, Y, and Z blocks for both the real and sim data. In your case, two of the loops read the same region of memory (in the sim data): one loop reads the entire range, then the next loop goes back and reads the same range again. This is what I meant by extra work. If the loops are combined, this reduces to a single read per WI. Thus the single-loop version should be more efficient in general.

The second part, why this seems to help smaller chunks, probably depends on cache behavior, which can be complex. One point is that the chunk sizes are similar in size to the L1 and L2 caches (roughly about 1/2 MB), so one might expect to see differences. My guess was that a perfect 8-wave 32K sample would be favored. Then of course there is always some question about the OpenCL drivers and interface adding something on top.

0 Likes

I just downloaded AMD's new CodeXL tool and ran profiles that show cache and memory activity for the programs.

Cache seems to be the biggest factor: small 3-loop runs have low cache hit rates, while small 1-loop runs run mostly from the cache. Do you think these low cache hit rates can explain the constant delays?

Name    KernelSize  CacheHits(%) AvgFetchs/WI

---------------------------------------------------  

rg.exe    430000      87.        2585077

r25.exe    25000      62.        2550000

r32.exe    32768      55.        2550000

rog.exe   430000      49.        1939032  oneloop

ro25.exe   25000      99.        1930000    "

ro32.exe   32768      96.        1940000    "

0 Likes

Thanks drallan, I am trying to run CodeXL myself now on Linux...

This could maybe explain the issue somewhat. Perhaps your kernel is affected less since it requires fewer fetches.

I will have to run some tests and get back to you. It is still strange, because I believe the data shouldn't fit in the 2-3MB cache. I will also test breaking up the kernel and making all the loops start from i=0, etc.

Do you know if there is any mechanism in the GPU which can prefetch data? (I know very little about how the caching works on the GPU.) Anyway, I have a lot of things to test now; I will be back when I have enough information to figure this out without any doubts.

0 Likes

Hello drallan, I guess you did it again; I am getting the same results as you (± a few percent). Therefore I agree that cache behavior must be related to this difference in speed.

Although it is very strange to think that it would have this much effect; after all, 0.3 seconds of average extra time per kernel run is quite long. It would take less time to read the GPU memory from start to end... but I am running out of time and this explanation will do fine.

Thanks for your help, and it is amazing that you have dedicated so much time to helping...

0 Likes

Yurtesen,

In the beginning I thought (and maybe you did too) that it must have something to do with the OpenCL drivers, kernel launch, or buffers, which does happen. In the end, though, I don't think so, so it was a great exercise. Your algorithm is a beautiful example for Tahiti, the architecture that will rule the world after the traffic light controller.

0 Likes

Hi hsaigol, after some consideration I think the differences in results are somehow caused by OpenMP. It doesn't even work properly on Nvidia cards now. I recommend trying the non-OpenMP/MPI quick-hack version I posted to the forum. It crashed as well, but the results should be the same after each run with the same workgroup size.

I have installed amd-driver-installer-8.982-x86.x86_64.run, and in my Xorg log I see:

compiled for 1.4.99.906, module version = 8.98.2

I will try to figure out the issue and fix it in the following days. I will let you know whether I can fix it or not.

0 Likes
liwoog
Adept II

I am finally hopeful that my code is running properly. What I learned to make it work:

1) Make sure to use clFlush after all queuing, as the AMD implementation does not seem to allow a kernel's parameters to be changed and the kernel re-queued before the queue is flushed.

2) What was killing me: waiting for events across different queues does not seem to work. I had two queues waiting for events on one another, and the clEnqueueBuffer events did not properly wait.

3) Finally, because of mixed GPU environments, I was allocating a context per GPU instead of a context per platform. The NVIDIA implementation did not care, but the AMD one did.
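For reference, a minimal host-side sketch of points 1 and 3 (plain C API; program/kernel/buffer setup and error handling are omitted, and all names are illustrative):

    #include <CL/cl.h>

    /* One context per *platform*, spanning every GPU on it (point 3), with one
     * command queue per device.  After enqueuing work on a queue, clFlush() that
     * queue before changing kernel arguments and re-queuing (point 1). */
    static cl_context make_platform_context(cl_platform_id platform,
                                            cl_command_queue queues[8],
                                            cl_uint *ndev_out)
    {
        cl_device_id dev[8];
        cl_uint ndev = 0;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, dev, &ndev);

        cl_context ctx = clCreateContext(NULL, ndev, dev, NULL, NULL, NULL);
        for (cl_uint d = 0; d < ndev; d++)
            queues[d] = clCreateCommandQueue(ctx, dev[d], 0, NULL);

        *ndev_out = ndev;
        return ctx;
    }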

0 Likes

Sorry, I didn't get any testing done today; I am totally swamped with work. Maybe tomorrow I'll get some time.

0 Likes

Hi hsaigol, thanks for all your help. Can you tell me which version of the driver you have?

I think I know why you were getting different results. In the opencl.cpp file, near line 356, there was a bug. I am very sorry for that!

The last 2 .enqueueWriteBuffer lines were in the wrong order. I fixed it and updated the link in the first post:

http://users.abo.fi/eyurtese/amd/galaxyz.tgz

I attached a zip file to this post which has the corrected opencl.cpp and the expected output files for the 430k and 4300k cases as well. (The only difference in the output is that the 4300k case has 100-times-larger values.)

Thanks,

Evren

0 Likes

Hi Yurtesen,

The version posted above works, and I get consistent outputs that match your posted results.

To get the program to compile I had to make the following change in the Makefile (this happened for the previous version too):

#LD_FLAGS= -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich -lmpl -lgomp

LD_FLAGS= -Wall -lm -L./lib -lOpenCL -L/usr/lib64/mpich2/lib -lmpich  -lgomp

I'm using driver: 9.01-120904a

Time taken to complete the 4.3 million line file on a Tahiti GHz Edition: 1450 seconds.

How good or bad is that compared to the top Nvidia card you are using?

0 Likes

hsaigol wrote:

The version posted above works and i get consistent outputs that match your posted results.

I'm using driver: 9.01-120904a

Thanks, and sorry for the earlier bug in the program (2 lines caused so much trouble!). So there is probably a bug in the current Linux drivers that I have (btw, Cypress was also crashing).

How can I get hold of 9.01?

Also, one more thing: I used to see a significant time difference between running the program in 50k steps or all at once. Please see 'Problem1' in my first post of the thread. Is this fixed also?

0 Likes

Hi,

Don't worry about the trouble; at least it helped you fix your code, so that's great.

I understand what your question is, but how do I test it? What do I need to modify in the code (better if you provide it) so that I have a version which runs through the whole execution without subdivisions?

If you want me to do the comparison, you have to provide the other version so I can run it and see whether the problem is fixed.

Lastly, I can test this on one of the dual-GPU boards, so if you want to provide a version of the code which splits the subset work onto multiple GPUs, I can try that too.

I'm in the office for another 10-15 minutes, and I'm hoping you can provide the code so I can test it before the weekend.

As for when that driver branch is being released, I'll ask the SW team and get back to you on that. I do want to double-check whether the drivers on the AMD webpage hang for me as well in my setup. Sorry, but I can't provide you with the internal drivers. There is one really neat feature coming, though I don't know if I'm allowed to write about it.

0 Likes

hsaigol wrote:

I understand what your question is, but how do I test it? What do I need to modify in the code (better if you provide it) so that I have a version which runs through the whole execution without subdivisions?

If you want me to do the comparison, you have to provide the other version so I can run it and see whether the problem is fixed.

Lastly, I can test this on one of the dual-GPU boards, so if you want to provide a version of the code which splits the subset work onto multiple GPUs, I can try that too.

Actually, in defs.h there is a WORKSIZE define; you simply set it to the number of lines in the input and then recompile, for example 430932.

When running, the program will send a single piece of work and then realize that there is nothing more. So you will have a version which enqueues the kernel only once, with minimal change to the code (so perhaps it is easier to see what is going on). Under normal operation there are some idle threads in each kernel run (due to rounding the size up to a multiple of the worksize), but that shouldn't cause a very big performance difference, obviously.
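For example (a hypothetical defs.h excerpt; the exact form of the define in the distributed source may differ):

    /* With WORKSIZE equal to the full line count, the whole problem is sent as a
     * single piece of work, so the kernel is enqueued exactly once. */
    #define WORKSIZE 430932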

I don't understand how you are able to run the program without it crashing. I think it is now more or less certain that there is not a bug in the program itself... But can you test it with Ubuntu 12.04? Ubuntu 10.04 is EOL next year (the desktop version; the server version ends in 2015, but still, nobody will install 10.04 today when setting up new systems). Isn't it logical to test on 12.04 also?

https://wiki.ubuntu.com/Releases

0 Likes
hsaigol
Adept III

Hi Yurtesen,

I have some good news (bad news for you, I'm guessing).

The news is that I was able to run your new program "http://users.abo.fi/eyurtese/amd/galaxyz.tgz" without any hangs with the 4.3 million line input, using graphics drivers from the official AMD website:

driver: 8.982-120727a-144949C-ATI

OS: Ubuntu 10.04

Kernel: 2.6.32-33-generic x86_64

On a clean install of Ubuntu, the following additional libraries/programs were installed:

apt-get install mpich2

apt-get install openmpi-bin openmpi-doc libopenmpi-dev

apt-get install g++       (v4.4)

After this I installed the AMD APP SDK,

set up the paths,

and ran your program, and it completed without any issue.

I even scoped the voltage rails on the board while the app was running, to try to capture the failing moment, but it never failed.

It took me 1492 seconds to complete the program, during which time the display, console and mouse were almost completely non-responsive. So I just waited and let the system run, since I knew the completion time from before, and voila, after ~1500 seconds the program was complete and everything returned to normal.

First things first... I ran the program from USB with Ubuntu 10.04, and it indeed does run without crashing. Therefore I feel that there is a driver bug in AMD's drivers which causes problems on Ubuntu 12.04. It might be due to the different compiler (4.4 vs 4.7) or kernel (2.6 vs 3.2). I used the 12.8 drivers on both systems. The question is: will AMD try to fix this? And if yes, how can I help?

I will get back to you about the performance results. One thing at a time!

0 Likes
liwoog
Adept II

Just to say that all my code is now running fine on the HD 7970, and it runs 2x faster than on the GTX 680. We have now installed 40 cards in 10 machines.

My program is also working now; I think it was a driver issue at some point. But I got a performance hit on some of my programs with the 13.1 drivers, and 12.10 works much better.

Also, I realized that Catalyst does not seem to update the OpenCL runtime version properly. I am not sure if it affects anything, but by removing Catalyst, installing the APP SDK, and then re-installing Catalyst, I get OpenCL 1.2 AMD-APP (1113.2); otherwise Catalyst does not seem to update the version somehow.

0 Likes