
dnorric
Journeyman III

n-body particles and fps

Hi, I have a 4570 and am not at all impressed with the n-body performance. The particle count seems limited to 1024 before the fps gets too slow. How many particles can the 5870 do at a reasonable fps?

Cheers

Damian

0 Likes
39 Replies
eduardoschardong
Journeyman III

Well... it depends a lot on what is too slow for you and what's impressive... For me it was OK (HD5k) up to 8192, but I have no parameters for calling it impressive or not. What are your parameters?

0 Likes

I would like to do at least 100,000 particles at 30 fps. I have seen a demo from NVIDIA on YouTube doing 1 million at 30 fps; this would be ideal. Is this feasible with ATI, and with what rig?

0 Likes

Was the demo related to this: http://progrape.jp/cs/ ?

That work won a 2009 Gordon Bell Prize! The YouTube video I saw was a demo on a single PC that was part of this. The case was open and a desk fan was blowing air into the computer to keep it cool. It was really funny.

I've never worked with N-body problems but did skim a paper about this work. The techniques required are non-trivial. I can well imagine that using more theory and better computer science would give increases of multiple orders of magnitude in performance compared with a naive or unoptimized implementation. It's not quite fair to compare an advanced research code (is it publicly available? probably not) with an SDK sample demo.

The other issue is that if there are space/time tradeoffs in N-body simulation, this is a legitimate weakness for ATI GPUs right now: there is an OpenCL driver limitation on using the full device memory. I believe that if you are using CAL IL/ISA, the full capability is available.

0 Likes

I have never seen that work before. I am doing a bachelor of physics at the moment and would like to model some ideas. I don't have all the info or knowledge on how to do it yet. I just want to know if it's computationally feasible.

At the moment my 4570, using the n-body sample from the ATI SDK, is quite slow at the default particle count, which I think is 1024. I need a minimum number of particles, say around 20,000, preferably more, at a frame rate of 30 fps. This requirement is just for the initial stage of the project, which will more than likely become a lot more computationally intensive.

I start holidays tomorrow with a 5-week break, so I will be whetting my appetite with OpenCL part of the time on my 4570. I will have to drip-feed it time after that.

Anyhow, could anyone have a guess as to what kind of rig I would need?

Cheers

Damian

0 Likes

Originally posted by: dnorric I would like to do at least 100,000 particles at 30 fps. I have seen a demo from NVIDIA on YouTube doing 1 million at 30 fps; this would be ideal. Is this feasible with ATI, and with what rig?

But that is not going to happen with a simple n-body algorithm using exact force calculations. You would need a better-scaling algorithm, like a tree method or something like that.

The reason is that you need N² force calculations for N particles. That means for 100,000 particles this already equals 10,000,000,000 (10 billion) force calculations per timestep. Even with the simplest potential (soft core) and the lowest flop count (20 or 22 flops; generally one attributes more to it because of the inverse square root), you need more than 200 GFlop per timestep. For 30 fps one would need more than 6 TFlop/s (realistically something like 10 TFlop/s). So maybe a quad-CrossFire system of HD5870s may get close.
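Just to make the arithmetic explicit, here is a quick sanity check of those numbers (my own sketch, not code from this thread, using the conventional 20-flop count per interaction):

#include <stdio.h>

/* Back-of-the-envelope check of the brute-force n-body cost quoted above. */
int main(void)
{
    double n = 100000.0;          /* particles */
    double flops_per_pair = 20.0; /* soft-core potential, rsqrt counted as 20 flops */
    double fps = 30.0;            /* target frame rate */

    double pairs = n * n;                      /* exact O(N^2) force pairs per step */
    double flop_per_step = pairs * flops_per_pair;
    double flops_needed = flop_per_step * fps; /* sustained rate required */

    printf("pairs per step:   %.3e\n", pairs);          /* 1e10 */
    printf("flop per step:    %.3e\n", flop_per_step);  /* 2e11 = 200 GFlop */
    printf("flop/s at 30fps:  %.3e\n", flops_needed);   /* 6e12 = 6 TFlop/s */
    return 0;
}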

I'm really happy with the performance of my n-body code on an HD5850 (using "expensive" error-function potentials). It is currently roughly as fast as 128 Nehalem cores at 3 GHz (there is also a parallelized CPU version running on an HPC cluster).

0 Likes

Originally posted by: Gipsel So maybe a quad-CrossFire system of HD5870s may get close.


With the CAL++ code it would require almost exactly three 5870s to achieve 30 fps with 100k bodies. There is also the option of using two 5890s.

0 Likes

Originally posted by: hazeman
Originally posted by: Gipsel So maybe a quad-CrossFire system of HD5870s may get close.

With the CAL++ code it would require almost exactly three 5870s to achieve 30 fps with 100k bodies. There is also the option of using two 5890s.

But you have some additional overhead, because you have to copy the particle data between all the GPUs each timestep.

0 Likes

Originally posted by: Gipsel

But you have some additional overhead, because you have to copy the particle data between all the GPUs each timestep.

True, but it's insignificant (less than 10%).

0 Likes

I'm assuming you are getting more than the theoretical peak performance because you ASSUME that the square root should be counted as 20 FLOP even though it doesn't take you 20 FLOP to calculate it.

It's been a while since I've read that Japanese paper; are you sure that's what they used? My memory wants to tell me no.

0 Likes

Originally posted by: ryta1203 I'm assuming you are getting more than the theoretical peak performance because you ASSUME that the square root should be counted as 20 FLOP even though it doesn't take you 20 FLOP to calculate it.

It's been a while since I've read that Japanese paper; are you sure that's what they used? My memory wants to tell me no.

I'll answer this way 

With N = 226,304, our optimized brute force method took roughly 2 seconds on one RV770 running at 750MHz. As far as we know, the performance we obtained is fastest ever with one GPU chip in April 2009.


And here is my result for a 4770 (for the same problem size):

execution time 1592.27 ms, classic GFLOPS 1255.63, modern GFLOPS 660.86


So yes, they use 38 flops per force calculation.

0 Likes

Originally posted by: hazeman
Originally posted by: Gipsel

But you have some additional overhead, because you have to copy the particle data between all the GPUs each timestep.

True, but it's insignificant (less than 10%).

Also, you want to copy 100,000 particle records between 3 GPUs in 0.003 seconds? Remember, he wanted 30 fps. That may be possible, but I think it gets close.

I have no direct comparison, as I track far more data per particle than would be needed by the most basic version: distance to the nearest neighbour, index of the nearest neighbour, particle type and charge (I have electrons and different ions, so one needs mass and charge), the potential at the particle's position, and so on. That's needed because I also have a laser field and need to simulate ionization events during each time step (which means I need to modify the number of particles and their properties each time step). Furthermore, I store important quantities like position and speed in two floats for each component, as numerical stability is of utmost importance (next to speed). The benchmark is a double precision implementation after all, and the task was to reach the necessary precision level without paying the performance penalty of actually using double precision.

To sum it up, for my amount of data per particle it would be impossible: with 100,000 particles, 3 GPUs, and 10% of the 0.03 s per time step allowed for data shuffling, I would need about 12 GB/s of usable PCI-Express bandwidth, which simply isn't there.

0 Likes

For the n-body problem you need to transfer 100K * 32 B per frame per GPU. So for 30 fps the required transfer rate is 96 MB/s. As I've written before, it's insignificant.

0 Likes

Originally posted by: hazeman For the n-body problem you need to transfer 100K * 32 B per frame per GPU. So for 30 fps the required transfer rate is 96 MB/s. As I've written before, it's insignificant.


And I have mentioned that my version is significantly more complicated and that I need much more data per particle.

Actually, you have to copy (read and write) the data. And as you wanted to invest less than 10% of the time in transferring data (which can't happen in parallel to the calculations, or you have to split your kernel calls into smaller parts, adding even more overhead), that figure already amounts to ~1 GB/s read and 1 GB/s write bandwidth used simultaneously (or 2 GB/s if reads and writes cannot be overlapped and the data is buffered in host memory). Have you measured what bandwidth one gets for calls transferring only a small amount of data? From my experience, there is quite some overhead, so one doesn't get close to the peak bandwidth.

But fortunately, in a simple n-body simulation the particle velocities can stay on their respective GPU (that wouldn't be possible in my case); you only have to distribute the positions. That would reduce the amount of copied data to 16 bytes per particle. And with N GPUs each GPU only has to receive (N-1)/N of the particle positions, but one needs to do (N-1)*N copy calls in total, which means the overhead will outweigh the transfers quite fast.

For 3 GPUs it would be 3.2 MB to read and 3.2 MB to write, split into 6 individual copy calls. One can reduce that even more by buffering and concatenating the data in host memory. The absolute minimum would be 3 reads to host memory of 1/3 of the positions from each GPU (1.6 MB read in total) and 3 writes of 2/3 of the positions to each GPU (3.2 MB written in total).
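To make the bookkeeping above concrete, here is the same arithmetic as a small sketch (my own illustration, assuming float4 positions, i.e. 16 bytes per particle):

#include <stdio.h>

/* Transfer volumes for exchanging positions between GPUs each timestep. */
int main(void)
{
    const double particles = 100000.0;
    const double bytes_per_pos = 16.0; /* float4 position */
    const int gpus = 3;

    /* Direct exchange: each GPU fetches the (N-1)/N share computed elsewhere. */
    double per_gpu = particles * bytes_per_pos * (gpus - 1) / gpus;
    printf("read per GPU:    %.2f MB\n", per_gpu / 1e6);        /* ~1.07 MB */
    printf("total exchanged: %.2f MB\n", per_gpu * gpus / 1e6); /* ~3.2 MB */
    printf("copy calls:      %d\n", (gpus - 1) * gpus);         /* 6 */

    /* Staging through host memory: each GPU uploads its 1/N share once,
       then downloads the concatenated remainder once. */
    double host_reads  = particles * bytes_per_pos;              /* 1.6 MB total */
    double host_writes = particles * bytes_per_pos * (gpus - 1); /* 3.2 MB total */
    printf("host reads: %.2f MB, host writes: %.2f MB\n",
           host_reads / 1e6, host_writes / 1e6);
    return 0;
}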

I don't know how severe the calling overhead for transfers or kernel calls is when done directly in CAL, but with the high-level alternatives it appears to be quite severe for small transfers/kernels. Once you are down to a sub-millisecond time for a call, it won't get much faster if you reduce the processed or transferred amount further. The minimum latency appears to be a few hundred µs, which means that with 6 calls you already get close to the "allowed" 0.003 seconds without having done anything yet.

0 Likes

The code for the n-body problem is available in the example section of the CAL++ library.

0 Likes
hazeman
Adept II

Originally posted by: dnorric Hi, I have a 4570 and am not at all impressed with the n-body performance. The particle count seems limited to 1024 before the fps gets too slow.


You should remember that the SDK samples were created to help others write simple programs in OpenCL.

Looking at the n-body code, I can say that from a GPU optimisation point of view it's simply baaaaad. I guess it's 10-100x slower than it would be after ATI-specific optimisations.

A few of the main problems I've noticed:

1. Usage of local memory - especially for the 4xxx family it's bad (no real local memory). But even for 5xxx, using images and the TU cache would be much more efficient (with work scheduled for max cache reuse).

1a. Using barriers in the code - on 4xxx these are sloow. Again, using images would allow removing them.

2. Bad choice of work items - grouping 5 of the current work items into one would give much better instruction slot filling.

3. Loop unrolling - ATI's OpenCL compiler doesn't do loop unrolling; especially for n-body this is really important (it could also help with point 2).

4. Using OpenCL - the current OpenCL compiler isn't of the best quality; using a CAL/IL-based approach gives significant improvements (imho best to use CAL++).

Removing those problems would give a huge speed-up (a rough sketch of points 2 and 3 follows below). But this is probably only a start.
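To illustrate points 2 and 3 concretely, here is a minimal, hypothetical OpenCL sketch (my own illustration, not the SDK sample and not hazeman's code; the kernel name and the 4-bodies-per-item choice are made up for the example): each work item integrates four target bodies, which helps fill the 5-wide VLIW slots, and the inner loop over source bodies is unrolled by hand.

/* Hypothetical OpenCL C sketch of points 2 and 3. Positions are float4
   with the mass in .w. Assumes n is even and the global work size is n/4. */
inline float4 body_body(float4 pi, float4 pj, float4 ai, float eps2)
{
    float4 r = pj - pi;
    float d2 = r.x * r.x + r.y * r.y + r.z * r.z + eps2;
    float inv = native_rsqrt(d2);
    float s = pj.w * inv * inv * inv;  /* m_j / d^3 */
    ai.xyz += r.xyz * s;
    return ai;
}

__kernel void nbody_forces(__global const float4 *pos,
                           __global float4 *acc,
                           const int n, const float eps2)
{
    int gid = get_global_id(0);

    /* Point 2: four target bodies per work item. */
    float4 p0 = pos[4 * gid + 0], p1 = pos[4 * gid + 1];
    float4 p2 = pos[4 * gid + 2], p3 = pos[4 * gid + 3];
    float4 a0 = (float4)(0.0f), a1 = (float4)(0.0f);
    float4 a2 = (float4)(0.0f), a3 = (float4)(0.0f);

    /* Point 3: unroll the source loop by hand, two sources per pass. */
    for (int j = 0; j < n; j += 2) {
        float4 q0 = pos[j], q1 = pos[j + 1];
        a0 = body_body(p0, q0, a0, eps2); a0 = body_body(p0, q1, a0, eps2);
        a1 = body_body(p1, q0, a1, eps2); a1 = body_body(p1, q1, a1, eps2);
        a2 = body_body(p2, q0, a2, eps2); a2 = body_body(p2, q1, a2, eps2);
        a3 = body_body(p3, q0, a3, eps2); a3 = body_body(p3, q1, a3, eps2);
    }

    acc[4 * gid + 0] = a0; acc[4 * gid + 1] = a1;
    acc[4 * gid + 2] = a2; acc[4 * gid + 3] = a3;
}

(Point 1, replacing local memory with image reads through the texture cache, would change pos to an image2d_t with read_imagef calls; it's left out to keep the sketch short.)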

 

0 Likes

Thanks heaps hazeman, that's a great relief. After your optimizations it sounds like I would be in the particle and fps ballparks.

Cheers

Damian

0 Likes

dnorric,
Another issue is that the 4570 is a low-end product and not our highest-performing chip; the highest-performing single-chip card (the 4890) has greater than 10x the theoretical peak of the 4570. You can view a good comparison of the various cards here:
http://en.wikipedia.org/wiki/C..._.28HD_4xxx.29_series

Most likely the YouTube demo was using the highest-performing card NVIDIA has to offer.

0 Likes

There was a Japanese paper on the gravitational many-body problem that was getting ~1 TFLOPs on the RV770.

0 Likes

To second Hazeman's suggestions, don't be overly guided by kernel designs running on NVIDIA hardware. The optimization principles can be very different. We've seen this with matrix multiplication - the fast Volkov memory-buffer kernels for CUDA look very different from the image kernels that run fast on ATI hardware.

What I would do is start with something simple that works (but is slow) and then iteratively refine it. My experience is that image-based kernel designs can be less painful than memory buffers; they're not quite as demanding. With buffers, you have to manage the memory hierarchy manually from global to local to registers and do lots of complex array-subscript arithmetic. It's like solving brain teasers.

0 Likes

I have been browsing the forum, and it seems that any time a performance comparison between NVIDIA and ATI comes up with NVIDIA on top, it's blamed on the code not being optimized.

Sure, optimization is ideal; however, if an unoptimized piece of code runs faster on NVIDIA than it does on ATI, that suggests that NVIDIA hardware has something over the ATI hardware.

0 Likes

Originally posted by: dnorric Sure, optimization is ideal; however, if an unoptimized piece of code runs faster on NVIDIA than it does on ATI, that suggests that NVIDIA hardware has something over the ATI hardware.


In a way you are right. The design of NVIDIA GPUs makes it much easier to achieve peak performance (smaller warp/wavefront size, better local memory, scalar vs 5-wide ops). On the other hand, ATI cards have much higher peak performance (over 2x), but it's a real pain in the ... to squeeze it out (and sometimes simply not possible).

One example is pyrit, where a GTX 480 does 28k passwords/s (~90% efficiency) and a 5870 does 82k passwords/s (also ~90%). And the kernel for the ATI card is quite different from the one for the GTX.

0 Likes

Originally posted by: ryta1203 There was a Japanese paper on the gravitational many-body problem that was getting ~1 TFLOPs on the RV770.

It uses CAL:

http://galaxy.u-aizu.ac.jp/trac/note/wiki/Astronomical_Many_Body_Simulations_On_RV770

0 Likes

Originally posted by: moozoo
Originally posted by: ryta1203 There was a Japanese paper on the gravitational many-body problem that was getting ~1 TFLOPs on the RV770.

It uses CAL:

http://galaxy.u-aizu.ac.jp/trac/note/wiki/Astronomical_Many_Body_Simulations_On_RV770

That's correct. It perfectly shows what can be obtained on these cards when good code is written.

0 Likes

I don't seem to be able to download the code. Has anyone had any success?

Cheers

Damian

0 Likes

Managed to download the binary from somewhere else. Very nice. I have a 4570. How much faster is the 5870?

Cheers

Damian

0 Likes

Originally posted by: dnorric Managed to download the binary from somewhere else. Very nice. I have a 4570. How much faster is the 5870?

Cheers

Damian

http://galaxy.u-aizu.ac.jp/trac/note/wiki/Tests_With_RV870

0 Likes

So basically the 4570 is 5 times slower than the 5870?

0 Likes

Originally posted by: dnorric So basically the 4570 is 5 times slower than the 5870?

It's much more...

0 Likes

Hazeman: ATI's OpenCL compiler is only good for playtime. For example, matrixmul in CAL++ is doing 1.6 TFLOPs vs OpenCL's 1 TFLOPs (on a 5870). So for any advanced code, throwing away OpenCL is the easiest optimisation step (it gives 20-30% without much thinking).

This is of course true, just as C is faster than Java and Java is faster than Python. But many applications are limited by I/O in some way - database, disk, or network. That's why the tradeoff of productivity or lower maintenance costs with a higher-level language can make sense. The real-world system performance may not be very different.

Some problems with very high arithmetic intensity, like N-body problems and (I think) what pyrit does, benefit a lot from CAL++. They are compute bound: there is a small amount of data and a huge amount of computation.

Matrix multiplication, the core level-3 BLAS kernel, has only moderate arithmetic intensity. Data is O(n*n) while computation is somewhere between quadratic and cubic (depending on which theoretical result you use). So the gain in performance with problem size is basically linear. This is exactly what I saw when I tried the simple case of including PCIe bus data transfer in benchmarks: http://golem5.org/gatlas/bench_sgemm/bench_sgemm_pcie.html . Sustained throughput is determined by the effective memory bandwidth between the host and the GPU device. So a faster PCIe bus, or better blocking to use more device memory and minimize transfers, matters more than absolute kernel speed.
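A quick back-of-the-envelope comparison of the two cases (my own sketch; it uses the standard 2n^3 flop count for SGEMM and the 38-flop interaction count used earlier in this thread):

#include <stdio.h>

/* Flops per byte moved over the bus: SGEMM vs brute-force n-body. */
int main(void)
{
    double n = 4096.0;

    /* SGEMM: three n*n float matrices over the bus, 2n^3 flops. */
    double gemm_bytes = 3.0 * n * n * 4.0;
    double gemm_flops = 2.0 * n * n * n;
    printf("SGEMM flops/byte:  %.0f\n", gemm_flops / gemm_bytes);  /* ~683 */

    /* n-body: one float4 per particle over the bus, 38*n^2 flops. */
    double body_bytes = n * 16.0;
    double body_flops = 38.0 * n * n;
    printf("n-body flops/byte: %.0f\n", body_flops / body_bytes);  /* ~9728 */
    return 0;
}

Both ratios grow linearly with n, but at any practical size the n-body kernel does over an order of magnitude more work per transferred byte, which is why it gains so much more from a faster kernel.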

For some problems, CAL++ will crush OpenCL. That will not change; it's just like how C/C++ can crush Java. But I believe there are also many problems where the balance of costs favours OpenCL. It depends on the problem.

0 Likes

It certainly also depends on the compiler, and the current AMD GPU OpenCL compiler is not very good.

0 Likes

After looking at the n-body example and posting optimization tips here, I had a feeling that it should be possible to squeeze all the juice out of an ATI card.

My first step was to run some numbers:

- a 5870 can do 850M * 1600 ops/s,

- computing the interaction between two bodies takes 13 ops,

- we need to transfer 16 B per interaction.

So a 5870 can compute 104G interactions per second and would need to transfer 1673 GB/s.

We see that our problem is memory transfer. The requirement is over 10x higher than what's available (~150 GB/s).

We notice that for the n-body problem, if we group k particles, then we can read the data for that group once. This reduces the transfer k times.

To decrease the transfer from global memory, the L1 cache can be used. We load the data for the whole work group once. After setting the work group size to 256, the transfer from global memory is limited to ~0.5 GB/s (not bad).

But this doesn't solve our problem. The L1 cache has an aggregate bandwidth of 1 TB/s - and this is still not enough.

Most developers forget that there is one important difference between NVIDIA and ATI cards: ATI cards have a much bigger register file. And these registers can be used as the ultimate cache !!!

So the next step is to group particles in registers. The register space per work item is much smaller than the cache, so we can group only 8 particles in one work item. This reduces the transfer requirement on the L1 cache from 1.6 TB/s to 210 GB/s. Finally we are home and can start coding (a schematic version of this loop structure is sketched below).
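Here is the loop structure that calculation describes, written as schematic C (my own sketch for illustration; the real kernel is CAL++ IL, and R, the helper names, and the softening constant are made up): each "work item" keeps R = 8 target particles in registers and streams the source particles through the cache, so every cached position is reused R times.

#include <math.h>
#include <stdio.h>

#define R 8 /* targets held "in registers" per work item */

typedef struct { float x, y, z, m; } body;

/* Register-blocked force loop: one cached read of pos[j], R uses. */
static void forces_blocked(const body *pos, float (*acc)[3], int n)
{
    for (int i = 0; i < n; i += R) {           /* one "work item" */
        body p[R];
        float a[R][3] = {{0.0f}};
        for (int r = 0; r < R; ++r) p[r] = pos[i + r];

        for (int j = 0; j < n; ++j) {
            body q = pos[j];
            for (int r = 0; r < R; ++r) {
                float dx = q.x - p[r].x, dy = q.y - p[r].y, dz = q.z - p[r].z;
                float d2 = dx * dx + dy * dy + dz * dz + 1e-9f; /* softening */
                float inv = 1.0f / sqrtf(d2);
                float s = q.m * inv * inv * inv;
                a[r][0] += dx * s; a[r][1] += dy * s; a[r][2] += dz * s;
            }
        }
        for (int r = 0; r < R; ++r)
            for (int k = 0; k < 3; ++k) acc[i + r][k] = a[r][k];
    }
}

int main(void) /* tiny smoke test; n must be a multiple of R */
{
    body pos[16] = {{0}};
    float acc[16][3];
    for (int i = 0; i < 16; ++i) { pos[i].x = (float)i; pos[i].m = 1.0f; }
    forces_blocked(pos, acc, 16);
    printf("a[0] = (%g, %g, %g)\n", acc[0][0], acc[0][1], acc[0][2]);
    return 0;
}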

I've written a kernel in CAL++ based on these calculations.

Using the classical* flop count for the force calculation (38 flops), a 4770 does 1250 GFlop/s - that's 90% peak ops efficiency. Another 4% goes to the necessary memory index calculations, giving a total of 94% used only for computations. For an ATI card and the IL compiler, this is reaching the GPU's limit.

Here are estimated values for other cards.

4870 - 1580 GFLOPs (vs the 1 TFLOPs achieved by the Japanese CAL version)

5870 - 3580 GFLOPs.

Over the weekend I'll post the code in the example section of CAL++ library.

* In most n-body papers rsqrt is counted as 20 flops. To compare our results with other solutions we need to use this value.

PS. On 4xxx the OpenCL n-body sample is 20x slower than the described method. I think this is a good example of how badly one can write a kernel for a GPU.

0 Likes

I did a similar optimisation exercise for N-body:

http://forum.beyond3d.com/showthread.php?p=1282242#post1282242

8 bodies, with loop unrolling. Once it has been vectorised to avoid bandwidth bottlenecks, the only remaining optimisation is to use the symmetry of the force calculations, i.e. the force between any pair of bodies needs to be calculated only once, not twice.
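For reference, the symmetry in question is just Newton's third law, which halves the number of unique pair evaluations (a standard result, stated here in LaTeX for clarity):

\[
\mathbf{F}_{ji} = -\mathbf{F}_{ij}
\quad\Longrightarrow\quad
\text{pair evaluations} = \binom{N}{2} = \frac{N(N-1)}{2} \approx \frac{N^2}{2}.
\]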

0 Likes

Originally posted by: Jawed I did a similar optimisation exercise for N-body:

http://forum.beyond3d.com/showthread.php?p=1282242#post1282242

I think I've been able to squeeze the GPU more tightly (660 GFLOPs using the 20-flop count on the slower 4770).

8 bodies, with loop unrolling. Once it has been vectorised to avoid bandwidth bottlenecks, the only remaining optimisation is to use the symmetry of the force calculations, i.e. the force between any pair of bodies needs to be calculated only once, not twice.


I doubt it's possible to use the symmetry efficiently on a GPU. It would require transferring data computed in one part of the device to another part. The only way to do that is through global memory - and writing on the GPU is much slower (and doesn't go through the cache).

There is an option to use the symmetry within each SIMD (using LDS). In theory it could give as much as 2.5% on a 5870 card. This is so small that it would be eaten up by the less efficient code.

I think that losing the symmetry is simply the price we have to pay for using a fast GPU.

0 Likes

hazeman,
What is the performance of running the nvidia kernel on ATI hardware versus running the ATI kernel on nvidia hardware?
0 Likes

Originally posted by: MicahVillmow hazeman, What is the performance of running the nvidia kernel on ATI hardware versus running the ATI kernel on nvidia hardware?


The exact data you ask for aren't really available, but I can give you partial data that people have reported.

Almost the same kernel, but CUDA vs OpenCL:

GTX 480 - 28K pass/s (CUDA)

5870 - 41K pass/s (OpenCL)

Now, the high-performance kernel for ATI is written in CAL++ (a library for easy IL coding). The 5870 does 82K pass/s. There are a few factors involved in the performance improvement over the OpenCL version:

- using bitalign for the cyclic shift (~30-40%), which isn't available on NVIDIA

- changing the kernel design (~30-40%) - the changes were made mostly to improve instruction slot occupancy

- removing the OpenCL overhead

- some changes in the pyrit main code.

So as you can see, it's not really possible to directly run the ATI-optimised kernel on NVIDIA's hardware. Nonetheless, it's possible to convert this kernel to CUDA (with the exception of the cyclic shift). You could sponsor the pyrit developers to do it.

There is a chance that the NVIDIA card would run out of registers with such a kernel (which could lead to performance degradation). Otherwise it should run at the same speed as the current CUDA kernel.

0 Likes

hazeman,
have you looked at the amd_media_ops extension for OpenCL? This gives you access to bitalign instructions.
0 Likes

Originally posted by: MicahVillmow hazeman, have you looked at the amd_media_ops extension for OpenCL? This gives you access to bitalign instructions.


What for? ATI's OpenCL compiler is only good for playtime. For example, matrixmul in CAL++ is doing 1.6 TFLOPs vs OpenCL's 1 TFLOPs (on a 5870). So for any advanced code, throwing away OpenCL is the easiest optimisation step (it gives 20-30% without much thinking).

And by the way, OpenCL has a rotate function - so instead of forcing developers to use an extension, you should optimize the compiler!!!
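For reference, the portable form looks like this (a minimal sketch of my own; whether a given compiler maps it to a single bitalign instruction is exactly the issue raised above):

/* OpenCL C: the standard rotate() built-in vs the manual shift-or form.
   On hardware with bitalign, a good compiler should turn either into one
   instruction. Illustration only; assumes 0 < n < 32 for the manual form. */
__kernel void rot_demo(__global uint *x, const uint n)
{
    size_t i = get_global_id(0);
    uint v = x[i];

    uint a = rotate(v, n);                /* portable built-in */
    uint b = (v << n) | (v >> (32u - n)); /* hand-written cyclic shift */

    x[i] = a ^ b;                         /* 0 everywhere if they agree */
}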

 

0 Likes

hazeman,
Thank you for bringing rotate to my attention. If there are no spec issues then I'll get this mapping to be correct on the hardware that supports it.
0 Likes

Hazeman,
This has been added to the compiler for the next release. If you have any other suggestions for improving the compiler stack that we have missed, please let us know.
0 Likes