Archives Discussions

jtelf2 · ‎12-17-2008

Having read through the documentation and had a play with Stream, I've got a couple of questions that aren't really covered in the documentation as far as I can see. My application is an evolutionary simulation (GA), so although the questions are general they've got that context.

1) To what extent is Stream intended to be used for process-heavy code rather than pure calculations? For example, my GA has some code that involves processes that branch, but that could be run against each individual in the population in parallel. Is the branching likely to wipe out the advantages of parallel processing, leaving me better off doing that off the GPU and only accelerating the calculation-heavy functions?

2) Each individual in my population is composed of a set of values whose fitness can be calculated independently. The number of values per individual is low (approx 300) but there are a large number of individuals (thousands). Is it likely to be faster to:

a) send a 3d stream of all values for all individuals to the GPU at once

b) loop over the individuals and send a 2d stream for each individually

If the answer is 'it depends', what does it depend on and how might I estimate the best performing approach in advance?

Thanks in advance for any replies.

twinclouds · ‎12-20-2008

Hi,

It looks like we are thinking about the same issues for our applications of interest. However, you probably have some experience already, whereas I am still deciding whether to get a board myself. I hope you can give me some help.

I am thinking about running some parallel programs on the processor. These programs are of medium complexity and involve both computation and some logical operations. (BTW, each of them will generate a random number each iteration.) The processor has 800 cores but I don't need to run 800 programs; maybe 20 or so would be more appropriate. My main question is: compared to a conventional processor with a clock of, say, 2 GHz, how does this processor perform when running the same program on one core?

Thanks in advance.

Ceq · ‎12-21-2008

Hi, I'm not an expert, but I'll try to help by giving my opinion:

1. It depends on the code. Although branching hurts performance, if all threads do approximately the same amount of work you shouldn't worry too much.
You'll also need to find some way to use SIMD vectors as much as possible to obtain better performance.
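
To make the branching point a bit more concrete, here is a rough sketch of a simple lockstep model, not a measurement of any real card. It assumes that threads are grouped into wavefronts that execute together, so a wavefront pays for every branch path that at least one of its threads takes. The function name, the wavefront width of 64 and the cost numbers are all assumptions chosen for illustration.

```python
def wavefront_cost(branch_taken, cost_if, cost_else):
    """Cost (arbitrary units) of one wavefront under a simple lockstep model.

    Assumption: all threads in the wavefront run in lockstep, so the whole
    wavefront pays for a path if at least one of its threads takes it.
    """
    cost = 0
    if any(branch_taken):       # someone takes the "if" path
        cost += cost_if
    if not all(branch_taken):   # someone takes the "else" path
        cost += cost_else
    return cost


WAVEFRONT = 64  # assumed wavefront width; check your hardware's documentation

# Every thread takes the same path: only that path is paid for.
uniform = [True] * WAVEFRONT
# Threads alternate between paths: both paths are paid for.
mixed = [i % 2 == 0 for i in range(WAVEFRONT)]

uniform_cost = wavefront_cost(uniform, cost_if=10, cost_else=30)
mixed_cost = wavefront_cost(mixed, cost_if=10, cost_else=30)
print(uniform_cost, mixed_cost)
```

Under this model the penalty comes from neighbouring threads disagreeing about which path to take, not from the branch itself, so arranging individuals so that similar ones sit next to each other may help.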

2. Kernel invocations are slow, and the same goes for each streamRead / streamWrite you do. If your problem fits in the card's memory, try to launch everything at once.
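
As a way of estimating this in advance, a back-of-the-envelope model like the one below might help. It is only a sketch: it treats total time as a fixed overhead per launch (kernel call plus streamRead / streamWrite) plus a per-element cost. The overhead and per-element figures are hypothetical placeholders, and 5000 stands in for "thousands" of individuals; you would replace them with numbers measured on your own card.

```python
def total_time_ns(n_launches, overhead_ns, n_elements, per_element_ns):
    """Estimated total time: fixed cost per launch plus cost per element."""
    return n_launches * overhead_ns + n_elements * per_element_ns


individuals = 5000            # assumed; the question only says "thousands"
values_per_individual = 300   # from the question
overhead_ns = 100_000         # hypothetical per-launch overhead (transfer + kernel call)
per_element_ns = 10           # hypothetical per-value compute cost

n_elements = individuals * values_per_individual

# (a) everything in one stream, one launch
one_launch = total_time_ns(1, overhead_ns, n_elements, per_element_ns)
# (b) one stream and one launch per individual
per_individual = total_time_ns(individuals, overhead_ns, n_elements, per_element_ns)

print(one_launch, per_individual)
```

With these placeholder numbers the per-individual loop is dominated by launch overhead. The model is optimistic for option (b), since it assumes a small launch uses the GPU as efficiently as a large one.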

Also note that the processor has 800 cores, but typically you'll need many more tasks (> 80000) to get good performance, as the cores are able to process several threads
in order to hide memory and arithmetic latencies. If you think you only need about 20 threads, a multiprocessor computer using OpenMP could be better suited for it.
However, I think it's quite strange to have such a small population...
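
A quick sanity check along these lines is to count how many independent work items each launch would contain and compare that with the rough 80000 figure above. The sketch below does just that; the threshold is only the rule of thumb from this post, and the population size of 5000 is an assumed stand-in for "thousands".

```python
MIN_WORK_ITEMS = 80_000  # rule of thumb from this post, not a hardware constant


def enough_parallelism(n_work_items, threshold=MIN_WORK_ITEMS):
    """True if a launch has more work items than the rule-of-thumb threshold."""
    return n_work_items > threshold


values_per_individual = 300   # from the first question
individuals = 5000            # assumed; the question only says "thousands"

launch_sizes = {
    "20 independent programs": 20,
    "one individual per launch": values_per_individual,
    "whole population in one launch": individuals * values_per_individual,
}

results = {name: enough_parallelism(n) for name, n in launch_sizes.items()}
for name, ok in results.items():
    print(f"{name}: {launch_sizes[name]} work items, enough = {ok}")
```

By this crude measure only the whole-population launch gives the GPU enough threads to hide latency.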

Maybe you'll find it interesting to play around with the Brook+ samples, changing the number of iterations and the size of the input problems to see the behaviour of
the GPU. They're in the samples folder of the Brook+ installation directory.

A couple of conceptual and performance questions