It looks like we are thinking about the same issues for our applications of interest. You probably have some experience already, whereas I am still deciding whether I need to get a board myself. I hope you can give me some help.
I am thinking about running some parallel programs on the processor. These programs are of medium complexity and involve both computation and some logical operations. (BTW, each of them will generate a random number in each iteration.) The processor has 800 cores, but I don't need to run 800 programs; maybe 20 or so would be more appropriate. My main question is: compared to a conventional processor with a clock of, say, 2 GHz, how does this processor perform when running the same program on one core?
Thanks in advance.
Hi, I'm not an expert, but I'll try to help by giving my opinion:
1. It depends on the code. Although branching hurts performance, if all threads do approximately the same amount of work you shouldn't worry too much.
You'll also need to find some way to use SIMD vectors as much as possible to obtain better performance.
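To make the branching point a bit more concrete, here is a rough toy model in Python (not Brook+ code, and not a measurement of any real card). It assumes threads run in lockstep batches and that a batch pays for every branch path at least one of its threads takes. The batch width of 64 and the path costs of 10 and 30 are made-up numbers, and the names batch_cost and total_cost are only for illustration.

```python
# Toy model of branch divergence under lockstep (SIMD-style) execution.
# Assumption: threads are grouped into fixed-size batches, and a batch pays
# for every branch path that at least one of its threads takes.

def batch_cost(takes_branch, cost_if, cost_else):
    """Cost of one batch: sum of the costs of the paths actually taken."""
    cost = 0
    if any(takes_branch):
        cost += cost_if      # at least one thread runs the "if" path
    if not all(takes_branch):
        cost += cost_else    # at least one thread runs the "else" path
    return cost

def total_cost(flags, batch_size, cost_if, cost_else):
    """Split per-thread branch flags into batches and add up the batch costs."""
    return sum(
        batch_cost(flags[i:i + batch_size], cost_if, cost_else)
        for i in range(0, len(flags), batch_size)
    )

BATCH = 64  # hypothetical batch width, chosen only for the example

uniform = [True] * 64 + [False] * 64       # each batch agrees on the branch
mixed = [i % 2 == 0 for i in range(128)]   # every batch takes both paths

uniform_cost = total_cost(uniform, BATCH, cost_if=10, cost_else=30)
mixed_cost = total_cost(mixed, BATCH, cost_if=10, cost_else=30)
print(uniform_cost, mixed_cost)  # prints "40 80"
```

Both inputs have the same number of threads on each path, but in this model the mixed case costs twice as much because every batch has to execute both paths. That is the sense in which it helps when neighbouring threads do roughly the same work.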
2. Kernel invocations are slow, and the same goes for each streamRead / streamWrite you do. If your problem fits in the card's memory, try to launch everything at once.
Also note that the processor has 800 cores, but typically you'll need many more tasks (> 80000) to get good performance, as the cores are able to process several threads
in order to hide memory and arithmetic latencies. If you think you only need about 20 threads, a multiprocessor computer using OpenMP could be better suited for it.
However, I think it's quite strange to have such a small population...
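If you do go the CPU route with about 20 tasks, the structure could look roughly like the sketch below. I've written it in Python with a thread pool as a stand-in for OpenMP threads, so it only shows the layout, not the performance: Python threads will not speed up CPU-bound work the way OpenMP threads in C would. The iteration count, the run_task function and the choice to seed each generator with the task id are my own assumptions for the example. The point is that each task keeps its own private random generator, so the tasks never share state.

```python
import random
from concurrent.futures import ThreadPoolExecutor

N_TASKS = 20         # the task count mentioned in the question
N_ITERATIONS = 1000  # hypothetical iteration count

def run_task(task_id):
    """One independent task: draws one random number per iteration."""
    rng = random.Random(task_id)  # private generator, seeded per task (assumption)
    total = 0.0
    for _ in range(N_ITERATIONS):
        total += rng.random()     # stand-in for the real per-iteration work
    return total

# Run all tasks concurrently; pool.map returns results in task order.
with ThreadPoolExecutor(max_workers=N_TASKS) as pool:
    results = list(pool.map(run_task, range(N_TASKS)))

print(len(results))  # prints "20"
```

Because every task owns its generator and its seed, the results are reproducible and do not depend on how the tasks get scheduled. Seeding with consecutive integers is a simplification; for serious statistics you'd want to check that the streams are independent enough.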
Maybe you'll find it interesting to play around with the Brook+ samples, changing the number of iterations and the size of the input problems to see the behaviour of
the GPU. They're in the samples folder of the Brook+ installation directory.