Hi everybody,
On the CAL/CUDA front, I've been attempting to port my CUDA Cholesky factorization program to AMD cards with CAL/IL, hoping for a significant performance improvement, especially in double precision. As it turns out it has taken significantly longer to get working, and so far the performance is not competitive: my CUDA routine does 100 GFLOP/s SP and 40 GFLOP/s DP on a GTX260, whereas the best IL version so far does 30 GFLOP/s SP on a 1GB 4870, and I haven't even tried DP.

The algorithm requires a large (but not prohibitive) amount of memory access, both read and write, and I think the key problem is that AMD have not released enough information to date on how to access memory efficiently and reuse it fully (via the caches), both within and across wavefronts, which makes an informed implementation impossible. Nvidia cards, with their straightforward uncached global memory and register-quick shared memory, seem to "just work" much better in this regard. My overall impression so far is that the AMD setup is "fussier", so it is much harder to achieve really good performance. (Take global buffers, for example: you can only have one; a kernel can't in general read from it efficiently; under 64-bit Linux at least, it has to be at least 256MB smaller than the memory on the card; and you can't easily access it from the CPU if it's larger than 255MB. With CUDA I can simply declare and use any number of arrays, and a single one can reach to within about 50MB of the total memory of the card.)

This is not to say I haven't had problems with Nvidia: I have seen significant (factor of 2) slowdowns at certain matrix sizes, related (I think) to the undocumented way in which memory is split between the multiple 64-bit channels of their cards, and also a significant (50%) improvement from pre-"blocking" the matrix in global memory, perhaps related to undocumented paging effects.
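In case it helps anyone, here is a minimal sketch of the kind of pre-blocking I mean, as CUDA host code. It is illustrative only: the names are made up, and I assume a column-major n x n matrix with a block size b that divides n.

#include <cuda_runtime.h>
#include <stdlib.h>

/* Repack a column-major n x n matrix into contiguous b x b tiles,
   stored column-major within each tile, tiles in column-major order.
   Assumes b divides n exactly. */
static void block_matrix(const double *A, double *Ablk, int n, int b)
{
    int nb = n / b;
    for (int J = 0; J < nb; ++J)         /* tile column */
        for (int I = 0; I < nb; ++I) {   /* tile row */
            double *dst = Ablk + (size_t)(J * nb + I) * b * b;
            for (int j = 0; j < b; ++j)
                for (int i = 0; i < b; ++i)
                    dst[j * b + i] = A[(size_t)(J * b + j) * n + (I * b + i)];
        }
}

/* Upload the blocked copy; with CUDA this is just another cudaMalloc,
   alongside however many other device arrays the factorization needs. */
static double *upload_blocked(const double *A, int n, int b)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *host = (double *)malloc(bytes);
    double *dev = NULL;
    block_matrix(A, host, n, b);
    cudaMalloc((void **)&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    free(host);
    return dev;
}

The point is simply that each b x b tile a kernel touches then sits in one contiguous stretch of global memory, which is presumably why it interacts better with whatever paging or channel-interleaving scheme the hardware uses.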
In fact I think both companies need to provide much more optimization information to make GPUs worthwhile even as part of a supercomputer. To beat a modern multi-socket, multicore CPU node (especially one exploiting a vendor-supplied library...), a GPU has to be running close to optimally, and that is very hard to achieve at present for anything but a "pure streaming" application.
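To put rough numbers on that (back-of-envelope, with assumed figures rather than measurements): a dual-socket quad-core node at 3 GHz, sustaining 4 double-precision flops per core per cycle with SSE2, peaks at 2 x 4 x 4 x 3 = 96 GFLOP/s, and a vendor BLAS will deliver a large fraction of that on the matrix-multiply-dominated parts of Cholesky. My 40 GFLOP/s DP on the GTX260 doesn't leave much margin against that.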
From some of Micah's comments in other threads, though, I am hopeful for significant documentation and compiler improvements shortly, so I plan to try the CAL/IL option again then!
Best,
Steven.