Hi all. I've been putting together a big book-length online GPU computing tutorial. It's at http://www.moderngpu.com/
The content is very scan/reduction-centric, and the tutorials are built around some advanced case studies. I find that most GPU docs are pretty deficient in covering the very interesting and essential theory of inter-thread communication. Internalizing scan helps you tackle a much wider variety of problems without losing your mind in details.
The first case study is a radix sort I wrote:
The code sorts 1.31 billion 32-bit keys per second on a GTX 570. (!) It runs 30-40% faster than the current B40C development code (655 in the SVN). It's 60% faster than thrust::sort for 32-bit keys. I don't know of any pure radix sort that is faster. The code was written to be easily documented, and I cover pretty much every line from base principles, and show radix sort as an exercise in designing sophisticated hierarchical scans. It follows pretty closely the Satish/Harris/Garland radix sort. I note where my implementation deviates significantly from Satish, as that is a good reference. The code is relatively simple (compared to what is possible) and uses no templates . It's a basic count-histogram-sort implementation with five kernels total (the histogram has kernels for upsweep, reduce, downsweep).
The code is currently target for Fermi architecture with CUDA. Most of the content should be portable to other platforms.
I have been looking around for a good OpenCL/AMD section to add to the tutorial after the radix sort and sparse matrix (sparse is still under construction). I might try porting the sort to OpenCL/AMD (I did actually start GPGPU on a 5850 before moving to a GTX 570), but I am not sure what I'm up against. Is there a list of the current performance leaders for basic CS algorithms (sort, sparse matrix, dense matrix, fft, mersenne twister, etc) for CL/AMD? I'd really like to know what throughputs need to be met to get a best-in-class algorithm.
I don't think that current VLIW arch is going to be able to beat Fermi's scalar arch for most of these scan-heavy algorithms, although I could still be surprised. I am looking forward with great anticipation to Graphic Core Next/Southern Islands, and think that a sort of 2.5billion keys/sec on the 2048-ALU 7970 is achievable. The thing (if the rumors are true) is looking like a beast.
Also to the GPU computing team at AMD, I really hope that when GCN devices come out you provide an assembler and ISA documents for it. I've got some language design ideas I've been pushing around and I think SI would be an awesome platform.