Hi all. I've been putting together a big book-length online GPU computing tutorial. It's at http://www.moderngpu.com/
The content is very scan/reduction-centric, and the tutorials are built around some advanced case studies. I find that most GPU docs are pretty deficient in covering the very interesting and essential theory of inter-thread communication. Internalizing scan helps you tackle a much wider variety of problems without losing your mind in details.
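Since so much of the material hangs on scan, here is a minimal sequential sketch of the exclusive variant for readers new to the primitive (the GPU versions in the tutorial are hierarchical and parallel, but compute the same function):

```c
#include <assert.h>

/* Exclusive prefix sum (scan): out[i] holds the sum of in[0..i-1].
 * out[0] is the identity (0); the grand total is sum after the loop. */
void exclusive_scan(const int *in, int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        out[i] = sum;   /* value accumulated before in[i] */
        sum += in[i];
    }
}
```

For example, the input {3, 1, 7, 0, 4, 1, 6, 3} scans to {0, 3, 4, 11, 11, 15, 16, 22}.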
The first case study is a radix sort I wrote:
The code sorts 1.31 billion 32-bit keys per second on a GTX 570. (!) It runs 30-40% faster than the current B40C development code (655 in the SVN) and 60% faster than thrust::sort for 32-bit keys. I don't know of any pure radix sort that is faster. The code was written to be easily documented: I cover pretty much every line from first principles and present radix sort as an exercise in designing sophisticated hierarchical scans. It follows the Satish/Harris/Garland radix sort fairly closely, and I note where my implementation deviates significantly from Satish, as that paper is a good reference. The code is relatively simple (compared to what is possible) and uses no templates. It's a basic count-histogram-sort implementation with five kernels total (the histogram has kernels for upsweep, reduce, and downsweep).
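For anyone who hasn't seen LSD radix sort before, the count-histogram-sort shape of a single digit pass can be sketched sequentially. This is only the structure; the actual kernel decomposition, digit width, and names in my code differ, and the names below are mine:

```c
#include <assert.h>
#include <stdint.h>

#define RADIX_BITS 4
#define RADIX (1 << RADIX_BITS)

/* One pass of LSD radix sort over one 4-bit digit, mirroring the
 * count -> histogram (scan) -> sort (scatter) phases. Stable, so
 * repeated passes from least to most significant digit sort fully. */
void radix_pass(const uint32_t *in, uint32_t *out, int n, int shift)
{
    int count[RADIX] = { 0 };

    /* Count: histogram the digit of each key. */
    for (int i = 0; i < n; ++i)
        ++count[(in[i] >> shift) & (RADIX - 1)];

    /* Histogram: exclusive scan of counts gives each digit's base offset. */
    int offset[RADIX], sum = 0;
    for (int d = 0; d < RADIX; ++d) {
        offset[d] = sum;
        sum += count[d];
    }

    /* Sort: scatter each key to the next free slot for its digit. */
    for (int i = 0; i < n; ++i)
        out[offset[(in[i] >> shift) & (RADIX - 1)]++] = in[i];
}
```

Eight such passes (shift = 0, 4, ..., 28), ping-ponging between two buffers, sort full 32-bit keys.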
The code currently targets the Fermi architecture with CUDA, but most of the content should be portable to other platforms.
I have been looking around for a good OpenCL/AMD section to add to the tutorial after the radix sort and sparse matrix chapters (sparse is still under construction). I might try porting the sort to OpenCL/AMD (I actually started GPGPU on a 5850 before moving to a GTX 570), but I'm not sure what I'm up against. Is there a list of the current performance leaders for basic CS algorithms (sort, sparse matrix, dense matrix, FFT, Mersenne twister, etc.) on CL/AMD? I'd really like to know what throughputs need to be met for a best-in-class algorithm.
I don't think the current VLIW architecture is going to beat Fermi's scalar architecture for most of these scan-heavy algorithms, although I could still be surprised. I am looking forward with great anticipation to Graphics Core Next/Southern Islands, and think a sort rate of 2.5 billion keys/sec on the 2048-ALU 7970 is achievable. The thing (if the rumors are true) is looking like a beast.
Also to the GPU computing team at AMD, I really hope that when GCN devices come out you provide an assembler and ISA documents for it. I've got some language design ideas I've been pushing around and I think SI would be an awesome platform.
I'm not keen on your hardware discussion (I wrote a whole book chapter on the subject) and think it's dangerous to explain it that way and to describe CUDA "threads" as legitimately MIMD, given the chance of deadlock that arises from doing so. It strikes me as much better to educate people that it's a vector architecture, because that's what all your scan optimisations rely on anyway, and this is how we always approached sort and scan both here and at NVIDIA.
However, if you're getting that kind of sort speed you have done a tremendous job on optimising it. My colleague and I will be pushing out a sort in the near future (I've been meaning to do this for a long time and have said so repeatedly on the forums... lack of time, unfortunately). We will obviously look at it again for the next gen. It is fairly fast on Cayman, but not quite up to your speed. On the next gen... well, watch this space 😉 We'll see what we can do. It should certainly be fast so good luck optimising for it once you get a card.
I agree with you on MIMD/SIMD, but find the SIMD term already heavily overloaded. Internally, both GPU vendors implement SIMD, but the software interface appears MIMD, in contrast to something like Larrabee, which both is implemented as and looks like SIMD. GPUs can accurately be called SIMT, to indicate their latency-hiding potential, but this doesn't convey that their SIMT threads are also cooperative, and so doesn't distinguish them from earlier SIMT machines like the UltraSPARC T1. But yes, I think I have (and NVIDIA too) abused the MIMD term to distinguish GPUs from SSE-like SIMD.
One of my objections to the design of CUDA and OpenCL is that it is very MIMD. There's no language mechanism for indicating whether a certain branch is warp-divergent, uniform over warps, or uniform over the entire block, even though some ISAs expose different implementations for all three of those cases (e.g., CUDA often generates a branch plus predication to cover all bases).
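To pin down the three cases, here is a CPU sketch that classifies a per-thread predicate over one block, assuming a 32-wide warp and a 256-thread block. The enum and helper are hypothetical, purely for illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define WARP_SIZE 32
#define BLOCK_SIZE 256

/* The three branch behaviors a compiler could specialize for:
 * divergent within some warp, uniform per warp but varying across
 * warps, or uniform over the whole block. */
typedef enum { WARP_DIVERGENT, WARP_UNIFORM, BLOCK_UNIFORM } BranchKind;

BranchKind classify(const bool pred[BLOCK_SIZE])
{
    /* Any lane disagreeing with lane 0 of its warp => divergent:
     * the hardware must predicate or serialize both paths. */
    for (int w = 0; w < BLOCK_SIZE / WARP_SIZE; ++w)
        for (int i = 1; i < WARP_SIZE; ++i)
            if (pred[w * WARP_SIZE + i] != pred[w * WARP_SIZE])
                return WARP_DIVERGENT;

    /* Uniform within each warp, but warps disagree => a plain
     * branch per warp suffices. */
    for (int t = 0; t < BLOCK_SIZE; ++t)
        if (pred[t] != pred[0])
            return WARP_UNIFORM;

    /* Every thread agrees => one branch decision for the block. */
    return BLOCK_UNIFORM;
}
```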
I tried to be clear that the GPU has a SIMD implementation and appears MIMD to software, but I will rewrite some of the introductory material to make that distinction better.
Thanks for your comments. Also with respect to sort/scan, I've found the prmt instruction (gather any four bytes from two 32-bit operands) to be of incredible utility when packing multiple sequences into registers to accelerate scans. I could probably credit this with adding 20% throughput to the sort. If SI/GCN has this instruction it will be unstoppable with GPGPU, since it's likely to have an ALU count advantage over Kepler.
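For readers who haven't met prmt, its default mode can be emulated on the CPU roughly as below. This sketch follows my reading of the PTX description; the sign-replication modes and the special prmt variants are omitted:

```c
#include <assert.h>
#include <stdint.h>

/* CPU emulation of the default mode of PTX prmt.b32: treat {a, b} as
 * an 8-byte pool (a supplies bytes 0-3, b supplies bytes 4-7) and
 * gather any four of those bytes, one per low nibble of the selector.
 * Bit 3 of each nibble (sign/replicate mode) is ignored here. */
uint32_t prmt(uint32_t a, uint32_t b, uint32_t sel)
{
    uint8_t pool[8];
    for (int i = 0; i < 4; ++i) {
        pool[i] = (uint8_t)(a >> (8 * i));
        pool[i + 4] = (uint8_t)(b >> (8 * i));
    }
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t byte_sel = (sel >> (4 * i)) & 7; /* pick byte 0-7 */
        r |= (uint32_t)pool[byte_sel] << (8 * i);
    }
    return r;
}
```

For example, selector 0x7531 gathers the odd-numbered bytes of the pool into one register, which is the kind of repacking that makes it so useful for multi-sequence scans.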
It is true that the AMD/NV ILs are lane-wise and LNI isn't - I think that's independent of the high level language, though. I can see why you would complain about CUDA/OpenCL sharing that design but I don't think there's a significant problem for compiling OpenCL to LNI or a vector high level to AMD IL or PTX (with full pointer and unstructured flow support, of course).
Use of the prmt instruction is interesting. I shall have to look at what we do have in the instruction set for the next gen and how that kind of thing could be used. Anyway, good work for now! Hopefully you can do something similar for AMD soon.
I'm reading the Scans and Radix Sort chapters. It's very good learning material. Even though I've never had an NVIDIA card, I liked the way you optimized out those @!P0 instructions.
Just wanted to ask: have you (or LeeHowes) optimized the radix sort for GCN? If so, what times did you get?