I want to develop an efficient BWT implementation on Radeon GPGPU. For that I need an efficient sorting procedure. I've found Bitonic Sort and Radix Sort suitable to GPGPU.
Unfortunately it seems that both OpenCL implementations are slower than CUDA ones and Radeons are slower than GeForces at sorting:
I'm thinking about some hybrid of Radix Sorting and Bitonic Sorting. What are your suggestions? Where should I start (re)searching?
And another question. Radeon's Stream Processors are grouped to 4 or 5-elements wide units. Are they able to load uncontiguous data from registers? Ie. one SP reads or writes some 32-bit value, in parallel second SP reads/ writes another 32-bit value that doesn't have adjacent address, in parallel third, fourth (and maybe fifth) SP from unit does the same thing.
Or, to put it in another perspective, would that code be executed in one instruction and memory operation would be executed paralelly?
a += array[x];
b += array[y];
c += array[z];
d += array[w];
Or another code:
What level of memory parallelism should I expect?