I'm currently working on the CLPP library (http://code.google.com/p/clpp/) and mainly on the CPU implementation of the 'sort' algorithm.
I currently have a 'special' sort algorithm for the CPU only, but after running the benchmark I have
seen that std::sort is faster than the OpenCL one.
So it simply means that some work remains to get a really fast sort on the CPU. I would like to know whether the AMD engineers have any advice to give, papers to reference, etc.
An important goal would be to use this algorithm on the Fusion APU too!
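
For context, here is a minimal sketch of how a std::sort baseline of this kind could be timed on the CPU. It is not the actual CLPP benchmark code: the helper names `make_random_keys` and `time_std_sort_ms` are hypothetical, and the choice of 32-bit unsigned keys is an assumption made only for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not part of CLPP): build `count` pseudo-random 32-bit keys.
std::vector<std::uint32_t> make_random_keys(std::size_t count, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::vector<std::uint32_t> keys(count);
    for (auto& k : keys) {
        k = static_cast<std::uint32_t>(rng());
    }
    return keys;
}

// Hypothetical helper (not part of CLPP): sort the keys in place with std::sort
// and return the elapsed wall-clock time in milliseconds.
double time_std_sort_ms(std::vector<std::uint32_t>& keys) {
    const auto start = std::chrono::steady_clock::now();
    std::sort(keys.begin(), keys.end());
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_std_sort_ms` on a fresh copy of the same input that is given to the OpenCL sort, and repeating the measurement a few times, should give a comparable CPU reference time.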