Recently, I started to analysing the execution time of different algorithms on GPGPU. But I found the results of sample "BitonicSort" in SDK are much larger than the results I saw from some papers and documents. For instance, the kernel time of sorting a array with lenth 8388608 is 27s on GPU, and 46s on CPU(speedup is just approximately 2), but only 8s on STL. I was so confused that paralell sorting on GPU is such slower than STL. Could anyone tell me why? Bty, I also don't understand why the complexity of this sample, which is mentioned in the sample's doc, is O( N * log2(N) * log2(N) ), not ( log2(N) * log2(N) ). And I'll be very happy if someone can provide me a better design of Bitonic Sort.
My computer is a laptop with Intel Core i7 and ATI Mobility Radeon HD 5870.
The OpenCL SDK samples are not, for the most part, optimized for performance and that is certainly true of the sorting. Do not use them as best case examples for how to do sorting on the GPU. We have a better performance sort that Takahiro Harada and I put together which did not make it into the book but is at the Heterogeneous Computing with OpenCL web site at: http://www.heterogeneouscompute.org/?page_id=7 It's not updated for latest hardware, but not too far off.
As for complexity, I'm not sure off the top of my head, I'd have to think the algorithm through.