I'm sure you meant AMD... ;)
Is there something that makes you feel that std::sort uses a particularly bad algorithm such that you are likely able to do better? Or is there a particular reason why you want to re-implement algorithms for the CPU?
I'm sure there must be lots of papers on efficient CPU sorting, I don't see any reason to think that OpenCL changes that. Even if we supported vectorisation across work items, which we hopefully will at some point, I don't think that would help for SSE/AVX because sort would be scatter/gather bound and you'd end up losing too much efficiency from the vectors. With LNI and maybe future AVX that's a different story and with work-item packing you might want to look at a more GPU-oriented algorithm to efficiently use the vectors.
Often we have the data on the 'OpenCL' side, if I want to use std::sort the I must :
1 - read data from cl_mem
2 - use std::sort
3 - write back the sorted datas
So, sorting done in OpenCL is better for OpenCL software !
I don't know how the std::sort work but I all I want is to have maximum performance in my sorting algorithm !
Also, I have see some libraries (intel performance primitives) that have a very fast sorting algorithm, but it is not OpenCL !!!!
Anyway sorting is a very used algoritm for a lot of OpenCL software... so it will be usefull to work on this. Take a look at CUDA, today they deliver CUDA with Thrust... and Thrust is developed in house because it is the only way to provide all the "building blocks" for applications.
Oh I agree, there needs to be a good set of OpenCL libraries.
What I meant for CPU sorting, though, was that if there is nothing wrong with what std::sort does you might as well implement the same algorithm. It's probably some combination of quicksort locally and mergesort globally or something along those lines given papers that I have seen.
Let me go back a step, looking at clEnqueueNativeKernel:
The spec says:
The data pointed to by
cb_argsbytes in size will be copied and a pointer to this copied region will be passed to
user_func. The copy needs to be done because the memory objects (
argsmay contain need to be modified and replaced by appropriate pointers to global memory. When
clEnqueueNativeKernelreturns, the memory region pointed to by
argscan be reused by the application.
A pointer to appropriate locations that
argspoints to where memory object handles (
cl_memvalues) are stored. Before the user function is executed, the memory object handles are replaced by pointers to global memory.
So I think that while args will be copied, that may just be the pointers. The cl_mem objects aren't guaranteed to be copied - just guaranteed to be synchronised correctly (I presume, given the rest of the spec with queues and so on and given the x86 memory model). So if you pass your cl_mem objects to a native function, they will be passed in as pointers to your function and you can call std::sort (or ACML sort or any of the sorts Intel releases) on the host and reuse the optimisations other people have done.