Archives Discussions

spectral · ‎08-09-2011

Hi,

I'm currently working on the CLPP library (http://code.google.com/p/clpp/) and mainly on the CPU implementation of the 'sort' algorithm.

I currently have a 'special' sort algorithm for the CPU only but after launching the benchmark I have

see that the std::sort algorithm is faster than the OpenCL one.

So, it simply mean that it remain some work to have a really fast sort on the CPU. I would like to know if the AMD Engineers has some councils to give, some papers to reference etc...

An important part would be use this algorithm on the Fusion APU too !

Thanks

Krys

LeeHowes · ‎08-09-2011

I'm sure you meant AMD... 😉

Is there something that makes you feel that std::sort uses a particularly bad algorithm such that you are likely able to do better? Or is there a particular reason why you want to re-implement algorithms for the CPU?

I'm sure there must be lots of papers on efficient CPU sorting, I don't see any reason to think that OpenCL changes that. Even if we supported vectorisation across work items, which we hopefully will at some point, I don't think that would help for SSE/AVX because sort would be scatter/gather bound and you'd end up losing too much efficiency from the vectors. With LNI and maybe future AVX that's a different story and with work-item packing you might want to look at a more GPU-oriented algorithm to efficiently use the vectors.

spectral · ‎08-10-2011

Hi Lee,

Often we have the data on the 'OpenCL' side, if I want to use std::sort the I must :

1 - read data from cl_mem
2 - use std::sort
3 - write back the sorted datas

So, sorting done in OpenCL is better for OpenCL software !

I don't know how the std::sort work but I all I want is to have maximum performance in my sorting algorithm !

Also, I have see some libraries (intel performance primitives) that have a very fast sorting algorithm, but it is not OpenCL !!!!

Anyway sorting is a very used algoritm for a lot of OpenCL software... so it will be usefull to work on this. Take a look at CUDA, today they deliver CUDA with Thrust... and Thrust is developed in house because it is the only way to provide all the "building blocks" for applications.

LeeHowes · ‎08-10-2011

Oh I agree, there needs to be a good set of OpenCL libraries.

What I meant for CPU sorting, though, was that if there is nothing wrong with what std::sort does you might as well implement the same algorithm. It's probably some combination of quicksort locally and mergesort globally or something along those lines given papers that I have seen.

Let me go back a step, looking at clEnqueueNativeKernel:

http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clEnqueueNativeKernel.html

The spec says:

The data pointed to by args and cb_args bytes in size will be copied and a pointer to this copied region will be passed to user_func. The copy needs to be done because the memory objects (cl_mem values) that args may contain need to be modified and replaced by appropriate pointers to global memory. When clEnqueueNativeKernel returns, the memory region pointed to by args can be reused by the application.

And:

A pointer to appropriate locations that args points to where memory object handles (cl_mem values) are stored. Before the user function is executed, the memory object handles are replaced by pointers to global memory.

So I think that while args will be copied, that may just be the pointers. The cl_mem objects aren't guaranteed to be copied - just guaranteed to be synchronised correctly (I presume, given the rest of the spec with queues and so on and given the x86 memory model). So if you pass your cl_mem objects to a native function, they will be passed in as pointers to your function and you can call std::sort (or ACML sort or any of the sorts Intel releases) on the host and reuse the optimisations other people have done.

Archives Discussions

CPU sort algorithm