Dear Himanshu,
Sorry for replying this late to a year old question. I appreciate you taking time for the solution, but I solved it in a different way. My way might be a little faster, but I have to benchmark.
The reason I needed the unique member thing is coz I was trying to devise a way to classify stuff using an integer key. I could straightforwardly sort it, but I found a better solution.
What I am doing is this:
I take the key of the first element of the array. Then, I run a comparison and store the result. ie, if it is equal to the first element, I store 1. Esle I store 0. Now, I do a parallel scan on the result. This is really fast in CUDA using population count, but we dont have that feature in Opencl. I also do a parallel scan inverting the bits of the result.
Now, using the result of the first scan, I bin all elements equal to the first element in one array, and using the result of the second scan, I bin all elements not eual to the first element in the second array.
I repeat the process on second array, until no elements remain.
This solves the binning problem. Classifying elements this way is much faster than full radix sort if there are only a few unique keys.
The number of prefix sum iterations required is equal to twice the number of unique keys present in the array. However, for radix sort, its always equal to the number of bits in the key, which in my case was much larger. Also, the number of bits that need to be prefix summed reduces with each iteration in my algorithm, but stays constant(equal to the number of elements) in radix sort.
Regards,
Debdatta Basu