
bubu
Adept II

sort library?

Is there a high-level OpenCL library to perform a quick sort on a __global array, given a custom less/greater comparison operator?

Thanks

0 Likes
10 Replies
redditisgreat
Journeyman III

There are no such libraries for OpenCL, but there is a radix sort in the ATI Stream SDK examples which you can use.

0 Likes
LeeHowes
Staff

The radix sort (even if it were written for performance) wouldn't be any use because it's based on bit ordering only.

A custom comparison operator sort is an interesting idea. How would you prefer to see such a thing in an API? Maybe the ability to pass a stub of OpenCL comparison code to the library and allow it to compile you the kernel at run-time?

std::string compareFunc("int compare( void *a, void * b ){... do something with casts and stuff ... }");

initSort( compareFunc );
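As a plain-C analogue of what such a library would do after compiling the stub, the comparator contract is essentially the one the standard qsort() already uses (the element type and function names here are hypothetical, just for illustration):

```c
#include <stdlib.h>

/* Hypothetical element type, mirroring the struct used later in this thread. */
typedef struct { float a; } MyStruct;

/* Comparator with the standard qsort() contract: negative, zero, or
   positive depending on the ordering of *pa and *pb. */
static int compare_mystruct(const void *pa, const void *pb)
{
    const MyStruct *l = (const MyStruct *)pa;
    const MyStruct *r = (const MyStruct *)pb;
    return (l->a > r->a) - (l->a < r->a); /* -1, 0, or 1 */
}

/* Sort an array of MyStruct ascending by field a. */
static void sort_mystructs(MyStruct *data, size_t n)
{
    qsort(data, n, sizeof(MyStruct), compare_mystruct);
}
```

The library proposed above would accept that comparator as OpenCL source text instead of a function pointer, since OpenCL C has no function pointers to pass across the host/device boundary.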

Any other thoughts?

0 Likes

Originally posted by: LeeHowes A custom comparison operator sort is an interesting idea. How would you prefer to see such a thing in an API? Maybe the ability to pass a stub of OpenCL comparison code to the library and allow it to compile you the kernel at run-time?

I would like to see it implemented as:

 

typedef struct tMyStruct
{
   float a;
} MyStruct;

int MyCompFunc ( const __global void* a, const __global void* b )
{
    const __global MyStruct *l_pA = (const __global MyStruct*)a;
    const __global MyStruct *l_pB = (const __global MyStruct*)b;
    return ( l_pA->a < l_pB->a );
}

__kernel void MyKernel ( __global MyStruct *data )
{
   sort ( data, 0, sizeof(MyStruct), 256, MyCompFunc ); /* 0==offset, sizeof element, 256==nElements */
}

 

The "sort" built-in function's call could be replaced internally with your implementation-specific code when the kernel is JIT-compiled.

 

I would also like to see the same mechanism used to implement reductions with +, -, * and min/max operations (or, better still, custom ones).

 

Alternatively, you could write a high-level API attached to the OpenCL spec defining some common functions like matrix multiplication, dot products, FFT, etc...

so new host-functions like these could be defined:

 

clXQuickSort ( cl_mem buf, const size_t offset, const size_t sizeOfElement, const size_t nElements, const char* cmpFuncCLCode );

 

The "X" is there to differentiate them from the "core" CL calls. X stands for "extended" or something like that...

0 Likes

Would you expect that sort call to only work on work-group level data? Or even on work-item level data? A global sort couldn't work in this case because there is no point of synchronisation given OpenCL's limited flexibility.

That is to say, would you expect that your 256 item sort be on a region of local data and synchronised above and below the sort call (or implicitly within it) so that all the work-items in the group can access the sorted buffer?

Given the current implementation your example would have to be done with macros, which would be unpleasant. Templates would help, certainly. I think most people would like to see C++ in OpenCL.
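For illustration, the macro route might look something like this in plain C: a sketch only, with hypothetical names, that pastes the comparison expression into a generated sort at preprocessing time (roughly what templates would do more cleanly):

```c
#include <stddef.h>

/* Expand a typed insertion sort with the comparison expression LESS
   pasted in at compile time. Hypothetical macro, for illustration. */
#define DEFINE_SORT(NAME, TYPE, LESS)                          \
    static void NAME(TYPE *d, size_t n)                        \
    {                                                          \
        for (size_t i = 1; i < n; ++i) {                       \
            TYPE key = d[i];                                   \
            size_t j = i;                                      \
            while (j > 0 && LESS(key, d[j - 1])) {             \
                d[j] = d[j - 1];                               \
                --j;                                           \
            }                                                  \
            d[j] = key;                                        \
        }                                                      \
    }

typedef struct { float a; } MyStruct;
#define LESS_BY_A(x, y) ((x).a < (y).a)

/* Instantiate a sort for MyStruct ordered by field a. */
DEFINE_SORT(sort_by_a, MyStruct, LESS_BY_A)
```

Every element type and comparison needs its own macro instantiation, which is exactly the unpleasantness referred to above.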

0 Likes

Originally posted by: LeeHowes Would you expect that sort call to only work on work-group level data? Or even on work-item level data? A global sort couldn't work in this case because there is no point of synchronisation given OpenCL's limited flexibility.


I would say for all of them ( wavefront, group AND global mem ).

I want this for __global memory data inside the same kernel launch ( the complete grid of data ).

Perhaps it would be possible using the "clXQuickSort" method then?

 

 

Templates would help, certainly. I think most people would like to see C++ in OpenCL.

Yep, definitely.

 

0 Likes

I don't know how you could expect that to synchronise. You'd have to end one kernel before the call and initiate another kernel after the call. In a high-level language that would make sense, but OpenCL is almost as low-level as it gets. From the host code you could do it with a passed-in compare string, though obviously we don't have a library that does it currently.

I agree that it would be nice to have, but I think a higher level parallel language would be a nice thing to have in general. I would say that OpenCL has too many restrictions for it to be used in this way without a major redesign, unfortunately. It's just not expressive enough.

0 Likes

Perhaps you could add these two functions:

 

1. sort. A built-in OpenCL language function to sort the workgroup's data, that is, __local memory. Example of use:

__kernel void MyKernel ( ... )
{
    __local MyStruct data[256];
    data[get_local_id(0)] = ...;

    sort ( data, 0, sizeof(MyStruct), 256, MyCmpFunc ); /* this forces a sync mem/exec fence */

    /* ... continue making ops with the sorted data ... */
}

 

This kind of built-in function is only targeted at workgroups.

Same for +, -, *, min/max reductions.
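A host-side C sketch of the reduction idea with a caller-supplied operator (names hypothetical; a real built-in would run the fold in parallel across the work-group rather than serially):

```c
#include <stddef.h>

/* Caller-supplied binary operator, the analogue of the custom
   +, -, *, min/max operations proposed above. */
typedef float (*BinOp)(float, float);

/* Serial fold of an array with the given operator and identity value. */
static float reduce(const float *data, size_t n, float init, BinOp op)
{
    float acc = init;
    for (size_t i = 0; i < n; ++i)
        acc = op(acc, data[i]);
    return acc;
}

static float op_add(float a, float b) { return a + b; }
static float op_max(float a, float b) { return a > b ? a : b; }
```

As with sort, OpenCL C's lack of function pointers means a device-side version would have to take the operator as source text or a macro rather than a pointer.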

 

2. Create a Khronos API called "OpenCL Performance Primitives" and place there high-level mathematical functions/algorithms like qsort, reductions, FFT, SGEMM, etc...

There, put this one:

 

clXQuickSort ( cl_mem bufDataToSort, const size_t offset, const size_t elementSize, const size_t nElements, const char* cmpFuncCode );

 

the "cmpFuncCode" could be something like

 

typedef struct tMyStruct
{
   float a;
} MyStruct;

int MyCompFunc ( const __global void* a, const __global void* b )
{
    const __global MyStruct *l_pA = (const __global MyStruct*)a;
    const __global MyStruct *l_pB = (const __global MyStruct*)b;
    return ( l_pA->a < l_pB->a );
}

 

so, internally, your drivers add the CmpFunc to your predefined code, JIT-compile the kernel, split the data into optimal blocks, call sort kernels, gather results, sync and reorder until all the data is sorted, etc...

 

This QSort (and Reduction) function would be called from the host CPU and targeted at sorting CL __global GPU buffers.

 

Same for reductions.

 

If you don't like the idea to define a high-level API at Khronos perhaps you could implement it as a CL's extension.

 

In my concrete case, I need this to sort a huge database by price, or to find the 10 cheapest articles in a price range ( which currently takes almost 1 second on the CPU! ). If the GPU could sort a __global buffer in less time by calling a simple function, it would be a revolution!
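Incidentally, the 10-cheapest query is a selection problem rather than a full sort: a single pass keeping a small sorted candidate buffer is O(n*k), which beats sorting everything when n is large. A hypothetical CPU-side sketch:

```c
#include <stddef.h>

/* Select the k cheapest prices out of n in one pass, maintaining a
   small ascending buffer of the best candidates seen so far.
   Returns the number of results written, min(n, k). Hypothetical sketch. */
static size_t k_cheapest(const float *prices, size_t n,
                         float *out, size_t k)
{
    size_t filled = 0;
    for (size_t i = 0; i < n; ++i) {
        if (filled == k && prices[i] >= out[k - 1])
            continue; /* not cheaper than the current worst candidate */
        size_t j = (filled < k) ? filled++ : k - 1;
        while (j > 0 && out[j - 1] > prices[i]) { /* insert in order */
            out[j] = out[j - 1];
            --j;
        }
        out[j] = prices[i];
    }
    return filled;
}
```

With k == 10 this may already be far under a second on the CPU for large n, independent of any GPU offload.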

 

Of course, I could do it myself, but you know your internal architecture better than I do... so you can implement these relatively high-level functions like qsort/SGEMM, etc. ( avoiding local memory bank conflicts, controlling the GPU's occupancy, etc. ) much better than I can.

 

0 Likes

Wouldn't a mixed strategy do the job? What I mean is, let's say you want to sort using the field Value of a structure s. Then, to sort a list S[] of s:

- Create a value array and store each s.Value in it O(n);

- Copy to GPU memory O(n);

- Sort in GPU memory, creating a permutation vector p - O(n log n);

- Retrieve the permutation vector p O(n);

- Access S[p] from now on, S[p] being the ordered list.


How about this?
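The strategy above can be sketched in plain C, with qsort() standing in for the GPU sort step (the Item type and names are hypothetical):

```c
#include <stdlib.h>

typedef struct { float value; int payload; } Item; /* hypothetical */

static const float *g_keys; /* keys referenced by the index comparator */

static int cmp_idx(const void *pa, const void *pb)
{
    float ka = g_keys[*(const int *)pa];
    float kb = g_keys[*(const int *)pb];
    return (ka > kb) - (ka < kb);
}

/* Fill p with a permutation such that S[p[0]], S[p[1]], ... is ordered
   by S[i].value. The O(n log n) index sort is the step that would run
   on the GPU; qsort() stands in for it here. */
static void sort_permutation(const Item *S, int *p, size_t n)
{
    float *keys = malloc(n * sizeof *keys); /* extract keys, O(n) */
    for (size_t i = 0; i < n; ++i) {
        keys[i] = S[i].value;
        p[i] = (int)i;
    }
    g_keys = keys;
    qsort(p, n, sizeof *p, cmp_idx); /* sort indices by key */
    free(keys);
}
```

Only the flat key array crosses to the GPU and only the permutation vector comes back, so the struct array itself never moves.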


0 Likes

Sorting data that is not already resident on the GPU memory is a waste of time since PCIe transfer rates are too low.

What kind of data do you want to sort, and how many keys are there?

If you read the keys from a DB you are more likely memory-bandwidth constrained.

 

Regarding the general OCL sorting idea, I propose achieving global synchronisation using atomic operations on global mutexes to synchronize writing to an output array.

I have some ideas; if I find the time I will implement an integer sort that can be generalized later on.
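The coordination pattern being proposed, claiming output slots atomically so writers never collide, can be sketched with C11 atomics (not a sort, just the write-side synchronization; names hypothetical):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Shared output array plus an atomically advanced cursor. */
typedef struct {
    atomic_size_t next; /* next free output slot */
    int *out;
} OutputQueue;

/* Claim the next slot with a fetch-add, then write without locking:
   no two writers can ever receive the same slot index. */
static void emit(OutputQueue *q, int value)
{
    size_t slot = atomic_fetch_add(&q->next, 1);
    q->out[slot] = value;
}
```

The OpenCL equivalent would use atom_inc on a __global counter; note this only makes writes collision-free, it does not by itself order the output.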

0 Likes

I'm trying to adapt the bitonic/radix sort examples. Some questions:

 

1. What should I do to make the radix sort example work with 24-bit indices? Change the RADIX from 8 to 24 and re-compile? Can it not be done from the command line?

 

2. What's faster? The bitonic sort or the radix sort?

 

3. Is the bitonic sort example's kernel using global memory coalescing?

 

0 Likes